A runtime instrumentation stack for measuring perturbation dynamics, characterizing branchpoint geometry, and conducting falsifiable closed-loop control experiments during autoregressive generation — without modifying model weights.
v2 update (2026-04-19). The first version of this paper presented closed-loop stability control as the central contribution. Subsequent experiments falsified the original controller thesis on Qwen3-1.7B: scaling interventions at the final layer have zero effect (absorbed by RMSNorm), additive interventions are either silent no-ops or over-actuating depending on magnitude, and acting earlier in the residual stream cascades destructively. Most fundamentally, the divergence signal that drives the controller was found to measure token-level prose surprise — word-starts, semantic transitions, structural boundaries — rather than dynamical instability in any control-theoretic sense.
What survives is the instrument. The deterministic branchpoint design, the hysteresis protocol, the per-token telemetry, and the intervention engine all work as designed. Building on that instrument, we report a new positive result: a branchpoint hijacking phenomenon in which final-layer additive perturbations can flip individual tokens at predictable trajectory positions, with a within-model classifier achieving held-out AUROC of 0.82–0.86 on Qwen3-1.7B. The mechanism generalizes architecturally; the consequences (whether the resulting trajectory lands in a better or worse basin) are model-specific.
Sections rewritten in v2: §6 (spectral diagnostics, methodology corrected), §9 (controller empirical evaluation, replaced with falsification arc), §10 (experimental results, replaced with the Qwen3-1.7B controller arc), §11 (new — branchpoint hijacking), §12 (new — mapping program), §14 (limitations updated), §15 (conclusion rewritten). Sections preserved largely as written: §3 architecture, §4 SeedCache, §5 divergence signal mechanics, §7 hysteresis protocol, §8 intervention engine, §13 reproducibility infrastructure.
We present observer, an open-source runtime stack for studying perturbation dynamics in autoregressive language models. The system provides four independently usable protocol layers: a three-stage hysteresis protocol for measuring perturbation persistence, a single-pass observability runner with streaming diagnostics, a deterministic intervention engine built around a SeedCache branchpoint design that eliminates common confounds in intervention experiments, and a closed-loop adaptive controller that applies proportional damping in response to a per-token divergence signal. The divergence signal is derived from a VAR(1) model fit on projected hidden states, providing a held-out one-step prediction error.
We use this instrument to falsify our own initial closed-loop stability control hypothesis on Qwen3-1.7B and report what the experiments actually showed: the divergence signal correlates with token-level prose surprise (word-starts, structural boundaries, semantic transitions) rather than dynamical instability in a control-theoretic sense. Closed-loop control over this signal is not effective on the tested model. We then report a positive finding that emerged from the same apparatus: a branchpoint hijacking phenomenon in which final-layer additive perturbations can flip individual tokens at predictable positions, characterized by a within-model classifier achieving AUROC 0.82 on a procedural prompt and 0.86 on a descriptive prompt (Qwen3-1.7B, held-out pair-level evaluation). The classifier's predictive features differ between prompt classes, suggesting prompt-class-dependent trajectory geometry. We describe the architecture, the experimental protocols, the falsification arc, and a mapping program organized around three open questions with explicit stop conditions and controller-return criteria.
The dominant paradigm in mechanistic interpretability — sparse autoencoders, circuit discovery, logit lens analysis — answers the question "what does this model compute?" It is fundamentally a post-hoc analytical approach. The field has produced significant understanding of model internals, but has largely deferred a different class of question:
Can we detect when generation is destabilizing, in real time, and do something about it?
This is the question observer is built to answer. It is closer in spirit to control engineering than to interpretability research: rather than analyzing a system's internal structure, we treat the model as a dynamical system and ask whether we can build a feedback loop around it.
The practical stakes are not abstract. High-stakes deployments of language models — in agentic settings, long-horizon tasks, adversarial environments — require some answer to the question of whether generation has gone off course and whether that course can be corrected. The current state of the art is largely output-level heuristics: does the text look wrong? Observer proposes that the answer should be visible in the hidden trajectory before it surfaces in the output, and that a runtime controller can act on that signal.
Scope caveat (v1, retained for context): Observer was framed as a research instrument. The original paper hedged that the divergence signal measured trajectory instability and that empirical validation of downstream correlates was "the necessary next step."
Update (v2): we performed that validation. The divergence signal does not measure dynamical instability in the control-theoretic sense it was framed as measuring; it measures token-level prose surprise. See §10 for the falsification arc and §11 for the positive finding (branchpoint hijacking) that emerged from the same apparatus.
Observer occupies a space adjacent to several lines of existing work, without directly duplicating any of them.
TransformerLens (Nanda, 2022) provides the dominant toolkit for mechanistic interpretability research: model loading, hook-based activation capture and modification, and a large community of research built on its abstractions. It is an exploration tool — excellent for research notebooks and circuit analysis, not designed around systematic experimental protocols or recovery measurement.
pyvene (Wu et al., 2024) formalizes interventions as first-class serializable primitives, enabling composable intervention specifications across locations, granularity, and sequence position. It is an execution library: it provides the mechanics of intervention without opinions about experimental design, hysteresis, or recovery.
nnsight provides a Pythonic interface for local and remote model execution, including access to frontier models via the NDIF infrastructure. Observer supports nnsight as an optional backend, inheriting its remote execution capabilities.
The Representation Engineering paper (Zou et al., 2023) demonstrated that model behavioral tendencies can be read from and written to activation space via linear probes and steering vectors. The Inference-Time Intervention paper (Li et al., 2023) applied shifted activations at inference time, improving TruthfulQA performance from 32.5% to 65.1%. Neither line of work focused on recovery dynamics or closed-loop feedback.
Recent work on LLM output consistency (Raj et al., 2023; Huang et al., 2023) characterizes stability at the output level — how often does the same model produce the same answer across runs? Observer operates at a different layer: activation-level perturbation dynamics within a single generation, not output-level consistency across generations.
LinEAS (Rodriguez et al., NeurIPS 2025; arXiv:2503.10679) trains activation steering end-to-end with a global distributional loss, showing that locally tuned maps produce unintended downstream shifts when applied out-of-sample. Observer's adaptive controller is designed to detect and respond to such downstream cascades in real time.
FASB (Cheng et al., 2025; arXiv:2508.17621) dynamically determines intervention necessity and strength by tracking internal states during generation, with a backtracking mechanism to correct deviated tokens. Observer shares the adaptive framing but adds deterministic branchpointing and explicit recovery measurement, quantifying whether the trajectory recovered or remained shifted after intervention ended.
Grant et al. (2025; arXiv:2511.04638) provide a theoretical treatment of how causal interventions can push representations off the model's natural manifold, distinguishing benign null-space divergences from pernicious ones that activate dormant pathways. Observer's PLASTIC and DIVERGENT regime classifications can be interpreted through this taxonomy, offering empirical runtime signatures for divergence types their framework characterizes theoretically.
HARP (Hu et al., 2025; arXiv:2509.11536) decomposes hidden state space into semantic and reasoning subspaces via SVD of the unembedding layer, achieving AUROC 92.8% on TriviaQA hallucination detection. Observer's windowed SVD probe tracks effective rank dynamically within a generation rather than using static subspace decomposition for classification, a complementary signal.
HALT (Shapiro, Taneja, and Goel, Feb 2026; arXiv:2602.02888) treats token log-probability sequences as a time series for lightweight hallucination detection without requiring internal model access. Observer's VAR(1) predictor applies a related time-series framing to hidden state trajectories, a white-box signal that feeds an active intervention loop rather than a post-hoc detector.
Observer's distinguishing contribution is the combination of three things none of the above provide together: deterministic branchpointing (identical model state for both branches, eliminating confounds), recovery measurement (quantifying whether the trajectory returns to baseline after intervention ends), and closed-loop control (a token-level controller that detects instability and intervenes in real time). Existing tools execute interventions. Observer measures what they do, whether the model recovers, and acts on that signal continuously.
Observer is organized as four protocol layers, each independently usable and composable:
1. Hysteresis protocol: three-stage protocol (BASE → PERTURB → REASK) for measuring perturbation persistence. Does the model self-correct when re-asked, given that the perturbation remains in the KV cache?
2. Observability runner: single-pass token-level telemetry with streaming diagnostics (VAR(1) divergence predictor, spectral leakage metrics, layer stiffness, windowed SVD). No branching, no intervention.
3. Intervention engine: baseline vs. intervention comparison via SeedCache branchpoint. Both branches start from identical model state. Supports additive, projection, scaling, and SAE-based interventions.
4. Adaptive controller: proportional controller with moving-average smoother and cooldown. A composite divergence score drives hidden-state scaling in real time. Shadow mode supports calibration before active deployment.
```
PROMPT
   │
   ▼
[ SeedCache: build_seed_cache() ]
   │  past_key_values snapshot
   │  next_token_logits
   │  seed_hidden @ intervention_layer
   │
   ├──────────────────────────┐
   ▼                          ▼
[ BASELINE branch ]    [ INTERVENTION branch ]
  SeedCache.clone()      SeedCache.clone()
  greedy generation      hook active: intervene()
  trajectory captured    trajectory captured
   │                          │
   └──────────┬───────────────┘
              ▼
   [ TrajectoryComparison ]
     cosine distance per token
     JS divergence on logits
     regime classification
     recovery metrics
```
The central design problem in intervention experiments is confounding. A naive implementation runs the baseline and intervention branches from separate forward passes over the same prompt. This introduces at minimum: different random number generator states at the point of token sampling (even under greedy decoding, CUDA operations can have ordering nondeterminism), and potentially different attention mask states depending on the batching implementation.
The SeedCache resolves this by running the prompt exactly once, then cloning the resulting model state for both branches:
```python
# cache.py
import torch

# Run the prompt once, snapshot the pre-generation state.
def build_seed_cache(model, tokenizer, device, prompt, layer) -> SeedCache:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    layers = model.model.layers  # Llama/Qwen-style layer list (see §14 for other families)
    hook = _HiddenCaptureHook()
    handle = layers[layer].register_forward_hook(hook)
    try:
        with torch.no_grad():
            outputs = model(input_ids, use_cache=True, return_dict=True)
    finally:
        handle.remove()
    return SeedCache(
        past_key_values=outputs.past_key_values,     # full KV cache
        next_token_logits=outputs.logits[:, -1, :],  # first-token distribution
        seed_hidden=hook.captured,                   # hidden state @ layer
        fingerprint=compute_cache_fingerprint(...),  # checksum
    )

# Both branches start from identical state.
baseline_cache = seed_cache.clone()
intervention_cache = seed_cache.clone()
# SeedCache.clone() deep-copies past_key_values via clone_past_key_values(),
# which handles DynamicCache, legacy tuple-of-tuples, and generic objects.
```
The fingerprint — derived from the first-layer key cache statistics — provides a checksum that experiments can log to verify both branches genuinely share a common origin. This is the kind of rigor that most published intervention papers treat as an implementation detail but actually matters for result validity.
Why this matters: Without a shared branchpoint, "recovery" measurements conflate genuine behavioral change with noise introduced by divergent initial conditions. The SeedCache makes the comparison meaningful.
The core signal feeding both the observability runner and the adaptive controller is a per-token held-out prediction error from a VAR(1) model fit on a sliding window of projected hidden states.
The hidden state ht ∈ ℝD (where D is the model's hidden dimension, typically 4096–8192) is projected to a fixed low-dimensional space via a deterministic Rademacher matrix R ∈ ℝk×D with i.i.d. ±1 entries: zt = R ht / √k.
The Rademacher projection preserves inner products in expectation (Johnson-Lindenstrauss), reduces the regression problem from D-dimensional to k-dimensional (k=64), and is computed once per hidden dimension via a seeded RNG — making it reproducible across runs and comparable across model families with different hidden sizes.
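A minimal sketch of such a seeded projection follows; the function name and seeding rule are illustrative rather than the repository's exact code, and the matrix is stored as the transpose of R in the equation above.

```python
import torch

def rademacher_projection(hidden_dim: int, k: int = 64, seed: int = 0) -> torch.Tensor:
    # Deterministic ±1 entries, seeded from the hidden dimension so the map is
    # reproducible across runs and comparable across model families.
    gen = torch.Generator().manual_seed(seed + hidden_dim)
    signs = torch.randint(0, 2, (hidden_dim, k), generator=gen).float()
    return (2.0 * signs - 1.0) / k ** 0.5  # 1/sqrt(k) scaling for the JL guarantee

R = rademacher_projection(2048)  # e.g. Qwen3-1.7B's hidden size
z = torch.randn(2048) @ R        # (D,) -> (64,)
```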
A first-order vector autoregressive model zt ≈ zt−1 A is fit on the sliding window W = {zt−n, ..., zt−1} via ridge regression: Â = (XᵀX + λI)⁻¹XᵀY, where the rows of X are the window states zt−n, ..., zt−2, the rows of Y are their successors zt−n+1, ..., zt−1, and λ = 0.01.
Critically, the matrix A is fit on the window excluding the newest state zt. The prediction ẑt = zt-1 · A is then compared to the actual observed zt. This is a held-out evaluation: the model is never trained on the transition it is asked to predict. This matters because in-sample VAR(1) error on a short window would collapse toward zero regardless of actual trajectory instability.
The per-token scalar divergence combines a normalized L2 error, ||zt − ẑt|| / (||zt|| + ||ẑt|| + ε), with the cosine distance 1 − cos(zt, ẑt); the symmetric denominator avoids blow-ups when projected norms are near zero.
When the hidden trajectory is locally predictable, the VAR(1) fit is good and divergence is low. When generation dynamics shift — through perturbation, distributional shift in the prompt context, or internal instability — the held-out prediction error increases. The signal is cheap: one matrix multiply per token in 64-dimensional space.
```python
# predictor.py
def step(self, hidden: torch.Tensor) -> float:
    z = self._project(hidden)             # (D,) → (64,)
    self._window.add(z)                   # FIFO buffer, maxlen=8
    if len(self._window) < 3:
        return 0.0
    states = self._window.matrix()        # (T, 64)
    train = states[:-1, :]                # exclude newest
    A = _fit_var1_ridge(train)            # fit on T-1 transitions
    pred = states[-2, :] @ A              # predict from t-1
    actual = states[-1, :]                # held-out: actual t
    return _divergence(pred, actual)["combined"]
```
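The ridge fit referenced above as _fit_var1_ridge has a standard closed form; a minimal sketch, assuming λ = 0.01 as reported in §14:

```python
import torch

def _fit_var1_ridge(states: torch.Tensor, lam: float = 0.01) -> torch.Tensor:
    # states: (T-1, k) window with the newest state excluded.
    # Solve z_{i+1} ≈ z_i @ A in closed form: A = (XᵀX + λI)⁻¹XᵀY.
    X, Y = states[:-1, :], states[1:, :]
    k = X.shape[1]
    return torch.linalg.solve(X.T @ X + lam * torch.eye(k), X.T @ Y)
```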
The divergence signal is the primary input to the controller, but the V1.5 observability runner and the adaptive controller also compute three supplementary diagnostics that provide corroborating signal and richer telemetry for offline analysis.
v2 correction. The original spectral module FFT'd the flattened hidden-state vector along the feature-index axis and reported entropy, flatness, centroid, and band fractions over that spectrum. The v1 paper acknowledged that "the feature index is not a temporal axis" but defended the metrics as a stable characterization of activation energy distribution. This defense does not survive a permutation test: neuron ordering in transformer hidden states is arbitrary (a function of weight initialization, not semantics), and any neuron-axis FFT summary is a function of that arbitrary ordering. Permuting neurons changes every reported metric; the underlying activation is unchanged.
The v2 implementation rewrites this module as a token-time spectral probe: hidden states are accumulated into a sliding window of shape [T, D] and the FFT is taken along the time axis (dim=0). Per-frequency power is then averaged across the D dimensions, producing a scalar trajectory spectrum. This captures real structure — slow drift vs. high-frequency oscillation in activation patterns across generation steps — and is invariant to neuron permutation. A built-in self-test reports a non-zero permutation-change ratio whenever the window has at least 8 tokens, confirming the time axis is in fact what's being analyzed.
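A sketch of the corrected probe and its self-test; function names and the exact change-ratio definition are illustrative, not the repository's code.

```python
import torch

def trajectory_spectrum(window: torch.Tensor) -> torch.Tensor:
    # window: (T, D) hidden states across generation steps.
    # FFT along the *time* axis, then average per-frequency power across features.
    power = torch.fft.rfft(window, dim=0).abs() ** 2  # (T//2 + 1, D)
    return power.mean(dim=1)                          # scalar trajectory spectrum

def permutation_change(window: torch.Tensor, seed: int = 0) -> float:
    # Self-test: a neuron-axis summary would be unchanged by permuting time steps;
    # a genuine time-axis spectrum is not, so this ratio should be > 0.
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(window.shape[0], generator=gen)
    s, s_perm = trajectory_spectrum(window), trajectory_spectrum(window[perm])
    return ((s - s_perm).norm() / (s.norm() + 1e-8)).item()
```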
The corrected metrics, computed on the time-axis trajectory spectrum, are:
| Metric | Description |
|---|---|
| `spectral_entropy` | Normalized Shannon entropy of the time-axis power spectrum. High = energy spread across slow and fast trajectory frequencies. |
| `spectral_flatness` | Geometric mean / arithmetic mean of power. Approaches 1.0 for white-noise trajectories, 0.0 for tonally pure ones. |
| `centroid` | Normalized frequency centroid ∈ [0,1]. High centroid = trajectory dominated by step-to-step oscillation rather than slow drift. |
| `high_frac` | Fraction of power in the upper 20% of trajectory frequencies. |
| `rolloff_85` | Normalized frequency below which 85% of cumulative power falls. |
| `permutation_change` | (new in v2) Self-test ratio comparing the spectrum of the actual trajectory to the spectrum of a randomly time-permuted version of the same window. Should be > 0 — confirms time-axis behavior. If a future regression makes this near zero we know the module has reverted to neuron-axis behavior. |
Empirically, permutation_change turned out to be the strongest single feature for predicting branchpoint flippability on Qwen3-1.7B in §11.5 — a feature that was conceived as a methodology self-test ended up carrying real signal about trajectory geometry.
A window of hidden vectors {ht-w, ..., ht} ∈ ℝW×D is stacked into a matrix X and its singular value decomposition computed via the Gram trick: eigenvalues of XXᵀ (a W×W matrix with small W) yield the squared singular values without requiring the full D×D computation. An SVD of a single vector returns only the vector norm — uninformative. The windowed approach captures the local rank structure of the trajectory: whether the model is moving through a low-dimensional manifold or exploring higher-dimensional space.
Effective rank is computed as exp(H(p)) where p is the normalized singular value distribution — the exponential of the entropy of squared singular values. A drop in effective rank signals that the trajectory is collapsing onto a lower-dimensional subspace, a potential precursor to repetition or mode collapse.
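A sketch of the Gram-trick effective rank, assuming the normalized distribution is taken over squared singular values as described above:

```python
import torch

def effective_rank(X: torch.Tensor, eps: float = 1e-12) -> float:
    # X: (W, D) window of hidden states with W << D.
    # Gram trick: eigenvalues of X Xᵀ (W×W) are the squared singular values of X.
    sq_sv = torch.linalg.eigvalsh(X @ X.T).clamp_min(0.0)
    p = sq_sv / (sq_sv.sum() + eps)      # normalized spectrum
    H = -(p * (p + eps).log()).sum()     # Shannon entropy of the spectrum
    return H.exp().item()                # exp(H) = effective rank
```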
At three probed layers (early / mid / late), the velocity norm vt = ||htL − ht-1L||2 is tracked over a sliding window. Mean velocity defines stiffness; the linear slope of velocity over the window defines stiffness trend. Elasticity = 1/(1 + stiffness) provides a bounded stability score in (0,1]. This is a diagnostic proxy, not a physical quantity.
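A sketch of a per-layer stiffness probe under these definitions; the class name and window size are illustrative.

```python
import torch
from collections import deque

class StiffnessProbe:
    # Velocity-based stiffness at one probed layer; a diagnostic proxy only.
    def __init__(self, window: int = 8):
        self.prev, self.vel = None, deque(maxlen=window)

    def step(self, h: torch.Tensor) -> dict:
        if self.prev is not None:
            self.vel.append((h - self.prev).norm().item())  # v_t = ||h_t - h_{t-1}||
        self.prev = h.detach().clone()
        if len(self.vel) < 2:
            return {"stiffness": 0.0, "trend": 0.0, "elasticity": 1.0}
        v = torch.tensor(list(self.vel))
        t = torch.arange(len(v), dtype=torch.float32)
        stiffness = v.mean().item()                         # mean velocity
        slope = (((t - t.mean()) * (v - v.mean())).sum()
                 / ((t - t.mean()).pow(2).sum())).item()    # least-squares slope
        return {"stiffness": stiffness, "trend": slope,
                "elasticity": 1.0 / (1.0 + stiffness)}      # bounded in (0, 1]
```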
The baseline hysteresis module implements a three-stage experimental protocol for measuring how much of a perturbation's effect persists after the perturbation is removed.
```
Stage 1: BASE
─────────────────────────────────────────────────────
Prompt → SeedCache → greedy generation
Capture: hidden_norm, entropy, logit_norm, SVD spectrum

Stage 2: PERTURB
─────────────────────────────────────────────────────
Same SeedCache + Delta instruction injected
Capture same statistics
KV cache retained for Stage 3

Stage 3: REASK
─────────────────────────────────────────────────────
Continue from PERTURB's KV cache
Minimal re-ask (no repeated prompt)
Perturbation still in context; does model return to BASE?

Metrics:
D = composite distance(BASE, PERTURB)   ← drift
H = composite distance(BASE, REASK)     ← hysteresis
R = 1 - H / (D + ε)                     ← recovery ∈ (-∞, 1]
```
The composite distance used in the metric computation draws on four component signals: relative hidden norm difference, entropy distance, relative logit norm difference, and SVD spectral distance (normalized L2 between top singular value vectors). These are combined with equal weights except logit norm (0.5×), reflecting that hidden state geometry carries more signal than logit magnitude.
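A sketch of the composite distance and the D/H/R metrics as defined above; the per-term normalization and helper names are assumptions, not the repository's exact code.

```python
import numpy as np

def rel_diff(x: float, y: float, eps: float = 1e-8) -> float:
    return abs(x - y) / (abs(x) + eps)

def svd_spectral_distance(s1: np.ndarray, s2: np.ndarray, eps: float = 1e-8) -> float:
    # Normalized L2 between top singular value vectors.
    return float(np.linalg.norm(s1 - s2) / (np.linalg.norm(s1) + eps))

def composite_distance(a: dict, b: dict) -> float:
    # Four component signals, equal weights except logit norm (0.5x).
    terms = [
        (1.0, rel_diff(a["hidden_norm"], b["hidden_norm"])),
        (1.0, abs(a["entropy"] - b["entropy"])),
        (0.5, rel_diff(a["logit_norm"], b["logit_norm"])),
        (1.0, svd_spectral_distance(a["svd"], b["svd"])),
    ]
    return sum(w * d for w, d in terms) / sum(w for w, _ in terms)

def hysteresis_metrics(base: dict, perturb: dict, reask: dict, eps: float = 1e-8) -> dict:
    D = composite_distance(base, perturb)   # drift
    H = composite_distance(base, reask)     # hysteresis
    return {"D": D, "H": H, "R": 1.0 - H / (D + eps)}   # recovery R ∈ (-∞, 1]
```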
Recovery R is classified into four regimes:
- R > 0.8 (ELASTIC): Model substantially returns to baseline behavior despite the perturbation remaining in context.
- 0.4 < R ≤ 0.8: Partial recovery; residual perturbation effect visible in trajectory statistics.
- 0 ≤ R ≤ 0.4 (PLASTIC): Perturbation effect persists significantly. Model has been durably steered.
- R < 0 (DIVERGENT): REASK is further from BASE than PERTURB was. Perturbation has amplified rather than decayed.
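Expressed as a classifier; "PARTIAL" is our placeholder label for the 0.4–0.8 band, since the source names only the elastic, plastic, and divergent regimes explicitly.

```python
def classify_regime(R: float) -> str:
    if R > 0.8:
        return "ELASTIC"
    if R > 0.4:
        return "PARTIAL"   # placeholder label for the partial-recovery band
    if R >= 0.0:
        return "PLASTIC"
    return "DIVERGENT"
```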
This taxonomy provides vocabulary for characterizing perturbation experiments that the field currently lacks. Whether a given prompt-perturbation pair produces elastic, plastic, or divergent behavior is a property of the model that is currently unknown for most practically relevant perturbation types.
The intervention engine is the core experimental workhorse. It runs baseline and intervention branches from a shared SeedCache, captures full hidden trajectories from both, and computes a rich set of comparison metrics.
| Type | Operation | Parameters |
|---|---|---|
| additive | Add a unit random vector scaled by magnitude to last-token hidden state. | magnitude, seed |
| projection | Project out a random k-dimensional subspace: h ← h(I − QQᵀ) | subspace_dim, seed |
| scaling | Multiply last-token hidden state by scalar s. | scale |
| sae | Steer along SAE decoder column for a specified feature index. | sae_repo, feature_idx, strength |
Hooks are registered with register_forward_hook and removed in finally blocks. Critically, the intervention is applied before the hook captures the hidden state — so the captured tensor reflects what downstream layers actually receive, not the pre-intervention value. This is the correct ordering that many published implementations miss.
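A sketch of a hook honoring that ordering; the hook internals are illustrative, not the repository's exact implementation.

```python
import torch

class InterventionHook:
    # Forward hook: modify the layer output *then* record it, so the captured
    # tensor is what downstream layers actually receive.
    def __init__(self, intervene):
        self.intervene = intervene   # callable: (B, D) last-token hidden -> (B, D)
        self.active = False
        self.captured = []

    def __call__(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if self.active:
            hidden = hidden.clone()
            hidden[:, -1, :] = self.intervene(hidden[:, -1, :])  # last token only
        self.captured.append(hidden[:, -1, :].detach().cpu())    # capture post-intervention
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
```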
The TrajectoryComparison object implements a layered fallback strategy for computing per-token distances between branches. The primary metric is cosine distance on actual hidden vectors (preferred). If hooks fail to attach and hidden vectors are unavailable, it falls back to Jensen-Shannon divergence on the logit distributions. If logits are also unavailable, it falls back to normalized L2 on hidden norms. The code documents this explicitly: "hidden_norm alone is not sufficient — the same norm can hide large vector changes."
Recovery is computed over the post-intervention window: deviation_during (mean primary metric during active intervention), final_distance (primary metric at final token), recovery_ratio = (deviation_during − final_distance) / deviation_during, and convergence_rate (negative slope of primary metric over post-intervention tokens via linear fit).
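A sketch of these recovery metrics; the array conventions are assumed, and an ε guard against zero deviation is our addition.

```python
import numpy as np

def intervention_recovery(primary: np.ndarray, intervention_end: int,
                          eps: float = 1e-8) -> dict:
    during = primary[:intervention_end]   # per-token primary metric while active
    post = primary[intervention_end:]     # post-intervention window
    deviation_during = float(during.mean())
    final_distance = float(primary[-1])
    slope = float(np.polyfit(np.arange(len(post)), post, 1)[0]) if len(post) >= 2 else 0.0
    return {
        "deviation_during": deviation_during,
        "final_distance": final_distance,
        "recovery_ratio": (deviation_during - final_distance) / (deviation_during + eps),
        "convergence_rate": -slope,       # negative slope via linear fit
    }
```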
The adaptive controller closes the loop: per-token diagnostics drive a proportional scaling intervention that damps the hidden state when the composite score exceeds a threshold.
v2 status. The architecture in this section is unchanged from v1, and the implementation runs as described. What changed is the empirical story. The controller is no longer presented as a working component with interesting attractor-selection behavior; it is presented as a falsifiable hypothesis that we falsified. §10 reports the failure modes (silent no-op at L=−1 with scaling, over-actuation at higher additive magnitudes, destructive cascade when acting earlier in the stack). The controller code remains in the repository as research scaffolding for a future redesign — the criteria under which controller research would resume are listed in §11.7.
The spectral and SVD terms are gated — they only contribute when they exceed a baseline level (spectral entropy above 0.75, high-frequency fraction above 0.30), to avoid penalizing normal variation. The rank delta term detects sudden changes in trajectory dimensionality.
A 3-token moving average of the score is computed. When the smoothed score exceeds a threshold, the controller applies a scaling intervention to the last-token hidden state at the monitored layer, then enters a cooldown period during which the scale is held and further threshold evaluations are suppressed:
| Status | Condition | Scale Applied | Cooldown |
|---|---|---|---|
| STABLE | avg_score ≤ 0.55 | 1.0 (no intervention) | — |
| WARNING | 0.55 < avg_score ≤ 0.85 | 0.90 | 3 tokens |
| CRITICAL | avg_score > 0.85 | 0.75 | 6 tokens |
| COOLDOWN | Post-intervention hold | Held from trigger | Counting down |
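A sketch of the trigger logic implied by the table; the class name is illustrative.

```python
from collections import deque

class ProportionalController:
    def __init__(self):
        self.scores = deque(maxlen=3)   # 3-token moving average
        self.cooldown = 0
        self.scale = 1.0

    def step(self, score: float) -> float:
        self.scores.append(score)
        if self.cooldown > 0:
            self.cooldown -= 1          # hold scale, suppress threshold evaluation
            return self.scale
        avg = sum(self.scores) / len(self.scores)
        if avg > 0.85:
            self.scale, self.cooldown = 0.75, 6   # CRITICAL
        elif avg > 0.55:
            self.scale, self.cooldown = 0.90, 3   # WARNING
        else:
            self.scale = 1.0                      # STABLE
        return self.scale
```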
The scaling intervention multiplies the hidden state: ht ← s · ht. This reduces the magnitude of the current representation, which typically reduces the entropy of the downstream logit distribution and pulls the model toward its modal behavior. The mechanism is simple and its effects are legible — a deliberate choice given that the controller is a research instrument, not a production component.
When --shadow is set, the controller observes and logs its decisions but does not apply the scaling hook. This allows calibration of threshold and weight parameters on a given model and prompt distribution before active deployment. The separation of observation and actuation is explicit in the code: if (scale_used < 1.0) and (not shadow): hook.set_active(True).
v2 replacement. The v1 §10 reported two experiment families on Qwen2.5-7B that, in retrospect, lacked controls we now know are required: per-step intervention-applied counts, decoupled measure/act layers, and pair-level shadow/active comparison. The v1 headline ("controller aggressiveness determines which attractor the model lands in") was reported as a controller property but is consistent with two alternative explanations we could not rule out at the time: Qwen2.5-specific scaling behavior, and prompt-seed-specific basin topology independent of the controller. The v1 results are preserved in the repository's RESEARCH_CONTROLLER.md archive for historical context.
In v2 we ran a longer arc on Qwen3-1.7B (28 layers) with explicit controls. We report it here in the order the experiments happened, including the negative results, because the negative results determine the structure of the rest of the paper. The model used throughout this section is Qwen3-1.7B; generalization to other architectures is treated as out of scope for the current mapping program (one cross-model scope-check is reported in §11.5).
On the prompt "Write step-by-step instructions for baking sourdough bread.", the closed-loop controller at its v1 design defaults (act_layer=−1, intervention_type=scaling, scale_warn=0.90, scale_crit=0.75) produced essentially the same per-token divergence as shadow mode: avg_raw_div 0.765 (shadow) vs 0.768 (active), warning counts 8 vs 7, critical counts 1 vs 1. With 5 seeds × 1 prompt the difference is well within seed variance. This was the first signal that something in the pipeline did not work as advertised.
A diagnostic stress run isolated the cause. With scale=0.5 (halving the final-layer hidden state) we measured logit_kl_mean_during = 0.0000 across 5 seeds. With scale=2.0 (doubling), also 0.0000. token_match_rate = 1.000 in both cases. The scaling intervention was a true no-op at L=−1.
The mechanism is structural: Qwen3 places an RMSNorm between the last transformer block and the LM head. Scaling the input to that norm by any constant factor is erased — the norm rescales to unit variance, the LM head sees an essentially identical input, and the argmax is unchanged. The closed loop in F4 was firing the controller (the trigger was active), the scaling intervention was applied, and the intervention had no downstream effect. The closed loop was open at the actuator.
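The absorption is easy to demonstrate with a weightless RMSNorm; the learned per-channel gain in the real layer does not change the scale invariance.

```python
import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)

h = torch.randn(2048)
# Any constant rescaling of the input is erased, up to the eps term.
print(torch.allclose(rms_norm(h), rms_norm(0.5 * h), atol=1e-5))  # True
print(torch.allclose(rms_norm(h), rms_norm(2.0 * h), atol=1e-5))  # True
```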
A 4-way intervention-type comparison at the same layer (additive, scaling@0.5, scaling@2.0, projection-onto-64-dim-subspace) clarified what does work. Additive perturbation with relative magnitude 1.0 produced logit_kl_mean_during = 10.40 ± 2.78 across 5 seeds (DSR = 3.73), with token_match_rate = 0.145 — that is, 85% of generated tokens differed from baseline. Projection produced larger logit shifts but always landed in a runaway regime. Scaling at any magnitude produced exactly zero. The conclusion was that at L=−1 on Qwen3-1.7B, additive perturbation is the only intervention class that reliably reaches the LM head decision distribution.
Returning to the F4 setup with additive replacing scaling and a random seeded direction: shadow vs. active avg_raw_div = 0.6732 vs 0.6728 (Δ = +0.0004 across 5 seeds). Per-seed, the active runs split — some flipped a few tokens and improved divergence, others flipped tokens and worsened it. The expected value of a random direction was zero, and that is what the data showed. This was the first direct evidence that the controller's actuator was now reaching the model (token_match_rate ≠ 1.000 on most seeds) but its effect was undirected.
Two follow-up experiments replaced the random additive direction with a drift-opposing one. The implementation maintains a reference hidden state (either an EMA of recent hidden states or a frozen anchor from the first N clean tokens), computes drift = h_current − h_reference at each step, normalizes, and injects −β·drift_direction as the corrective additive delta. The intuition matches a textbook proportional controller pulling toward a setpoint.
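A sketch of the EMA-reference variant; β and the EMA rate here are illustrative values (the experiments swept magnitudes in {0.3, 0.6, 0.8, 1.2}).

```python
import torch

class DriftOpposingDelta:
    # EMA-reference variant: injects -beta * unit(h_current - h_reference).
    # The anchor variant instead freezes the reference after the first N clean tokens.
    def __init__(self, beta: float = 0.6, ema: float = 0.9):
        self.beta, self.ema, self.ref = beta, ema, None

    def __call__(self, h: torch.Tensor) -> torch.Tensor:
        if self.ref is None:
            self.ref = h.detach().clone()
            return h
        drift = h - self.ref                        # drift = h_current - h_reference
        direction = drift / (drift.norm() + 1e-8)   # normalize
        self.ref = self.ema * self.ref + (1.0 - self.ema) * h.detach()
        return h - self.beta * direction            # corrective additive delta
```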
EMA reference (F23): avg_raw_div 0.6547 vs shadow 0.6732 — a 2.74% improvement, 0.106 shadow-stdevs. Anchor reference (F24): 0.6537, an additional 0.001 improvement. Both improvements are concentrated in a single seed where the controller's first intervention coincided with a token-flip that diverted the trajectory into a more coherent basin (the F25 phenomenon, characterized below). Other seeds were unchanged or slightly worse.
Per-step trace analysis of F23/F24 revealed that the small aggregate improvements were not closed-loop control. On one of the F24 seeds, the controller's first intervention at step 7 flipped a single token (a leading whitespace became "Use"), and the subsequent generation entered an entirely different output basin — a coherent recipe with explicit ingredients ("100g flour, 100g water, 100g…") in place of shadow's degenerate numbered-list stub ("1. Prepare 2. Mix 3. Let…"). On other seeds with already-coherent baselines, the controller fired repeatedly, flipped no tokens, and active output was character-identical to shadow.
This pattern is reproducible: small final-layer additive perturbations can flip individual tokens at branchpoints where the LM head's top-2 logit margin is small, and the resulting trajectory enters a different attractor. The visible "controller helps" effect is one such hijack landing in a better basin. The visible "controller does nothing" effect is the controller firing in regions where its perturbation is smaller than the local logit margin. The controller is not stabilizing trajectories. It is occasionally redirecting them at branchpoints, and whether the new basin is better or worse is a property of the basin, not the controller.
A first attempt to fix F25 by moving the actuation layer one step back (act_layer = −2) produced what looked like a clean 1.7σ improvement on aggregate avg_raw_div. Inspection of intervention_applied counts in events.jsonl showed the controller fired exactly once across 5 active runs. The "improvement" was entirely an artifact of moving measure_layer from −1 to −2 in the same step (the original implementation required them to be equal). Divergence at L=−2 is naturally lower than at L=−1, and the comparison was apples-to-oranges. The controller was a spectator.
Resolving this required a small code change: decouple measure_layer from act_layer in the runtime engine, with separate forward hooks for measurement (capture-only) and actuation (modify). Once decoupled, the layer-move hypothesis could be tested honestly.
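The decoupled registration is small but load-bearing; the names below are illustrative.

```python
# Capture-only hook at the measurement layer; modifying hook at the actuation layer.
measure_handle = layers[measure_layer].register_forward_hook(capture_hook)  # read-only
act_handle = layers[act_layer].register_forward_hook(intervention_hook)     # rewrites output
```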
With measure_layer=−1 held fixed (identical signal to shadow) and act_layer ∈ {−1, −2, −3}, on the same 5-seed sourdough suite:
| Cell | avg_raw_div | Δ vs shadow | interventions fired |
|---|---|---|---|
| SHADOW | 0.6732 ± 0.174 | — | 0 |
| ACT_L−1 | 0.6537 | +0.020 (+0.11σ) | 19 |
| ACT_L−2 | 0.7057 | −0.033 (−0.19σ) | 22 |
| ACT_L−3 | 0.6918 | −0.019 (−0.11σ) | 25 |
Acting one layer back makes things worse, not better. Acting two layers back is also worse. The controller fires at similar rates in all three configurations (~20 interventions per 5-seed suite), so this is not an "intervention never fires" artifact. The interpretation is that perturbations earlier in the residual stream cascade through subsequent attention and MLP layers, accumulating drift rather than steering it. This closes out the layer-placement direction of controller redesign.
The cumulative result of the controller arc. Across nine findings (F4, F17, F18, F21, F22, F23, F24, F26, F27), every controller variant either (a) does nothing because the actuator is absorbed by an intervening norm layer, (b) does nothing because the perturbation is smaller than local logit margins, (c) opportunistically hijacks one branchpoint per seed and otherwise does nothing, or (d) is destabilizing rather than stabilizing.
The simplest explanation that fits all of these is that the divergence signal we are measuring does not measure what closed-loop stability control needs it to measure. Inspection of the highest-divergence steps in observe runs (replicated from v1's "divergence spikes at structural boundaries" finding, which we confirm) shows that the signal spikes at word-starts, punctuation, the transitions between numbered list items, and the boundaries between semantic units. These are normal features of well-formed prose, not symptoms of trajectory destabilization. Closed-loop intervention on this signal is therefore a controller fighting prose structure.
The instrument is sound. The trigger signal is not what it was framed as. The naive controller redesign space (varying intervention class, magnitude, direction, layer, reference rule) is exhausted within the observe-run regime we can support on commodity hardware.
The controller arc was a falsified hypothesis. The same apparatus produced an unfalsified mechanism worth reporting. We call it branchpoint hijacking: additive perturbations applied to the final transformer layer can flip individual tokens at predictable positions in the generation sequence. The mechanism reaches the LM head (unlike scaling, which is absorbed by RMSNorm), it is reproducible across seeds, and it generalizes architecturally. Whether a flip improves or degrades the resulting output is model- and prompt-specific — that is the consequence-side limit, not a flaw in the mechanism.
A small number of additive perturbations applied during a stress run can produce one or more single-token flips at positions where the LM head's argmax is margin-vulnerable. Once a single token has been flipped, subsequent tokens are drawn from a different conditional distribution, and the trajectory enters a different attractor with its own local dynamics. The visible footprint of a successful hijack on the F23/F24 sourdough runs is one token of difference between shadow and active output, followed by a continuation that is structurally and semantically distinct.
A scope check on TinyLlama-1.1B (Llama-family architecture, vs. the Qwen family for Qwen3-1.7B) using the same configuration confirmed that the flip mechanism generalizes. On 3 of 5 seeds with non-degenerate generation, the active cell flipped tokens vs. shadow with the controller firing 9–10 times per seed. However, on TinyLlama every hijacked seed landed in a worse basin: avg_raw_div was higher in active than shadow on all three (e.g., seed 2: 1.006 → 1.066, output became "Pleaseincludeingredients,measurements,bakingtime…" with whitespace tokens dropped). This contrasts with Qwen3 sourdough seed 2, where the same mechanism landed the trajectory in a coherent recipe basin.
The conclusion: the perturbation-induced branchpoint flip is architecture-general; whether the new basin is better or worse is a property of the model's basin structure for that prompt, not a property of the perturbation. F25 was effectively two claims in one. The mechanism replicates; the consequence does not.
A natural follow-up question: given a step in clean (unperturbed) generation, can we predict whether a small perturbation at that step would flip the next token? We address this offline using existing control-mode runs. For each pair of matched shadow/active runs that share (model, prompt, seed, max_tokens, temperature, measure_layer, act_layer) and where the active run actually fires the controller, we construct per-step training rows: features extracted from the shadow trajectory at step t, label 1 if the active and shadow tokens at step t differ.
Critical methodology note: features must come from the shadow trajectory. An earlier analyzer pulled features from the active trajectory and reported AUROC = 0.80, but the active trajectory contains intervention-downstream fields (intervention_applied, scale_used, controller_drift_norm) that are tautologically correlated with the flip label — non-zero only on steps where the controller fired. The 0.80 was a data leak, not a predictor.
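A sketch of the corrected row construction; the run-object accessors are illustrative.

```python
def build_rows(shadow_run, active_run) -> list[dict]:
    # Features come from the SHADOW trajectory only, so no intervention-downstream
    # field can leak into the predictor; the label compares tokens across branches.
    rows = []
    for t, (s_tok, a_tok) in enumerate(zip(shadow_run.tokens, active_run.tokens)):
        rows.append({
            "features": shadow_run.features_at(t),  # spectral, stiffness, svd, step_idx
            "label": int(s_tok != a_tok),           # 1 = active flipped this token
        })
    return rows
```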
With shadow-trajectory features only, on Qwen3-1.7B with pair-level train/test split (80/20) and matched configurations:
| Prompt | Pairs | Step rows | Held-out AUROC |
|---|---|---|---|
| Sourdough (procedural) | 12 | 576 | 0.82 |
| Water cycle (descriptive) | 5 | 240 | 0.86 |
Both prompts independently clear an AUROC threshold of 0.80 within Qwen3-1.7B. The features that drive the prediction differ between prompt classes:
| Feature | Sourdough sign | Water cycle sign | Mechanistic interpretation |
|---|---|---|---|
| `step_idx` | + (later) | + (later, AUROC 0.80 alone) | Baseline divergence accumulates with sequence length; later tokens have more chances to be at margin. |
| `spectral.permutation_change` | + (high) | − (low) | Time-axis spectral structure; sign flips between procedural and descriptive prompts. |
| `layer_stiffness.−1.elasticity` | + (low velocity) | + | Slow-moving final-layer activations are at branchpoints more often. |
| `svd.top1_energy_frac` | + (concentrated) | — | Trajectory concentrated in dominant direction — local instability. |
| `spectral.total_power` | − (high) | − | High-energy steps are less flippable. |
The sign flip on spectral.permutation_change between procedural and descriptive generation is the most interesting line item. It suggests Qwen3-1.7B drives different trajectory geometries for different prompt classes, and the geometric signature of a "branchpoint" depends on which class. The only feature with a consistent sign across both prompts is step_idx — generation position — which is the universal-but-weak predictor.
The positive finding, stated plainly. Within Qwen3-1.7B, given a step of clean generation, we can predict at AUROC ≥ 0.80 whether a small additive perturbation at that step would flip the next token. The features driving the prediction are geometric properties of the unperturbed trajectory at that step, and those features are prompt-class-dependent. The mechanism by which the flip occurs (additive perturbation reaching the LM head's argmax margin) generalizes to a Llama-family model; whether the resulting trajectory is better or worse does not.
Following the controller arc's falsification, the project's center of gravity shifts from "build a closed-loop controller" to "map the geometry of trajectory sensitivity, persistence, and branchpoint behavior in Qwen3-1.7B." This is a deliberately scoped agenda — single model, mechanistic questions, explicit stop conditions per question — designed to keep the work falsifiable and finite.
| ID | Question | Stop condition | Status |
|---|---|---|---|
| Q1 | Branchpoint geometry: when are tokens flippable? | Within-Qwen3 held-out AUROC ≥ 0.80 across ≥2 prompts. | Closed (F31) |
| Q2 | Perturbation propagation: how does an injected delta evolve through the residual stream? | Per-layer propagation curves that mechanistically explain F27. | Open |
| Q3 | Basin structure: when does a flip improve vs. degrade output? | Pre-flip generation feature predicts improve-vs-degrade with AUROC ≥ 0.7 on Qwen3-1.7B. | Open (F29 is one cross-model data point) |
The closed-loop controller research direction is paused, not abandoned. We explicitly define the conditions under which it would be reasonable to reopen: the controller returns to active investigation when any two of the following become true.

- A trigger: the mapping program surfaces a signal that correlates with downstream output failure rather than prose structure (the Q2/Q3 line of work).
- A layer: Q2's per-layer propagation curves identify a principled choice of act_layer instead of the current trial-and-error.
- A gate: the Q1 branchpoint predictor (F31) can restrict intervention to steps where a small perturbation can actually flip the next token.

If two of these land, a controller redesign experiment using the new trigger, layer, and gate becomes worth running. If none land within the mapping program, the mechanistic findings (F25, F29, F31, plus Q2 and Q3 results) stand on their own as an interpretability contribution and the controller stays paused.
What does not justify reopening the controller is more tuning of the existing design space. The combinations (scaling | additive-random | additive-EMA-opposing | additive-anchor-opposing) × (L=−1, L=−2, L=−3) × (magnitude ∈ {0.3, 0.6, 0.8, 1.2}) have all been tested and are recorded in §10 and the RESEARCH_CONTROLLER.md archive. Reopening this space without new information from the mapping work would be blind tuning.
Observer is designed to produce artifacts that can support publishable claims, not just exploratory analysis. The compute environment for the reported experiments is a single NVIDIA H200 GPU via RunPod.
Every run produces a config hash (SHA-256 of the full experiment configuration, sorted-key JSON) and a seed cache fingerprint (statistics of the first-layer key cache). These allow reconstruction of run identity and verification that two runs claiming to share a branchpoint actually do. The full experiment configuration, trajectory data, and computed metrics are written to structured JSON artifacts for every run.
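A sketch of the config hash as described; the fingerprint statistics themselves are repository-specific.

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    # SHA-256 of the full experiment configuration, serialized as sorted-key JSON.
    blob = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```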
The included REPRODUCIBILITY.md specifies a reporting checklist for public claims: pin the commit hash in every figure caption; report model key, backend, seed, and intervention settings; run at least 3 seeds per comparison; report mean + confidence interval, not the best run; publish the raw results.json used for plots. This is a standard that is routinely absent from published intervention work.
The CI workflow runs a compileall pass over the active runtime package and the orchestrator scripts on every push, plus a unittest discover pass that exercises a small suite of guard tests. The guard tests catch the kind of integration regression a previous Codex audit caught manually: CLI semantic-layer parsing, branchpoint-analyzer default values, README quickstart pointing at the unified runtime entry points rather than the legacy v1/v1.5/v2 scripts. The repository keeps two top-level research documents: RESEARCH.md tracks the active mapping program and is the entry point for any new session, and RESEARCH_CONTROLLER.md archives the completed controller arc with full F-numbered findings so subsequent agents can cite established facts without re-deriving them. docs/RESEARCH_WORKFLOW.md documents the experiment handoff protocol both documents follow.
Several limitations constrain v2's claims, in addition to the central limitation documented in §10 (the divergence trigger does not measure what closed-loop stability needs it to measure).
Single-model scope. The mapping program in §12 is explicitly scoped to Qwen3-1.7B. F25 Part A (the flip mechanism) replicates on TinyLlama-1.1B (F29); F25 Part B (the basin direction) does not. F31's branchpoint predictor is within-Qwen3 only. Cross-model generalization is a deferred research question, not a settled one. Anyone applying these results to a different architecture should expect F31's specific predictive features (e.g., the sign of spectral.permutation_change) to need recalibration per model.
Prompt-class diversity. F31 closes Q1 on Qwen3 with two prompts — one procedural ("Write step-by-step instructions for baking sourdough bread.") and one descriptive ("Describe the water cycle in a few sentences.") — and finds that predictive features differ between them. Three-or-more prompt classes (reasoning, code, creative) would harden the prompt-class-dependent geometry claim and is an immediate Q1-extension experiment.
Sampling. v1 noted that all generation used greedy argmax. v2 added temperature/top-p/top-k sampling support, but a subtle finding emerged: with matched seeds, torch.multinomial can produce the same drawn token from slightly different conditional distributions, so logit-shift effects are masked at the token level even when present. Future work should use unmatched branch seeds when measuring perturbation effects on sampling-mode generation.
Controller-mode logit features not logged. A natural extension of the F31 branchpoint predictor would include top-2 logit margin and per-step logit entropy as features. Both are architecture-invariant and should be the most direct mechanistic predictors of flippability. Neither is currently written into events.jsonl; adding them is queued as a small code task and would enable a more principled cross-model F31 follow-up.
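Both features are one-liners once the per-step logits are in hand; a sketch:

```python
import torch

def logit_features(logits: torch.Tensor) -> dict:
    # logits: (vocab,) pre-softmax scores for the current step.
    top2 = torch.topk(logits, 2).values
    probs = torch.softmax(logits, dim=-1)
    return {
        "top2_margin": (top2[0] - top2[1]).item(),                        # argmax margin
        "entropy": -(probs * probs.clamp_min(1e-12).log()).sum().item(),  # step entropy
    }
```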
Asserted controller weights. The 70/15/10/5 weighting in the v1 composite score (§9) was a design choice, not derived from empirical optimization. In v2 this is moot — the composite-score-driven controller is paused as a class — but if controller research resumes the weighting choice should be re-derived from the new trigger signal rather than carried forward.
Architecture coverage. Layer discovery currently handles Llama/Qwen-style (model.model.layers), GPT-2/GPT-J (transformer.h), GPT-NeoX (gpt_neox.layers), and encoder-decoder (model.decoder.layers). Models tested in this paper: Qwen3-1.7B (28 layers, primary) and TinyLlama-1.1B-Chat (22 layers, scope check only). Falcon, Mistral (sliding-window attention), Gemma, Phi, and Mamba would require additional handling and are out of scope for this preprint.
VAR(1) window constraints. With window size 8, the VAR(1) model is fit on 7 transitions in 64-dimensional space. The ridge regularization (λ=0.01) stabilizes the regression, but statistical power of the prediction error signal is limited, particularly in the first few tokens before the window fills. This is part of why F28's "divergence measures prose surprise" framing makes sense: the predictor is fitting short-window local trajectory dynamics, which are legitimately disrupted at semantic-unit boundaries in normal generation.
This project was developed by an independent researcher without formal ML or software engineering training, with no prior programming experience, and without institutional funding. Implementation was carried out through iterative AI-assisted coding workflows over evenings and weekends on rented compute.
We report this as methodological context. The research questions, experimental decisions, and acceptance criteria were set by the author; AI assistants provided implementation support for code generation and revision. All code, runs, and claims were human-reviewed against run artifacts before inclusion in this paper.
v1 listed three planned validation experiments. v2 reports what happened to each.
v1 Experiment A (minimal downstream correlation). Status: partially answered, with a different answer than expected. The intended test was "do divergence statistics differ between correct and incorrect outputs." What we found instead is that divergence reliably spikes at structural boundaries in well-formed prose — paragraph breaks, semantic transitions, numbered list markers, word-starts. Most "high divergence" steps are not associated with incorrect output; they are associated with normal writing. This redirected the research from "use divergence as a hallucination detector" to "characterize what divergence actually measures" (F28).
v1 Experiment B (attractor-basin replication). Status: did not replicate as a controller-property claim. v1's headline result on Qwen2.5-7B (controller aggressiveness selects which incorrect-claim attractor the model lands in) could not be reproduced on Qwen3-1.7B because scaling at L=−1 has zero effect on Qwen3 (F17/F22). The basin-selection phenomenon may still be real on Qwen2.5-7B; we cannot confirm or deny without a re-run with the measurement controls v2 added. We do not currently plan that re-run because the mechanistic finding (F25 branchpoint hijacking) supersedes it as a more general and more measurable phenomenon.
v1 Experiment C (signal baseline comparison). Status: partially superseded. The intended test was "does VAR(1) divergence outperform simpler signals." The v2 work did not run that comparison head-to-head, but F31's univariate-feature AUROCs offer a partial answer: the strongest single feature for branchpoint flippability is spectral.permutation_change (a time-axis spectral statistic, AUROC 0.74 alone) or step_idx (literally token position, AUROC 0.80 alone on water-cycle). Neither is the VAR(1) divergence signal. This is a gentle hint that simpler signals may carry much of what divergence carries, and a head-to-head benchmark is worth doing. Queued as a follow-up.
New v2 work queued. Q2 (perturbation propagation) and Q3 (basin structure) — both defined in §12 with stop conditions — are the next planned experiments. Q2 is ~1 hour of offline analysis on existing stress runs; Q3 requires a small new experiment matrix (5 prompt classes × 3 seeds × 2 cells = 30 control runs, ~5 minutes with the warm-model daemon). Closing both within the mapping program would either trigger the controller-return criteria in §12 or establish a clean negative result on closed-loop stability for this model class.
Observer started as an attempt to build a closed-loop stability controller for autoregressive language model generation: detect destabilization in real time, apply proportional damping, observe recovery. The instrument we built does most of what we set out to build — deterministic branchpointing, per-token telemetry, hooked interventions, real-time controller logic — but the central control claim did not survive contact with our own validation experiments. The trigger signal we had been calling "trajectory instability" turned out to measure token-level prose surprise: word-starts, semantic transitions, structural boundaries. A controller built on that signal is, in effect, fighting normal writing.
Reporting the falsification matters. The v1 paper presented closed-loop control as a working contribution, with a striking attractor-selection result on Qwen2.5-7B that, we now suspect, lacked the controls to rule out simpler explanations. The v2 work makes the central claim falsifiable, runs the falsification, and reports it. That is the value of building a research instrument before claiming a research result with it.
What remains is a useful instrument and one positive mechanistic finding. The instrument — SeedCache branchpointing, the unified runtime, the warm daemon, the decoupled measure/act layer hooks, the corrected token-time spectral probe — is sound and reusable. The positive finding — branchpoint hijacking on Qwen3-1.7B with a within-model AUROC predictor that clears 0.80 on two prompt classes — is a concrete interpretability result that the field could build on. The mapping program in §12 organizes what comes next, with explicit stop conditions and explicit criteria for reopening the controller question if the data warrants it.
The control theory framing remains intentional, but its meaning has changed. An observer in the control engineering sense estimates internal state from external outputs. The observer here is now best understood as exactly that — a state estimator and characterization tool for transformer trajectories — without the active feedback loop that the v1 framing claimed and v2 falsified. Whether a different trigger signal (one that actually correlates with downstream output failure) could rebuild a working controller is an open question this work has not answered, and is the central question for any v3.
This project was developed by an independent researcher without formal ML or software engineering training and with no prior programming experience, using AI-assisted implementation workflows and rented compute. We include this as methodological context. The decision to invest in falsifying the v1 central claim — rather than continuing to tune it — was set by the author and implemented against artifacts that were jointly reviewed before being treated as evidence. All code, runs, and claims were human-reviewed against generated artifacts. The complete F-numbered evidence chain (F1–F31) is preserved in RESEARCH_CONTROLLER.md and RESEARCH.md in the repository, with full per-experiment run identifiers so any claim in this paper can be traced to its underlying run artifacts.
Repository: github.com/aeon0199/observer
License: MIT. Cite via CITATION.cff.
Selected references:
- Nanda (2022). TransformerLens.
- Wu et al. (2024). pyvene.
- Zou et al. (2023). Representation Engineering.
- Li et al. (2023). Inference-Time Intervention.
- Raj et al. (2023).
- Huang et al. (2023).
- Rodriguez et al. (2025). LinEAS. NeurIPS 2025; arXiv:2503.10679.
- Cheng et al. (2025). FASB. arXiv:2508.17621.
- Grant et al. (2025). arXiv:2511.04638.
- Hu et al. (2025). HARP. arXiv:2509.11536.
- Shapiro, Taneja, and Goel (2026). HALT. arXiv:2602.02888.
- Johnson & Lindenstrauss (1984). Extensions of Lipschitz mappings into a Hilbert space.