observer: Runtime Instrumentation for Trajectory Mapping in Language Models

Abstract

We present observer, an open-source runtime stack for studying perturbation dynamics in autoregressive language models. The system provides four independently usable protocol layers: a three-stage hysteresis protocol for measuring perturbation persistence, a single-pass observability runner with streaming diagnostics, a deterministic intervention engine built around a SeedCache branchpoint design that eliminates common confounds in intervention experiments, and a closed-loop adaptive controller that applies proportional damping in response to a per-token divergence signal. The divergence signal is derived from a VAR(1) model fit on projected hidden states, providing a held-out one-step prediction error.

We use this instrument to falsify our own initial closed-loop stability control hypothesis on Qwen3-1.7B and report what the experiments actually showed: the divergence signal correlates with token-level prose surprise (word-starts, structural boundaries, semantic transitions) rather than dynamical instability in a control-theoretic sense. Closed-loop control over this signal is not effective on the tested model. We then report a positive finding that emerged from the same apparatus: a branchpoint hijacking phenomenon in which final-layer additive perturbations can flip individual tokens at predictable positions, characterized by a within-model classifier achieving AUROC 0.82 on a procedural prompt and 0.86 on a descriptive prompt (Qwen3-1.7B, held-out pair-level evaluation). The classifier's predictive features differ between prompt classes, suggesting prompt-class-dependent trajectory geometry. We describe the architecture, the experimental protocols, the falsification arc, and a mapping program organized around three open questions with explicit stop conditions and controller-return criteria.

§2 Related Work

Observer occupies a space adjacent to several lines of existing work, without directly duplicating any of them.

Intervention Tooling

TransformerLens (Nanda, 2022) provides the dominant toolkit for mechanistic interpretability research: model loading, hook-based activation capture and modification, and a large community of research built on its abstractions. It is an exploration tool — excellent for research notebooks and circuit analysis, not designed around systematic experimental protocols or recovery measurement.

pyvene (Wu et al., 2024) formalizes interventions as first-class serializable primitives, enabling composable intervention specifications across locations, granularity, and sequence position. It is an execution library: it provides the mechanics of intervention without opinions about experimental design, hysteresis, or recovery.

nnsight provides a Pythonic interface for local and remote model execution, including access to frontier models via the NDIF infrastructure. Observer supports nnsight as an optional backend, inheriting its remote execution capabilities.

Representation Engineering and Steering

The Representation Engineering paper (Zou et al., 2023) demonstrated that model behavioral tendencies can be read from and written to activation space via linear probes and steering vectors. The Inference-Time Intervention paper (Li et al., 2023) applied shifted activations at inference time, improving TruthfulQA performance from 32.5% to 65.1%. Neither line of work focused on recovery dynamics or closed-loop feedback.

LLM Stability

Recent work on LLM output consistency (Raj et al., 2023; Huang et al., 2023) characterizes stability at the output level — how often does the same model produce the same answer across runs? Observer operates at a different layer: activation-level perturbation dynamics within a single generation, not output-level consistency across generations.

Recent Developments (2025–2026)

LinEAS (Rodriguez et al., NeurIPS 2025; arXiv:2503.10679) trains activation steering end-to-end with a global distributional loss, showing that locally tuned maps produce unintended downstream shifts when applied out-of-sample. Observer's adaptive controller is designed to detect and respond to such downstream cascades in real time.

FASB (Cheng et al., 2025; arXiv:2508.17621) dynamically determines intervention necessity and strength by tracking internal states during generation, with a backtracking mechanism to correct deviated tokens. Observer shares the adaptive framing but adds deterministic branchpointing and explicit recovery measurement, quantifying whether the trajectory recovered or remained shifted after intervention ended.

Grant et al. (2025; arXiv:2511.04638) provide a theoretical treatment of how causal interventions can push representations off the model's natural manifold, distinguishing benign null-space divergences from pernicious ones that activate dormant pathways. Observer's PLASTIC and DIVERGENT regime classifications can be interpreted through this taxonomy, offering empirical runtime signatures for divergence types their framework characterizes theoretically.

HARP (Hu et al., 2025; arXiv:2509.11536) decomposes hidden state space into semantic and reasoning subspaces via SVD of the unembedding layer, achieving AUROC 92.8% on TriviaQA hallucination detection. Observer's windowed SVD probe tracks effective rank dynamically within a generation rather than using static subspace decomposition for classification, a complementary signal.

HALT (Shapiro, Taneja, and Goel, Feb 2026; arXiv:2602.02888) treats token log-probability sequences as a time series for lightweight hallucination detection without requiring internal model access. Observer's VAR(1) predictor applies a related time-series framing to hidden state trajectories, a white-box signal that feeds an active intervention loop rather than a post-hoc detector.

The Gap

Observer's distinguishing contribution is the combination of three things none of the above provide together: deterministic branchpointing (identical model state for both branches, eliminating confounds), recovery measurement (quantifying whether the trajectory returns to baseline after intervention ends), and closed-loop control (a token-level controller that detects instability and intervenes in real time). Existing tools execute interventions. Observer measures what they do, whether the model recovers, and acts on that signal continuously.

§4 SeedCache: Deterministic Branchpointing

The central design problem in intervention experiments is confounding. A naive implementation runs the baseline and intervention branches from separate forward passes over the same prompt. This introduces at minimum: different random number generator states at the point of token sampling (even under greedy decoding, CUDA operations can have ordering nondeterminism), and potentially different attention mask states depending on the batching implementation.

The SeedCache resolves this by running the prompt exactly once, then cloning the resulting model state for both branches:

# Run prompt once, snapshot pre-generation state
def build_seed_cache(model, tokenizer, device, prompt, layer) -> SeedCache:
    hook = _HiddenCaptureHook()
    handle = layers[layer].register_forward_hook(hook)
    
    with torch.no_grad():
        outputs = model(input_ids, use_cache=True, return_dict=True)
    handle.remove()
    
    return SeedCache(
        past_key_values = outputs.past_key_values,   # full KV cache
        next_token_logits = outputs.logits[:,-1,:],  # first token dist
        seed_hidden = hook.captured,                 # hidden @ layer
        fingerprint = compute_cache_fingerprint(...) # checksum
    )

# Both branches start from identical state
baseline_cache = seed_cache.clone()
intervention_cache = seed_cache.clone()
# SeedCache.clone() deep-copies past_key_values via clone_past_key_values()
# handles DynamicCache, legacy tuple-of-tuples, and generic objectscache.py

The fingerprint — derived from the first-layer key cache statistics — provides a checksum that experiments can log to verify both branches genuinely share a common origin. This is the kind of rigor that most published intervention papers treat as an implementation detail but actually matters for result validity.

Why this matters: Without a shared branchpoint, "recovery" measurements conflate genuine behavioral change with noise introduced by divergent initial conditions. The SeedCache makes the comparison meaningful.

§5 The Divergence Signal

The core signal feeding both the observability runner and the adaptive controller is a per-token held-out prediction error from a VAR(1) model fit on a sliding window of projected hidden states.

Projection

The hidden state h_t ∈ ℝ^D (where D is the model's hidden dimension, typically 4096–8192) is projected to a fixed low-dimensional space via a deterministic Rademacher matrix:

z_t = h_t · P, P ∈ {±1/√k}^D×k, k = 64
P is fixed for the run lifetime (seeded, deterministic)

The Rademacher projection preserves inner products in expectation (Johnson-Lindenstrauss), reduces the regression problem from D-dimensional to k-dimensional (k=64), and is computed once per hidden dimension via a seeded RNG — making it reproducible across runs and comparable across model families with different hidden sizes.

VAR(1) Dynamics

A first-order vector autoregressive model is fit on the sliding window W = {z_t-n, ..., z_t-1} via ridge regression:

z_t ≈ z_t-1 · A, A ∈ ℝ^k×k
(X^TX + λI) A = X^TY, λ = 0.01

Critically, the matrix A is fit on the window excluding the newest state z_t. The prediction ẑ_t = z_t-1 · A is then compared to the actual observed z_t. This is a held-out evaluation: the model is never trained on the transition it is asked to predict. This matters because in-sample VAR(1) error on a short window would collapse toward zero regardless of actual trajectory instability.

Divergence Score

The per-token scalar divergence combines normalized L2 error and cosine distance with a symmetric denominator to avoid blow-ups when projected norms are near zero:

L2_norm = ||ẑ_t − z_t|| / (0.5 · (||ẑ_t|| + ||z_t||) + ε)
cos_dist = 1 − (ẑ_t · z_t) / (||ẑ_t|| · ||z_t|| + ε)
divergence = 0.7 · L2_norm + 0.3 · cos_dist

When the hidden trajectory is locally predictable, the VAR(1) fit is good and divergence is low. When generation dynamics shift — through perturbation, distributional shift in the prompt context, or internal instability — the held-out prediction error increases. The signal is cheap: one matrix multiply per token in 64-dimensional space.

def step(self, hidden: torch.Tensor) -> float:
    z = self._project(hidden)          # (D,) → (64,)
    self._window.add(z)                # FIFO buffer, maxlen=8
    
    if len(self._window) < 3:
        return 0.0
    
    states = self._window.matrix()     # (T, 64)
    train  = states[:-1, :]           # exclude newest
    A      = _fit_var1_ridge(train)    # fit on T-1 transitions
    
    pred   = states[-2, :] @ A       # predict from t-1
    actual = states[-1, :]           # held-out: actual t
    
    return _divergence(pred, actual)["combined"]predictor.py

§6 Supplementary Diagnostics

The divergence signal is the primary input to the controller, but the V1.5 observability runner and the adaptive controller also compute three supplementary diagnostics that provide corroborating signal and richer telemetry for offline analysis.

Spectral Diagnostics

v2 correction. The original spectral module FFT'd the flattened hidden-state vector along the feature-index axis and reported entropy, flatness, centroid, and band fractions over that spectrum. The v1 paper acknowledged that "the feature index is not a temporal axis" but defended the metrics as a stable characterization of activation energy distribution. This defense does not survive a permutation test: neuron ordering in transformer hidden states is arbitrary (a function of weight initialization, not semantics), and any neuron-axis FFT summary is a function of that arbitrary ordering. Permuting neurons changes every reported metric; the underlying activation is unchanged.

The v2 implementation rewrites this module as a token-time spectral probe: hidden states are accumulated into a sliding window of shape [T, D] and the FFT is taken along the time axis (dim=0). Per-frequency power is then averaged across the D dimensions, producing a scalar trajectory spectrum. This captures real structure — slow drift vs. high-frequency oscillation in activation patterns across generation steps — and is invariant to neuron permutation. A built-in self-test reports a non-zero permutation-change ratio whenever the window has at least 8 tokens, confirming the time axis is in fact what's being analyzed.

The corrected metrics, computed on the time-axis trajectory spectrum, are:

Metric	Description
`spectral_entropy`	Normalized Shannon entropy of the time-axis power spectrum. High = energy spread across slow and fast trajectory frequencies.
`spectral_flatness`	Geometric mean / arithmetic mean of power. Approaches 1.0 for white-noise trajectories, 0.0 for tonally pure ones.
`centroid`	Normalized frequency centroid ∈ [0,1]. High centroid = trajectory dominated by step-to-step oscillation rather than slow drift.
`high_frac`	Fraction of power in the upper 20% of trajectory frequencies.
`rolloff_85`	Normalized frequency below which 85% of cumulative power falls.
`permutation_change`	(new in v2) Self-test ratio comparing the spectrum of the actual trajectory to the spectrum of a randomly time-permuted version of the same window. Should be > 0 — confirms time-axis behavior. If a future regression makes this near zero we know the module has reverted to neuron-axis behavior.

Empirically, permutation_change turned out to be the strongest single feature for predicting branchpoint flippability on Qwen3-1.7B in §11.5 — a feature that was conceived as a methodology self-test ended up carrying real signal about trajectory geometry.

Windowed SVD

A window of hidden vectors {h_t-w, ..., h_t} ∈ ℝ^W×D is stacked into a matrix X and its singular value decomposition computed via the Gram trick: eigenvalues of XX^T (a W×W matrix with small W) yield the squared singular values without requiring the full D×D computation. An SVD of a single vector returns only the vector norm — uninformative. The windowed approach captures the local rank structure of the trajectory: whether the model is moving through a low-dimensional manifold or exploring higher-dimensional space.

Effective rank is computed as exp(H(p)) where p is the normalized singular value distribution — the exponential of the entropy of squared singular values. A drop in effective rank signals that the trajectory is collapsing onto a lower-dimensional subspace, a potential precursor to repetition or mode collapse.

Layer Stiffness

At three probed layers (early / mid / late), the velocity norm v_t = ||h_t^L − h_t-1^L||₂ is tracked over a sliding window. Mean velocity defines stiffness; the linear slope of velocity over the window defines stiffness trend. Elasticity = 1/(1 + stiffness) provides a bounded stability score in (0,1]. This is a diagnostic proxy, not a physical quantity.

§7 The Hysteresis Protocol

The baseline hysteresis module implements a three-stage experimental protocol for measuring how much of a perturbation's effect persists after the perturbation is removed.

Stage 1: BASE
  ─────────────────────────────────────────────────────
  Prompt → SeedCache → greedy generation
  Capture: hidden_norm, entropy, logit_norm, SVD spectrum
  
Stage 2: PERTURB
  ─────────────────────────────────────────────────────
  Same SeedCache + Delta instruction injected
  Capture same statistics
  KV cache retained for Stage 3
  
Stage 3: REASK
  ─────────────────────────────────────────────────────
  Continue from PERTURB's KV cache
  Minimal re-ask (no repeated prompt)
  Perturbation still in context; does model return to BASE?

Metrics:
  D = composite distance(BASE, PERTURB)      ← drift
  H = composite distance(BASE, REASK)        ← hysteresis
  R = 1 - H / (D + ε)                       ← recovery ∈ (-∞, 1]

The composite distance used in the metric computation draws on four component signals: relative hidden norm difference, entropy distance, relative logit norm difference, and SVD spectral distance (normalized L2 between top singular value vectors). These are combined with equal weights except logit norm (0.5×), reflecting that hidden state geometry carries more signal than logit magnitude.

Recovery R is classified into four regimes:

ELASTIC

R > 0.8. Model substantially returns to baseline behavior despite perturbation remaining in context.

PARTIAL

0.4 < R ≤ 0.8. Partial recovery; residual perturbation effect visible in trajectory statistics.

PLASTIC

0 ≤ R ≤ 0.4. Perturbation effect persists significantly. Model has been durably steered.

DIVERGENT

R < 0. REASK is further from BASE than PERTURB was. Perturbation has amplified rather than decayed.

This taxonomy provides vocabulary for characterizing perturbation experiments that the field currently lacks. Whether a given prompt-perturbation pair produces elastic, plastic, or divergent behavior is a property of the model that is currently unknown for most practically relevant perturbation types.

§8 Intervention Engine

The intervention engine is the core experimental workhorse. It runs baseline and intervention branches from a shared SeedCache, captures full hidden trajectories from both, and computes a rich set of comparison metrics.

Intervention Types

Type	Operation	Parameters
additive	Add a unit random vector scaled by magnitude to last-token hidden state.	`magnitude`, `seed`
projection	Project out a random k-dimensional subspace: h ← h (I − QQ^T)	`subspace_dim`, `seed`
scaling	Multiply last-token hidden state by scalar s.	`scale`
sae	Steer along SAE decoder column for a specified feature index.	`sae_repo`, `feature_idx`, `strength`

Hooks are registered with register_forward_hook and removed in finally blocks. Critically, the intervention is applied before the hook captures the hidden state — so the captured tensor reflects what downstream layers actually receive, not the pre-intervention value. This is the correct ordering that many published implementations miss.

Trajectory Comparison

The TrajectoryComparison object implements a layered fallback strategy for computing per-token distances between branches:

The primary metric is cosine distance on actual hidden vectors (preferred). If hooks fail to attach and hidden vectors are unavailable, it falls back to Jensen-Shannon divergence on the logit distributions. If logits are also unavailable, it falls back to normalized L2 on hidden norms. The code documents this explicitly: "hidden_norm alone is not sufficient — the same norm can hide large vector changes."

Recovery is computed over the post-intervention window: deviation_during (mean primary metric during active intervention), final_distance (primary metric at final token), recovery_ratio = (deviation_during − final_distance) / deviation_during, and convergence_rate (negative slope of primary metric over post-intervention tokens via linear fit).

§9 Adaptive Controller

The adaptive controller closes the loop: per-token diagnostics drive a proportional scaling intervention that damps the hidden state when the composite score exceeds a threshold.

v2 status. The architecture in this section is unchanged from v1, and the implementation runs as described. What changed is the empirical story. The controller is no longer presented as a working component with interesting attractor-selection behavior; it is presented as a falsifiable hypothesis that we falsified. §10 reports the failure modes (silent no-op at L=−1 with scaling, over-actuation at higher additive magnitudes, destructive cascade when acting earlier in the stack). The controller code remains in the repository as research scaffolding for a future redesign — the criteria under which controller research would resume are listed in §11.7.

Composite Score

score_t = 0.70 · divergence_t + 0.15 · max(0, spectral_entropy_t − 0.75) + 0.10 · max(0, high_frac_t − 0.30) + 0.05 · |eff_rank_t − eff_rank_t-1|

The spectral and SVD terms are gated — they only contribute when they exceed a baseline level (spectral entropy above 0.75, high-frequency fraction above 0.30), to avoid penalizing normal variation. The rank delta term detects sudden changes in trajectory dimensionality.

Control Logic

A 3-token moving average of the score is computed. When the smoothed score exceeds a threshold, the controller applies a scaling intervention to the last-token hidden state at the monitored layer, then enters a cooldown period during which the scale is held and further threshold evaluations are suppressed:

Status	Condition	Scale Applied	Cooldown
STABLE	avg_score ≤ 0.55	1.0 (no intervention)	—
WARNING	0.55 < avg_score ≤ 0.85	0.90	3 tokens
CRITICAL	avg_score > 0.85	0.75	6 tokens
COOLDOWN	Post-intervention hold	Held from trigger	Counting down

The scaling intervention multiplies the hidden state: h_t ← s · h_t. This reduces the magnitude of the current representation, which typically reduces the entropy of the downstream logit distribution and pulls the model toward its modal behavior. The mechanism is simple and its effects are legible — a deliberate choice given that the controller is a research instrument, not a production component.

Shadow Mode

When --shadow is set, the controller observes and logs its decisions but does not apply the scaling hook. This allows calibration of threshold and weight parameters on a given model and prompt distribution before active deployment. The separation of observation and actuation is explicit in the code: if (scale_used < 1.0) and (not shadow): hook.set_active(True).

§10 Experimental Results: Controller Arc on Qwen3-1.7B

v2 replacement. The v1 §10 reported two experiment families on Qwen2.5-7B that, in retrospect, lacked controls we now know are required: per-step intervention-applied counts, decoupled measure/act layers, and pair-level shadow/active comparison. The v1 headline ("controller aggressiveness determines which attractor the model lands in") was reported as a controller property but is consistent with two alternative explanations we could not rule out at the time: Qwen2.5-specific scaling behavior, and prompt-seed-specific basin topology independent of the controller. The v1 results are preserved in the repository's RESEARCH_CONTROLLER.md archive for historical context.

In v2 we ran a longer arc on Qwen3-1.7B (28 layers) with explicit controls. We report it here in the order the experiments happened, including the negative results, because the negative results determine the structure of the rest of the paper. The model used throughout this section is Qwen3-1.7B; generalization to other architectures is treated as out of scope for the current mapping program (one cross-model scope-check is reported in §11.5).

F4 — The original closed-loop A/B test

On the prompt "Write step-by-step instructions for baking sourdough bread.", the closed-loop controller at its v1 design defaults (act_layer=−1, intervention_type= scaling, scale_warn=0.90, scale_crit=0.75) produced essentially the same per-token divergence as shadow mode. avg_raw_div: 0.765 (shadow) vs 0.768 (active). Warning counts: 8 vs 7. Critical counts: 1 vs 1. With 5 seeds × 1 prompt the difference is well within seed variance. This was the first signal that something in the pipeline did not work as advertised.

F17 — Why scaling at the final layer has zero effect

A diagnostic stress run isolated the cause. With scale=0.5 (halving the final-layer hidden state) we measured logit_kl_mean_during = 0.0000 across 5 seeds. With scale=2.0 (doubling), also 0.0000. token_match_rate = 1.000 in both cases. The scaling intervention was a true no-op at L=−1.

The mechanism is structural: Qwen3 places an RMSNorm between the last transformer block and the LM head. Scaling the input to that norm by any constant factor is erased — the norm rescales to unit variance, the LM head sees an essentially identical input, and the argmax is unchanged. The closed loop in F4 was firing the controller (the trigger was active), the scaling intervention was applied, and the intervention had no downstream effect. The closed loop was open at the actuator.

F18 — Additive intervention is the right actuator class at L=−1

A 4-way intervention-type comparison at the same layer (additive, scaling@0.5, scaling@2.0, projection-onto-64-dim-subspace) clarified what does work. Additive perturbation with relative magnitude 1.0 produced logit_kl_mean_during = 10.40 ± 2.78 across 5 seeds (DSR = 3.73), with token_match_rate = 0.145 — that is, 85% of generated tokens differed from baseline. Projection produced larger logit shifts but always landed in a runaway regime. Scaling at any magnitude produced exactly zero. The conclusion was that at L=−1 on Qwen3-1.7B, additive perturbation is the only intervention class that reliably reaches the LM head decision distribution.

F21 — Random additive in a closed loop is net-zero

Returning to the F4 setup with additive replacing scaling and a random seeded direction: shadow vs. active avg_raw_div = 0.6732 vs 0.6728 (Δ = +0.0004 across 5 seeds). Per-seed, the active runs split — some flipped a few tokens and improved divergence, others flipped tokens and worsened it. The expected value of a random direction was zero, and that is what the data showed. This was the first direct evidence that the controller's actuator was now reaching the model (token_match_rate ≠ 1.000 on most seeds) but its effect was undirected.

F23 / F24 — Drift-opposing direction helps marginally and unreliably

Two follow-up experiments replaced the random additive direction with a drift- opposing one. The implementation maintains a reference hidden state (either an EMA of recent hidden states or a frozen anchor from the first N clean tokens), computes drift = h_current − h_reference at each step, normalizes, and injects −β·drift_direction as the corrective additive delta. The intuition matches a textbook proportional controller pulling toward a setpoint.

EMA reference (F23): avg_raw_div 0.6547 vs shadow 0.6732 — a 2.74% improvement, 0.106 shadow-stdevs. Anchor reference (F24): 0.6537, an additional 0.001 improvement. Both improvements are concentrated in a single seed where the controller's first intervention coincided with a token-flip that diverted the trajectory into a more coherent basin (the F25 phenomenon, characterized below). Other seeds were unchanged or slightly worse.

F25 — Branchpoint hijacking: the actual mechanism behind F23/F24's gains

Per-step trace analysis of F23/F24 revealed that the small aggregate improvements were not closed-loop control. On one of the F24 seeds, the controller's first intervention at step 7 flipped a single token (a leading whitespace became "Use"), and the subsequent generation entered an entirely different output basin — a coherent recipe with explicit ingredients ("100g flour, 100g water, 100g…") in place of shadow's degenerate numbered-list stub ("1. Prepare 2. Mix 3. Let…"). On other seeds with already-coherent baselines, the controller fired repeatedly, flipped no tokens, and active output was character-identical to shadow.

This pattern is reproducible: small final-layer additive perturbations can flip individual tokens at branchpoints where the LM head's top-2 logit margin is small, and the resulting trajectory enters a different attractor. The visible "controller helps" effect is one such hijack landing in a better basin. The visible "controller does nothing" effect is the controller firing in regions where its perturbation is smaller than the local logit margin. The controller is not stabilizing trajectories. It is occasionally redirecting them at branchpoints, and whether the new basin is better or worse is a property of the basin, not the controller.

F26 — The measurement-layer confound

A first attempt to fix F25 by moving the actuation layer one step back (act_layer = −2) produced what looked like a clean 1.7σ improvement on aggregate avg_raw_div. Inspection of intervention_applied counts in events.jsonl showed the controller fired exactly 1 time across 5 active runs. The "improvement" was entirely an artifact of moving measure_layer from −1 to −2 in the same step (the original implementation required them to be equal). Divergence at L=−2 is naturally lower than at L=−1, and the comparison was apples-to-oranges. The controller was a spectator.

Resolving this required a small code change: decouple measure_layer from act_layer in the runtime engine, with separate forward hooks for measurement (capture-only) and actuation (modify). Once decoupled, the layer-move hypothesis could be tested honestly.

F27 — Layer-move is falsified once measured correctly

With measure_layer=−1 held fixed (identical signal to shadow) and act_layer ∈ {−1, −2, −3}, on the same 5-seed sourdough suite:

Cell	avg_raw_div	Δ vs shadow	interventions fired
SHADOW	0.6732 ± 0.174	—	0
ACT_L−1	0.6537	+0.020 (+0.11σ)	19
ACT_L−2	0.7057	−0.033 (−0.19σ)	22
ACT_L−3	0.6918	−0.019 (−0.11σ)	25

Acting one layer back makes things worse, not better. Acting two layers back is also worse. The controller is firing at similar rates in all three configurations (~20 interventions per 5-seed suite), so this is not a "intervention never fires" artifact. The interpretation is that perturbations earlier in the residual stream cascade through subsequent attention and MLP layers, accumulating drift rather than steering it. This closes out the layer-placement direction of controller redesign.

F28 — Synthesis: divergence measures prose surprise, not dynamical instability

The cumulative result of the controller arc. Across nine findings (F4, F17, F18, F21, F22, F23, F24, F26, F27), every controller variant either (a) does nothing because the actuator is absorbed by an intervening norm layer, (b) does nothing because the perturbation is smaller than local logit margins, (c) opportunistically hijacks one branchpoint per seed and otherwise does nothing, or (d) is destabilizing rather than stabilizing.

The simplest explanation that fits all of these is that the divergence signal we are measuring does not measure what closed-loop stability control needs it to measure. Inspection of the highest-divergence steps in observe runs (replicated from v1's "divergence spikes at structural boundaries" finding, which we confirm) shows that the signal spikes at word-starts, punctuation, the transitions between numbered list items, and the boundaries between semantic units. These are normal features of well-formed prose, not symptoms of trajectory destabilization. Closed-loop intervention on this signal is therefore a controller fighting prose structure.

The instrument is sound. The trigger signal is not what it was framed as. The naive controller redesign space (varying intervention class, magnitude, direction, layer, reference rule) is exhausted within the observe-run regime we can support on commodity hardware.

§11 Branchpoint Hijacking: A Positive Finding

The controller arc was a falsified hypothesis. The same apparatus produced an unfalsified mechanism worth reporting. We call it branchpoint hijacking: additive perturbations applied to the final transformer layer can flip individual tokens at predictable positions in the generation sequence. The mechanism reaches the LM head (unlike scaling, which is absorbed by RMSNorm), it is reproducible across seeds, and it generalizes architecturally. Whether a flip improves or degrades the resulting output is model- and prompt-specific — that is the consequence-side limit, not a flaw in the mechanism.

F25 — The mechanism, observed

A small number of additive perturbations applied during a stress run can produce one or more single-token flips at positions where the LM head's argmax is margin-vulnerable. Once a single token has been flipped, subsequent tokens are drawn from a different conditional distribution, and the trajectory enters a different attractor with its own local dynamics. The visible footprint of a successful hijack on the F23/F24 sourdough runs is one token of difference between shadow and active output, followed by a continuation that is structurally and semantically distinct.

F29 — Architectural generalization vs. consequence specificity

A scope check on TinyLlama-1.1B (Llama family architecture, vs. Qwen-family for Qwen3-1.7B) using the same configuration confirmed that the flip mechanism generalizes. On 3 of 5 seeds with non-degenerate generation, the active cell flipped tokens vs shadow with the controller firing 9–10 times per seed. However, on TinyLlama every hijacked seed landed in a worse basin: avg_raw_div was higher in active than shadow on all three (e.g., seed 2: 1.006 → 1.066, output became "Pleaseincludeingredients,measurements,bakingtime…" with whitespace tokens dropped). This contrasts with Qwen3 sourdough seed 2, where the same mechanism landed the trajectory in a coherent recipe basin.

The conclusion: the perturbation-induced branchpoint flip is architecture-general; whether the new basin is better or worse is a property of the model's basin structure for that prompt, not a property of the perturbation. F25 was effectively two claims in one. The mechanism replicates; the consequence does not.

F31 — Branchpoint flippability is predictable from the clean trajectory

A natural follow-up question: given a step in clean (unperturbed) generation, can we predict whether a small perturbation at that step would flip the next token? We address this offline using existing control-mode runs. For each pair of matched shadow/active runs that share (model, prompt, seed, max_tokens, temperature, measure_layer, act_layer) and where the active run actually fires the controller, we construct per-step training rows: features extracted from the shadow trajectory at step t, label 1 if the active and shadow tokens at step t differ.

Critical methodology note: features must come from the shadow trajectory. An earlier analyzer pulled features from the active trajectory and reported AUROC = 0.80, but the active trajectory contains intervention-downstream fields (intervention_applied, scale_used, controller_drift_norm) that are tautologically correlated with the flip label — non-zero only on steps where the controller fired. The 0.80 was a data leak, not a predictor.

With shadow-trajectory features only, on Qwen3-1.7B with pair-level train/test split (80/20) and matched configurations:

Prompt	Pairs	Step rows	Held-out AUROC
Sourdough (procedural)	12	576	0.82
Water cycle (descriptive)	5	240	0.86

Both prompts independently clear an AUROC threshold of 0.80 within Qwen3-1.7B. The features that drive the prediction differ between prompt classes:

Feature	Sourdough sign	Water cycle sign	Mechanistic interpretation
`step_idx`	+ (later)	+ (later, AUROC 0.80 alone)	Baseline divergence accumulates with sequence length; later tokens have more chances to be at margin.
`spectral.permutation_change`	+ (high)	− (low)	Time-axis spectral structure; sign flips between procedural and descriptive prompts.
`layer_stiffness.−1.elasticity`	+ (low velocity)	+	Slow-moving final-layer activations are at branchpoints more often.
`svd.top1_energy_frac`	+ (concentrated)	—	Trajectory concentrated in dominant direction — local instability.
`spectral.total_power`	− (high)	−	High-energy steps are less flippable.

The sign flip on spectral.permutation_change between procedural and descriptive generation is the most interesting line item. It suggests Qwen3-1.7B drives different trajectory geometries for different prompt classes, and the geometric signature of a "branchpoint" depends on which class. The only feature with a consistent sign across both prompts is step_idx — generation position — which is the universal-but-weak predictor.

The positive finding, stated plainly. Within Qwen3-1.7B, given a step of clean generation, we can predict at AUROC ≥ 0.80 whether a small additive perturbation at that step would flip the next token. The features driving the prediction are geometric properties of the unperturbed trajectory at that step, and those features are prompt-class-dependent. The mechanism by which the flip occurs (additive perturbation reaching the LM head's argmax margin) generalizes to a Llama-family model; whether the resulting trajectory is better or worse does not.

§12 The Mapping Program

Following the controller arc's falsification, the project's center of gravity shifts from "build a closed-loop controller" to "map the geometry of trajectory sensitivity, persistence, and branchpoint behavior in Qwen3-1.7B." This is a deliberately scoped agenda — single model, mechanistic questions, explicit stop conditions per question — designed to keep the work falsifiable and finite.

Three open questions

	Question	Stop condition	Status
Q1	Branchpoint geometry: when are tokens flippable?	Within-Qwen3 held-out AUROC ≥ 0.80 across ≥2 prompts.	Closed (F31)
Q2	Perturbation propagation: how does an injected delta evolve through the residual stream?	Per-layer propagation curves that mechanistically explain F27.	Open
Q3	Basin structure: when does a flip improve vs. degrade output?	Pre-flip generation feature predicts improve-vs-degrade with AUROC ≥ 0.7 on Qwen3-1.7B.	Open (F29 is one cross-model data point)

Controller-return criteria

The closed-loop controller research direction is paused, not abandoned. We explicitly define the conditions under which it would be reasonable to reopen: the controller returns to active investigation when any two of the following become true.

Q1 produces a branchpoint-flippability predictor with precision ≥ 0.7 on Qwen3-1.7B (provides a real trigger to replace the falsified divergence-as-instability trigger).
Q2 identifies a layer with bounded, non-cascading perturbation propagation (provides a principled choice of act_layer instead of the current trial-and-error).
Q3 identifies a feature of the pre-flip generation that predicts whether a flip will improve output (provides a meaningful gate: act only when expected improvement).

If two of these land, a controller redesign experiment using the new trigger, layer, and gate becomes worth running. If none land within the mapping program, the mechanistic findings (F25, F29, F31, plus Q2 and Q3 results) stand on their own as an interpretability contribution and the controller stays paused.

What does not justify reopening the controller is more tuning of the existing design space. The combinations (scaling | additive-random | additive-EMA-opposing | additive-anchor-opposing) × (L=−1, L=−2, L=−3) × (magnitude ∈ {0.3, 0.6, 0.8, 1.2}) have all been tested and are recorded in §10 and the RESEARCH_CONTROLLER.md archive. Reopening this space without new information from the mapping work would be blind tuning.

§13 Reproducibility Infrastructure

Observer is designed to produce artifacts that can support publishable claims, not just exploratory analysis. The compute environment for the reported experiments is a single NVIDIA H200 GPU via RunPod.

Every run produces a config hash (SHA-256 of the full experiment configuration, sorted-key JSON) and a seed cache fingerprint (statistics of the first-layer key cache). These allow reconstruction of run identity and verification that two runs claiming to share a branchpoint actually do. The full experiment configuration, trajectory data, and computed metrics are written to structured JSON artifacts for every run.

The included REPRODUCIBILITY.md specifies a reporting checklist for public claims: pin commit hash in every figure caption; report model key, backend, seed, and intervention settings; run at least 3 seeds per comparison; report mean + confidence interval, not best run; publish raw results.json used for plots. This is the standard that is routinely absent from published intervention work.

The CI workflow runs a compileall pass over the active runtime package and the orchestrator scripts on every push, plus a unittest discover pass that exercises a small suite of guard tests. The guard tests catch the kind of integration regression a previous Codex audit caught manually: CLI semantic-layer parsing, branchpoint-analyzer default values, README quickstart pointing at the unified runtime entry points rather than the legacy v1/v1.5/v2 scripts. The repository keeps two top-level research documents: RESEARCH.md tracks the active mapping program and is the entry point for any new session, and RESEARCH_CONTROLLER.md archives the completed controller arc with full F-numbered findings so subsequent agents can cite established facts without re-deriving them. docs/RESEARCH_WORKFLOW.md documents the experiment handoff protocol both documents follow.

§14 Limitations and Open Questions

Several limitations constrain v2's claims, in addition to the central limitation documented in §10 (the divergence trigger does not measure what closed-loop stability needs it to measure).

Single-model scope. The mapping program in §12 is explicitly scoped to Qwen3-1.7B. F25 Part A (the flip mechanism) replicates on TinyLlama-1.1B (F29); F25 Part B (the basin direction) does not. F31's branchpoint predictor is within- Qwen3 only. Cross-model generalization is a deferred research question, not a settled one. Anyone applying these results to a different architecture should expect F31's specific predictive features (e.g., the sign of spectral.permutation_change) to need recalibration per model.

Prompt-class diversity. F31 closes Q1 on Qwen3 with two prompts — one procedural ("Write step-by-step instructions for baking sourdough bread.") and one descriptive ("Describe the water cycle in a few sentences.") — and finds that predictive features differ between them. Three-or-more prompt classes (reasoning, code, creative) would harden the prompt-class-dependent geometry claim and is an immediate Q1-extension experiment.

Sampling. v1 noted that all generation used greedy argmax. v2 added temperature/top-p/top-k sampling support, but a subtle finding emerged: with matched seeds, torch.multinomial can produce the same drawn token from slightly different conditional distributions, so logit-shift effects are masked at the token level even when present. Future work should use unmatched branch seeds when measuring perturbation effects on sampling-mode generation.

Controller-mode logit features not logged. A natural extension of the F31 branchpoint predictor would include top-2 logit margin and per-step logit entropy as features. Both are architecture-invariant and should be the most direct mechanistic predictors of flippability. Neither is currently written into events.jsonl; adding them is queued as a small code task and would enable a more principled cross-model F31 follow-up.

Asserted controller weights. The 70/15/10/5 weighting in the v1 composite score (§9) was a design choice, not derived from empirical optimization. In v2 this is moot — the composite-score-driven controller is paused as a class — but if controller research resumes the weighting choice should be re-derived from the new trigger signal rather than carried forward.

Architecture coverage. Layer discovery currently handles Llama/ Qwen-style (model.model.layers), GPT-2/GPT-J (transformer.h), GPT-NeoX (gpt_neox.layers), and encoder-decoder (model.decoder.layers). Models tested in this paper: Qwen3-1.7B (28 layers, primary) and TinyLlama-1.1B-Chat (22 layers, scope check only). Falcon, Mistral (sliding window attention), Gemma, Phi, and Mamba would require additional handling and are out of scope for this preprint.

VAR(1) window constraints. With window size 8, the VAR(1) model is fit on 7 transitions in 64-dimensional space. The ridge regularization (λ=0.01) stabilizes the regression, but statistical power of the prediction error signal is limited, particularly in the first few tokens before the window fills. This is part of why F28's "divergence measures prose surprise" framing makes sense: the predictor is fitting short-window local trajectory dynamics, which are legitimately disrupted at semantic-unit boundaries in normal generation.

Development Context

This project was developed by an independent researcher without formal ML or software engineering training, with no prior programming experience, and without institutional funding. Implementation was carried out through iterative AI-assisted coding workflows over evenings and weekends on rented compute.

We report this as methodological context. The research questions, experimental decisions, and acceptance criteria were set by the author; AI assistants provided implementation support for code generation and revision. All code, runs, and claims were human-reviewed against run artifacts before inclusion in this paper.

Future Work: What v1's Validation Roadmap Became

v1 listed three planned validation experiments. v2 reports what happened to each.

v1 Experiment A (minimal downstream correlation). Status: partially answered, with a different answer than expected. The intended test was "do divergence statistics differ between correct and incorrect outputs." What we found instead is that divergence reliably spikes at structural boundaries in well-formed prose — paragraph breaks, semantic transitions, numbered list markers, word-starts. Most "high divergence" steps are not associated with incorrect output; they are associated with normal writing. This redirected the research from "use divergence as a hallucination detector" to "characterize what divergence actually measures" (F28).

v1 Experiment B (attractor-basin replication). Status: did not replicate as a controller-property claim. v1's headline result on Qwen2.5-7B (controller aggressiveness selects which incorrect-claim attractor the model lands in) could not be reproduced on Qwen3-1.7B because scaling at L=−1 has zero effect on Qwen3 (F17/F22). The basin-selection phenomenon may still be real on Qwen2.5-7B; we cannot confirm or deny without a re-run with the measurement controls v2 added. We do not currently plan that re-run because the mechanistic finding (F25 branchpoint hijacking) supersedes it as a more general and more measurable phenomenon.

v1 Experiment C (signal baseline comparison). Status: partially superseded. The intended test was "does VAR(1) divergence outperform simpler signals." The v2 work did not run that comparison head-to-head, but F31's univariate-feature AUROCs offer a partial answer: the strongest single feature for branchpoint flippability is spectral.permutation_change (a time-axis spectral statistic, AUROC 0.74 alone) or step_idx (literally token position, AUROC 0.80 alone on water-cycle). Neither is the VAR(1) divergence signal. This is a gentle hint that simpler signals may carry much of what divergence carries, and a head-to-head benchmark is worth doing. Queued as a follow-up.

New v2 work queued. Q2 (perturbation propagation) and Q3 (basin structure) — both defined in §12 with stop conditions — are the next planned experiments. Q2 is ~1 hour of offline analysis on existing stress runs; Q3 requires a small new experiment matrix (5 prompt classes × 3 seeds × 2 cells = 30 control runs, ~5 minutes with the warm-model daemon). Closing both within the mapping program would either trigger the controller-return criteria in §12 or establish a clean negative result on closed-loop stability for this model class.

§15 Conclusion

Observer started as an attempt to build a closed-loop stability controller for autoregressive language model generation: detect destabilization in real time, apply proportional damping, observe recovery. The instrument we built does most of what we set out to build — deterministic branchpointing, per-token telemetry, hooked interventions, real-time controller logic — but the central control claim did not survive contact with our own validation experiments. The trigger signal we had been calling "trajectory instability" turned out to measure token-level prose surprise: word-starts, semantic transitions, structural boundaries. A controller built on that signal is, in effect, fighting normal writing.

Reporting the falsification matters. The v1 paper presented closed-loop control as a working contribution, with a striking attractor-selection result on Qwen2.5-7B that, we now suspect, lacked the controls to rule out simpler explanations. The v2 work makes the central claim falsifiable, runs the falsification, and reports it. That is the value of building a research instrument before claiming a research result with it.

What remains is a useful instrument and one positive mechanistic finding. The instrument — SeedCache branchpointing, the unified runtime, the warm daemon, the decoupled measure/act layer hooks, the corrected token-time spectral probe — is sound and reusable. The positive finding — branchpoint hijacking on Qwen3-1.7B with a within-model AUROC predictor that clears 0.80 on two prompt classes — is a concrete interpretability result that the field could build on. The mapping program in §12 organizes what comes next, with explicit stop conditions and explicit criteria for reopening the controller question if the data warrants it.

The control theory framing remains intentional, but its meaning has changed. An observer in the control engineering sense estimates internal state from external outputs. The observer here is now best understood as exactly that — a state estimator and characterization tool for transformer trajectories — without the active feedback loop that the v1 framing claimed and v2 falsified. Whether a different trigger signal (one that actually correlates with downstream output failure) could rebuild a working controller is an open question this work has not answered, and is the central question for any v3.

This project was developed by an independent researcher without formal ML or software engineering training and with no prior programming experience, using AI-assisted implementation workflows and rented compute. We include this as methodological context. The decision to invest in falsifying the v1 central claim — rather than continuing to tune it — was set by the author and implemented against artifacts that were jointly reviewed before being treated as evidence. All code, runs, and claims were human-reviewed against generated artifacts. The complete F-numbered evidence chain (F1–F31) is preserved in RESEARCH_CONTROLLER.md and RESEARCH.md in the repository, with full per-experiment run identifiers so any claim in this paper can be traced to its underlying run artifacts.

Repository: github.com/aeon0199/observer

License: MIT. Cite via CITATION.cff.

Selected references: Nanda et al. (2022) TransformerLens. Wu et al. (2024) pyvene. Zou et al. (2023) Representation Engineering. Li et al. (2023) Inference-Time Intervention. Raj et al. (2023); Huang et al. (2023). Rodriguez et al. (2025) LinEAS (arXiv:2503.10679). Cheng et al. (2025) FASB (arXiv:2508.17621). Grant et al. (2025) (arXiv:2511.04638). Hu et al. (2025) HARP (arXiv:2509.11536). Shapiro et al. (2026) HALT (arXiv:2602.02888). Johnson & Lindenstrauss (1984) Extensions of Lipschitz mappings into a Hilbert space.

§1 Motivation

§2 Related Work

Intervention Tooling

Representation Engineering and Steering

LLM Stability

Recent Developments (2025–2026)

The Gap

§3 Architecture Overview

§4 SeedCache: Deterministic Branchpointing

§5 The Divergence Signal

Projection

VAR(1) Dynamics

Divergence Score

§6 Supplementary Diagnostics

Spectral Diagnostics

Windowed SVD

Layer Stiffness

§7 The Hysteresis Protocol

§8 Intervention Engine

Intervention Types

Trajectory Comparison

§9 Adaptive Controller

Composite Score

Control Logic

Shadow Mode

§10 Experimental Results: Controller Arc on Qwen3-1.7B

F4 — The original closed-loop A/B test

F17 — Why scaling at the final layer has zero effect

F18 — Additive intervention is the right actuator class at L=−1

F21 — Random additive in a closed loop is net-zero

F23 / F24 — Drift-opposing direction helps marginally and unreliably

F25 — Branchpoint hijacking: the actual mechanism behind F23/F24's gains

F26 — The measurement-layer confound

F27 — Layer-move is falsified once measured correctly

F28 — Synthesis: divergence measures prose surprise, not dynamical instability

§11 Branchpoint Hijacking: A Positive Finding

F25 — The mechanism, observed

F29 — Architectural generalization vs. consequence specificity

F31 — Branchpoint flippability is predictable from the clean trajectory

§12 The Mapping Program

Three open questions

Controller-return criteria

§13 Reproducibility Infrastructure

§14 Limitations and Open Questions

Development Context

Future Work: What v1's Validation Roadmap Became

§15 Conclusion