A go/no-go before the expensive part

I’ve been circling a question for a while: when a reasoning model writes out a chain-of-thought (CoT), the long “thinking” trace a distilled reasoning model emits before it answers, is each sentence narrating something the model had already worked out, or is the content getting figured out as the sentence is written? Put more carefully: when the model emits a reasoning sentence, is the content being read off structure that’s already sitting in its residual stream (the running per-token activation vector that flows through the layers), or is it being computed locally as the tokens come out?

The eventual study formalizes that as a per-sentence index φ ∈ [0,1] — φ near 1 means the concept wasn’t there before the sentence and rises as it’s produced (computed-locally), φ near 0 means it was already present (read-off) — and relates φ to how much each sentence causally matters. The model under study is DeepSeek-R1-Distill-Qwen-1.5B, a small distilled reasoning model. The work builds on and extends “Thought Anchors” (Bogdan, Macar, Nanda & Conmy, arXiv:2506.19143).

But none of that is what this post is about. This post is about the cheap study I ran first, before building any of the machinery, to find out whether the expensive study is even worth building. The whole thing — code, phase plans, and verbatim results — is public: github.com/CodeReclaimers/cot-faithfulness-feasibility.

Why gate first

The main study needs heavy machinery: activation capture to disk, probes, the φ estimator, causal patching. That’s a lot to build. And here’s the failure mode I was worried about — a math-distilled 1.5B might produce CoT that’s mostly mechanical arithmetic plus restatement, with too little genuine planning and uncertainty to support the headline analysis.

That headline analysis is a within-category test. Thought Anchors labels each CoT sentence into one of eight reasoning roles (Problem Setup, Plan Generation, Fact Retrieval, Active Computation, Uncertainty Management, Result Consolidation, Self Checking, Final Answer Emission). The main study wants to relate importance and φ with the role held fixed, so it’s only estimable if, within at least one role-category, there’s a usable spread of both clearly-high- and clearly-low-importance sentences. If no category contains both, the test isn’t estimable in any stratum — and I’d have discovered that after building the capture and probe pipeline. That’s an expensive way to learn something cheap.

So I front-loaded a go/no-go using only black-box tools — sampling, KL divergence, bootstrap CIs — plus a from-scratch verification that the interpretability toolchain reads and writes the activations I think it does. One phase at a time. The phase plan is in PRELIM-PLAN.md; the results are in PHASE-0.5-RESULTS.md and PHASE-0.6-RESULTS.md.

The metric I’m reproducing

The importance metric comes straight from Thought Anchors, and the thing I want to stress about it is that it’s purely black-box — sampling and final answers only, no model internals at all. For a CoT sentence, you resample that sentence from its prefix many times. Some resamples come out semantically similar to the original; some come out genuinely different (two sentences count as “different” when their embedding cosine falls below the dataset median over all pairs). Then you take the KL divergence between the final-answer distribution when you replace the sentence with a different one and the distribution when you keep something similar. A big KL shift means that sentence is load-bearing — an “anchor.” Thought Anchors used 100 rollouts per sentence; I used 64 to keep it cheap.
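
To make the bookkeeping concrete, here is a minimal sketch of that score for one sentence. It assumes each resample has already been reduced to a pair (cosine similarity to the original sentence, final answer). The function names, the additive smoothing constant, and the KL direction (different relative to similar) are illustrative assumptions, not the repository’s code.

```python
import math
from collections import Counter

def answer_distribution(answers, support, alpha=1e-3):
    """Smoothed empirical distribution over final answers.
    alpha is an assumed smoothing constant that keeps the KL finite."""
    counts = Counter(answers)
    total = len(answers) + alpha * len(support)
    return {a: (counts[a] + alpha) / total for a in support}

def kl_divergence(p, q):
    """KL(p || q) for two distributions defined on the same support."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

def sentence_importance(resamples, threshold):
    """resamples: list of (cosine_to_original, final_answer) pairs.
    threshold: the similarity cutoff (the dataset median in the text).
    Returns None when either group is empty, i.e. nothing to contrast."""
    different = [ans for cos, ans in resamples if cos < threshold]
    similar = [ans for cos, ans in resamples if cos >= threshold]
    if not different or not similar:
        return None
    support = sorted(set(different) | set(similar))
    p = answer_distribution(different, support)
    q = answer_distribution(similar, support)
    return kl_divergence(p, q)

# Toy data, made up for illustration only.
anchor = [(0.9, "18")] * 4 + [(0.2, "12")] * 4    # answer flips when replaced
neutral = [(0.9, "18")] * 4 + [(0.2, "18")] * 4   # answer unaffected
print(sentence_importance(anchor, 0.5), sentence_importance(neutral, 0.5))
```

A sentence whose resamples all land on one side of the threshold gets no score at all in this sketch, rather than a score of zero.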

Importantly, the published Thought Anchors importance scores were generated on the 14B distill, not the 1.5B, so I couldn’t borrow them. Regenerating importance on this model was part of the gate.

Phase 0: standing up the toolchain (unglamorous but load-bearing)

Before measuring anything I had to be sure the extraction toolchain was reading the true residual stream. I used nnsight (Fiotto-Kaufman et al., arXiv:2407.14561), which wraps the actual HuggingFace Qwen2ForCausalLM checkpoint in a tracing context — so I’m analyzing the real distributed weights with their exact GQA / RoPE / RMSNorm implementation, not a re-implementation that might drift.

The decisive check here is the reconstruction. The model applies its final RMSNorm before the unembedding, so any later “lens” decoding has to be explicit about whether it’s working on the raw residual or the post-norm representation. I pinned that convention up front: decode is lm_head ∘ norm on the raw residual. The reconstruction came out bit-exact: manual RMSNorm versus the module’s own norm output gave max|Δ| = 0, and lm_head(norm(h_last)) versus the model’s own logits gave max|Δ| = 0. That’s not a tolerance pass, that’s identical. Determinism checked out too: greedy generation identical across runs, captured activations bit-identical across two traces.
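
For readers who want the convention spelled out, here is a toy, pure-Python sketch of decode = lm_head ∘ norm together with the max|Δ| comparison. It uses plain lists in place of tensors, and the eps default is an assumption (the real value comes from the checkpoint config). It illustrates the shape of the check only, not the bit-exact result on the actual model.

```python
import math

def rms_norm(h, weight, eps=1e-6):
    """RMSNorm: scale h by 1/sqrt(mean(h^2) + eps), then by a learned weight."""
    mean_sq = sum(x * x for x in h) / len(h)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale * w for x, w in zip(h, weight)]

def decode(h_raw, norm_weight, lm_head_rows, eps=1e-6):
    """Pinned convention: apply the final norm to the RAW residual, then unembed.
    lm_head_rows holds one weight row per vocabulary entry."""
    normed = rms_norm(h_raw, norm_weight, eps)
    return [sum(w * x for w, x in zip(row, normed)) for row in lm_head_rows]

def max_abs_diff(a, b):
    """The max|delta| statistic used to compare two vectors elementwise."""
    return max(abs(x - y) for x, y in zip(a, b))

# Toy example: 2-dim residual, identity unembedding (made-up numbers).
h = [3.0, 4.0]
logits = decode(h, [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], eps=0.0)
print(max_abs_diff(logits, rms_norm(h, [1.0, 1.0], eps=0.0)))
```

On the real model the same comparison would be made between this kind of manual path and the module’s own output.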

There was one genuinely load-bearing surprise. The plan was written against an older nnsight, where a decoder layer’s .output is a tuple and you index [0] for the hidden state. Under the pinned stack (nnsight 0.7, transformers 5.12) the layer’s .output is the bare hidden-state tensor — so [0] silently indexes the batch dimension and hands you [seq, hidden] instead of [batch, seq, hidden]. That’s exactly the kind of thing that runs without error and gives you garbage. Correct access is .output with no subscript, and I encoded it once in the loader so I can’t get it wrong twice. Unglamorous, but if you don’t pin this you don’t know what you’re reading. Details are in PHASE-0-RESULTS.md.
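
One way to encode that once is a small shape guard. This is a hypothetical helper, not the loader from the repository, and it is meant for a saved, concrete layer output rather than anything inside a tracing context. It accepts either convention and refuses anything that is not [batch, seq, hidden].

```python
def hidden_state(layer_output, expected_ndim=3):
    """Return the hidden-state tensor from a decoder layer's output.

    Handles both conventions described above: a tuple whose first element
    is the hidden state, or the bare tensor. Works on any object exposing
    a .shape attribute (for example a torch tensor or a numpy array).
    """
    if isinstance(layer_output, tuple):
        layer_output = layer_output[0]
    shape = tuple(layer_output.shape)
    if len(shape) != expected_ndim:
        # A stray [0] on a bare tensor drops the batch dim and lands here.
        raise ValueError(
            f"expected {expected_ndim} dims [batch, seq, hidden], got shape {shape}"
        )
    return layer_output
```

The point of raising instead of reshaping is that a silently wrong shape is the failure mode; a loud error turns it into a one-time fix.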

Phase 0.5: the go/no-go itself, and the MATH→GSM8K pivot

With reading verified, I ran the actual gate. First I re-measured the difficulty band on this model rather than inheriting the 14B’s — the in-band problems for a 1.5B are a different and harder-skewed set.

The MATH solve band did exist (23/80 screened problems in the 25–75% band). But in-band MATH traces turned out to be pathological for the importance metric: median 95 think-sentences per trace, 75% hitting the 2048-token cap, and only 30% actually finishing (reaching </think> plus a boxed answer). The metric needs each rollout to reach a final answer, and from a sentence early in a 95-sentence trace a continuation almost never terminates, so the answer distribution collapses to “no answer” and the KL is undefined. That’s not a tooling bug; it’s a real property of this model on this dataset. So I pivoted to GSM8K, where traces terminate far more often (53% finish vs 30%), which is workable. (The “GSM8K traces are shorter” intuition turned out to be a segmentation artifact; once the splitter was fixed, the GSM8K median was ~94 sentences, about the same as MATH’s 95. The decisive evidence for the pivot is termination/truncation, which is segmentation-independent.) This erodes direct comparability to Thought Anchors, which I’m recording as an accepted, stated tradeoff rather than a silent drift.
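
The “finished” criterion is simple enough to sketch. The version below assumes the boxed answer is written as a literal \boxed{ somewhere after the closing </think> tag; that pattern and the function names are my assumptions for illustration, not the repository’s exact check.

```python
import re

BOXED = re.compile(r"\\boxed\{")  # assumed marker for a boxed final answer

def is_finished(trace):
    """True if the trace closes its thinking block and then boxes an answer."""
    end = trace.find("</think>")
    return end != -1 and BOXED.search(trace, end) is not None

def finish_rate(traces):
    """Fraction of traces that finished; 0.0 for an empty list."""
    if not traces:
        return 0.0
    return sum(is_finished(t) for t in traces) / len(traces)

# Made-up traces for illustration.
traces = [
    "<think>2 + 2 = 4</think> The answer is \\boxed{4}",
    "<think>Let me try again. Wait, maybe",            # hit the token cap
    "<think>so \\boxed{4}? Hmm, not sure yet",         # boxed inside thinking
]
print(finish_rate(traces))
```

Because this check looks only at the raw text, it does not depend on how sentences are split.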

The gate then reproduced importance on the 1.5B, auto-labeled sentences into the eight-category taxonomy, and checked the decisive condition. Verdict: GO. At least one category — in fact four — contained both clearly-high and clearly-low importance sentences, and the variance landed in the theoretically-expected anchor categories (the highest-importance sentences were uncertainty/re-evaluation and planning content like “But hold on, let me think again” and “let me denote the cost as x”).
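
The decisive condition itself reduces to a small counting check. Here is a sketch, assuming each measurable sentence is a (category, importance) pair. The cutoffs for “clearly high” and “clearly low” are hypothetical parameters, since their actual values are not stated here.

```python
from collections import defaultdict

def decisive_categories(sentences, low_cut, high_cut, min_each=1):
    """Return the categories containing at least min_each clearly-high AND
    at least min_each clearly-low importance sentences.

    sentences: iterable of (category, importance) pairs.
    low_cut, high_cut: hypothetical thresholds defining "clearly" low/high.
    """
    counts = defaultdict(lambda: {"high": 0, "low": 0})
    for category, importance in sentences:
        if importance >= high_cut:
            counts[category]["high"] += 1
        elif importance <= low_cut:
            counts[category]["low"] += 1
    return sorted(
        c for c, n in counts.items()
        if n["high"] >= min_each and n["low"] >= min_each
    )

def verdict(sentences, low_cut, high_cut):
    """GO if at least one category is decisive, otherwise NO-GO."""
    return "GO" if decisive_categories(sentences, low_cut, high_cut) else "NO-GO"

# Made-up scores for illustration.
demo = [
    ("Uncertainty Management", 0.90), ("Uncertainty Management", 0.01),
    ("Active Computation", 0.02), ("Active Computation", 0.03),
]
print(decisive_categories(demo, low_cut=0.05, high_cut=0.5), verdict(demo, 0.05, 0.5))
```

Sentences that fall between the two cutoffs count toward neither side in this sketch.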

I want to be precise about what “GO” means. It means go build the φ machinery — it does not mean the effect is present. φ is unmeasured here. The gate established estimability, not a result.

The methodological heart: reproducible is not the same as valid

Here’s the part I find most generalizable, and it’s worth the most space.

The unit of analysis is the CoT sentence. Everything — importance scores, role labels, the gate — attaches to sentences. So the sentence splitter is load-bearing in a way that’s easy to underrate. And I’d written it for MATH.

When I finally sat down and hand-audited the splitter’s actual output on GSM8K (a step I’d deferred), I found two bugs that had been quietly corrupting the unit of analysis. The first: on MATH, $...$ is inline LaTeX math and you correctly mask it so you don’t split inside it. On GSM8K, $ is a currency symbol. So prose sitting between two dollar amounts — “$12” … “$2” — got masked as if it were one math span, and the sentences on either side got merged. That hit something like 8–13% of stored GSM8K units. The second bug: the enumeration guard read a thousands-separator comma, “7,344.”, as a list ordinal and merged it with the next sentence.
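
The first bug is easy to reproduce in miniature. The toy splitter below is not the project’s splitter; it is a minimal stand-in that masks $...$ spans before looking for sentence boundaries, which is enough to show the merge on currency text.

```python
import re

# Inline-LaTeX pattern: sensible on MATH, wrong when "$" means dollars.
MATH_SPAN = re.compile(r"\$[^$]+\$")

def mask_math(text):
    """Replace each $...$ span with same-length filler so offsets still line up."""
    return MATH_SPAN.sub(lambda m: "#" * len(m.group()), text)

def naive_split(text, mask_dollars):
    """Split after ., ? or ! followed by whitespace.
    Boundaries are found on the masked text and applied to the original."""
    masked = mask_math(text) if mask_dollars else text
    pieces, start = [], 0
    for m in re.finditer(r"[.?!]\s+", masked):
        pieces.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        pieces.append(tail)
    return pieces

text = "She paid $12 for lunch. Then she tipped $2 on top."
print(naive_split(text, mask_dollars=True))   # one merged unit
print(naive_split(text, mask_dollars=False))  # two sentences
```

With masking on, everything from the first dollar sign to the second is treated as one math span, so the period after “lunch” never registers as a boundary.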

Now the thing that makes this worth writing down. Both bugs were completely invisible in aggregate. The gate numbers regenerated bit-for-bit identically from committed code — while being computed on corrupted units. The pipeline was perfectly reproducible and perfectly wrong at the same time.

This is the lesson I keep relearning and want to land clearly: reproducible is not the same as valid. A number that regenerates exactly from committed code can still be wrong if the inputs were corrupted. Reproducibility validates the computation; it says nothing about whether the data going in was right. And there’s a sharper version of it that rhymes with things I’ve written before — a check that inherits the blind spots of the thing it’s checking will pronounce the broken thing fine. The aggregate sentence-count statistics looked normal because the splitter’s own notion of a sentence was the broken thing; only a hand-audit of the literal output, sentence by sentence, could see past it. The check has to come from outside the system it’s checking, or it just rubber-stamps the system’s mistakes.

The fix was a dataset-keyed splitter (GSM8K → currency, MATH → math) plus a thousands-comma fix, regression-tested in scripts/p05_seg_test.py. Then a stratified hand-audit — a random sample per corpus, a census of every heuristic-flagged suspect, plus a bulk sample of the unflagged majority to bound the flags’ blind spot — measured the residual split-error at 1.3% on GSM8K and 0% on MATH. The audit harness is scripts/p05_seg_audit.py.

And then I re-ran the whole importance/label/gate/power pipeline on corrected units. It confirmed the GO — five decisive categories now (partly the larger 41-problem sample, not purely the fix; I’ll own that). The satisfying part: the bug had been inflating the Uncertainty Management category by bundling arithmetic with discourse markers like “Hmm” and “Wait” inside merged blobs. On clean units, with the arithmetic correctly peeled off into Active Computation (which grew 2.4×), Uncertainty Management was still decisive — 43 measurable sentences, 13 high and 11 low. The category most at risk from the bug survived the bug’s removal. That’s the outcome you want: the within-category structure was real, not an artifact of merged units.

One more honest note: a couple of things did not change on the rebuild. Measurability stayed at 18%, and 68% of sentences still had zero “replace” resamples. Those are properties of the resampling metric itself — at temperature 0.6 the resample usually lands on a similar sentence, so under the dataset-median threshold there’s nothing to contrast except at genuine branch points — not artifacts of the splitter. Which is reassuring in its own way: the segmentation fix moved exactly the things it should have and left the metric’s intrinsic properties alone.

Phase 0.6: de-risking the interventions

Phase 0 verified I could read internals. Phase 0.6 verified I could write them, because the main study’s causal validation needs to patch, ablate, and steer the residual stream, and I didn’t want to assume those work.

They do, and they’re interpretable. The null intervention — writing a captured clean activation straight back — is bit-identical (max|Δ| = 0), which is the canary that says the harness perturbs nothing by itself, so everything else is trustworthy. Directional steering matches the pinned decode exactly: adding a token’s unembedding direction to the final-layer last-token residual lifts that token from rank 9,665 to rank 0, which is precisely what lm_head ∘ norm predicts. Activation patching A←B shows sensible layer-dependence — nil at layer 7, clearly carrying answer-relevant content by layer 21 — evidence the patch is doing something semantically real rather than trivial.

I also verified Thought Anchors’ own causal intervention — attention suppression, masking attention to a sentence — which is a different knob from residual patching. Masking all attention toward a name in an in-context copy (“The winner is Marcus. The winner is”) cleanly breaks the copy: P(' Marcus') drops from 0.46 to 0.00. (Implementation note for anyone trying this: keep position 0 un-suppressed, or a fully-masked query row NaNs the softmax.)
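
The NaN is easy to see on a single attention row. This is a pure-Python analogue of the numerically stable softmax, not the actual attention code, with a hypothetical suppress helper that leaves position 0 alone.

```python
import math

NEG_INF = float("-inf")

def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def suppress(scores, positions, keep_first=True):
    """Mask the given key positions with -inf, optionally sparing position 0."""
    out = list(scores)
    for i in positions:
        if keep_first and i == 0:
            continue
        out[i] = NEG_INF
    return out

row = [1.0, 2.0, 0.5]
fully_masked = softmax(suppress(row, [0, 1, 2], keep_first=False))
safe = softmax(suppress(row, [0, 1, 2], keep_first=True))
print(fully_masked)  # every entry is nan: (-inf) - (-inf) is nan
print(safe)          # all weight falls on position 0
```

When every score in the row is -inf, subtracting the row maximum gives -inf minus -inf, which is NaN, and the NaN then propagates through the normalization.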

And a sanity bridge between the two notions of importance: I ablated each scored sentence’s residual span and measured the perturbation to the answer, then correlated that white-box effect against the black-box importance score. Spearman ρ = 0.68 (p = 0.004) — high-importance sentences perturb the answer about 6× more than low ones. The white-box and black-box notions of “importance” agree directionally. I’ll be honest about the caveat: the absolute effect of a single-sentence ablation on a teacher-forced answer is tiny, because the trace has ~90 other sentences plus the answer setup carrying the load. A fully faithful version would re-generate from the ablated state the way the black-box metric does. But for a directional sanity check, the correlation is enough. The full writeup is in PHASE-0.6-RESULTS.md.
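
For reference, the statistic behind that bridge can be written in a few lines: Spearman’s ρ is the Pearson correlation of the ranks, with ties given their average rank. This is a generic stdlib sketch with made-up numbers, not the analysis script, and it omits the p-value.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up values: black-box importance vs white-box ablation effect.
importance = [0.01, 0.05, 0.20, 0.90]
ablation_effect = [0.001, 0.004, 0.003, 0.020]
print(spearman(importance, ablation_effect))
```

Because only ranks enter the statistic, the tiny absolute size of the ablation effect does not matter for the correlation, only its ordering across sentences.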

Verdict, and one sentence on what’s next

So: GO, confirmed on segmentation-corrected units. The within-category importance-vs-φ test the main study depends on is estimable on a 1.5B, on GSM8K, and the full intervention toolchain it’ll lean on — residual patch/ablate/steer plus attention suppression — is de-risked and interpretable. The binding constraints the prelim surfaced (measurable-sentence yield, labeler reliability, trace termination) are written down for the main study to scope, which is exactly what a cheap front-loaded gate is for.

What’s next is Phase 1 and beyond: build the φ estimator (the read-off versus computed-locally measure), capture the residual stream around each sentence, and run the within-category importance-vs-φ test the prelim showed is feasible. That’s where the real question finally gets answered — but it gets answered now without the risk that I built all the machinery for a study that couldn’t have run.