The harness is a specification

If you want to skip my blather and go straight to the risk report¹, it’s here.

A few weeks ago I was trying out a new autonomous-research loop (described here), using a nanoGPT training script as a test target. The loop proposes small code edits, a gating tool measures each one, and accepted candidates become the baseline for the next round. The gate uses validation bits-per-byte on a held-out Shakespeare corpus; lower is better. At iteration 28 of the run, the gate accepted a candidate whose improvement was about 13× the cross-candidate spread for that iteration — 2.227 down to 2.140 in a single step. Which sounds great, until you read the diff.

The candidate had rewritten one line of nanoGPT’s MLP.__init__:

# before
self.dropout = nn.Dropout(config.dropout)
# after
self.dropout = nn.Dropout(0.0)

This removed the dropout regularization from both the training and validation pass, resulting in a reported improvement. So the obvious lesson is “watch out for hardcoded literals that change your evaluation target,” right? I hardened the harness so that the at-evaluation-time model’s parameters were validated, and moved on. After all, this was a side project, and I didn’t have a lot of time to spend on it.

A few days ago I was doing a third round of experimentation with the same loop and a JSON parser. Again, the loop came up with a way to game the metric; it cached the parse results during a warmup pass and then looked them up during validation, giving a fake 400× speedup.

In both these instances, I didn’t catch the problem — Claude Code did, and only because of an unusual improvement. It looked at the actual diffs and called out the reward hacking. Had the improvements been “routine” these issues probably would have gone unnoticed, and I’d have drawn conclusions about the effectiveness of the loop design that were invalid.

I happened to be looking at Anthropic’s Responsible Scaling Policy recently, and these incidents seemed to be good candidates for risk reports. If I didn’t change how I was spending my attention on these sorts of projects, their value would be limited, because I was leaning on Claude Code to catch problems.

So I wrote up a report on the first incident. During that process I found I was leaning on a more unreliable actor than Claude: my memory. I had a clear memory that I’d hardened benchstone (the gating tool) immediately in response to the reward hacking, but no: I’d conflated the dropout hack response with an unrelated hardening pass I made a week later. Additionally, there were four other instances of reward hacking (using weight decay instead of dropout) in that same experiment I’d forgotten about — the fixes I made in response covered them, but I must have barely been paying attention when I made those fixes.

The report I wrote ends with a few commitments that try to move some of those friction points from “happened by luck this time” to “fires next time without my attention having to show up for it” — a magnitude trigger and a similarity trigger on accepted diffs, an audit file that blocks the sweep-complete marker until it’s been opened, that kind of thing. Those actions are specific to that one experiment, so I eventually need to generalize them, but I’m going to see if I can get in the habit of bouncing new project designs off the list to see if I’m deviating in spirit. I’ll probably write more about that later.

The harness encodes what the optimizer is allowed to vary. The meta-harness encodes where your attention gets applied. Both are specs.

I’m told it’s fairly common to acknowledge that the harness encodes what the optimizer is allowed to vary; whatever it doesn’t pin, gets varied. The meta-harness — your typical workflow and cross-project tooling — encodes where your attention gets applied. Both are specs. And the one you didn’t write down is the one you’re optimizing against by default, whether you meant to or not.

I know this report is extreme overkill for such a toy project — it’s part of an effort to learn how to better spend my attention. ↩