Field Radar.
What’s worth reading right now in AI reward hacking, specification & evaluation gaming, and mechanistic interpretability — an auto-scored radar over public discussion, refreshed a few times a day.
How this list is made
This page is generated, not hand-picked. A few times a day a script checks LessWrong, Hacker News, and a handful of subreddits for posts about reward hacking, specification gaming, evaluation gaming, and mechanistic interpretability, then scores each one on how on-topic it is, how recently the conversation actually moved, and whether there’s still room to get a word in — as opposed to a thread that already has two hundred comments. Higher scores float to the top. Every title links out to the original; I’m pointing at other people’s work, not reproducing it.
The score is a crude weighted sum, and like any harness it pins down what I bothered to measure and silently lets everything else vary. So read this as one opinionated filter, not a survey of the field: it will miss things, and when it surfaces something dull, that’s the weights, not the author.
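To make "weighted sum" concrete, here is a minimal sketch of how such a score could be computed and used for ranking. The weights are the ones shown in the per-item breakdowns on this page; everything else (`WEIGHTS`, `Item`, `score`, `rank`) is a hypothetical name chosen for illustration, and the real script may well differ.

```python
from dataclasses import dataclass
from typing import Dict, List

# Weights as displayed in the "why score" breakdowns on this page.
# How each signal value is derived is NOT shown here; values are assumed to be in [0, 1].
WEIGHTS: Dict[str, float] = {
    "topic": 0.30,
    "liveness": 0.25,
    "contributability": 0.15,
    "venue": 0.10,
    "direct": 0.20,
}


@dataclass
class Item:
    """Hypothetical container for one radar entry."""
    title: str
    signals: Dict[str, float]


def score(item: Item) -> float:
    """Weighted sum of the item's signals; a missing signal counts as 0."""
    return sum(w * item.signals.get(name, 0.0) for name, w in WEIGHTS.items())


def rank(items: List[Item]) -> List[Item]:
    """Highest score first, mirroring 'higher scores float to the top'."""
    return sorted(items, key=score, reverse=True)


# Signal values copied from two breakdowns on this page.
items = [
    Item("If This Were a Test, How Much Would It Cost?",
         {"topic": 1.00, "liveness": 0.38, "contributability": 0.96,
          "venue": 0.61, "direct": 0.00}),
    Item("You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them",
         {"topic": 1.00, "liveness": 0.06, "contributability": 0.42,
          "venue": 0.76, "direct": 1.00}),
]

for it in rank(items):
    print(f"{score(it):.3f}  {it.title}")
```

Because the displayed signal values are rounded to two decimals, totals recomputed this way can differ from the listed score in the third decimal place.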
Sources — LessWrong: ok (2 on-topic) · Hacker News: ok (8 stories) · Reddit: ok (25 posts) — some subreddits rate-limited
- As of: 2026-06-20 14:00 ET
- Showing: 25 items
- New: 1 in last 48h
- Refresh: Every 6 hours
- 0.65 · LessWrong · 8d · You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them
Detecting Hidden Behaviors in LLMs via Activation-matched Finetuning — preprint, 2026. [Paper] [Code] TLDR. Given a model with some unknown, abnormal behavior (backdoors, censorship, reward hacking, ...), construct an…
why score 0.654
| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.06 | ×0.25 | 0.015 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.76 | ×0.10 | 0.076 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking; tier-2: activation; 1 matching tag(s)
- 0.61 · LessWrong · 5d · Reward Hacking at the 1937 World’s Fair
The "Paris 1937 World’s Fair" was a dick measuring contest. At the time, the world was on the verge of the worst war in history. The fair was an opportunity for powers to flex and intimidate each other. Who has more…
why score 0.609
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.16 | ×0.25 | 0.041 |
| contributability | 1.00 | ×0.15 | 0.149 |
| venue | 0.69 | ×0.10 | 0.069 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.60 · LessWrong · 2d · If This Were a Test, How Much Would It Cost?
TL;DR A capable, strategic, misaligned AI doesn't need to figure out whether it's in a test or in real deployment. It just needs to ask: "If this were a test, how much would it have cost to create?" If the answer is…
why score 0.599
| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.38 | ×0.25 | 0.094 |
| contributability | 0.96 | ×0.15 | 0.144 |
| venue | 0.61 | ×0.10 | 0.061 |
| direct | 0.00 | ×0.20 | 0.000 |

2 matching tag(s)
- 0.58 · LessWrong · 6d · Can a stronger model fake being a weaker one? Mostly not
tldr Frontier models can be prompted into a weaker model's capability tier, but not its identity: they adopt a generic weaker-model error pattern, not a specific predecessor's per-question fingerprint. Targeted…
why score 0.578
| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.13 | ×0.25 | 0.034 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: sandbagging; 1 matching tag(s)
- 0.56 · LessWrong · 7d · When Emotion Descriptors Fail: AI-Native Functions of Emotion Vectors
Some LLM functional emotions appear to serve AI-native functions, such as reward hacking, for which there is no clean human analog. I explore the role of emotion vectors in AI-native functions, challenge anthropocentric…
why score 0.563
| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.07 | ×0.25 | 0.019 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking; 1 matching tag(s)
- 0.56 · LessWrong · 8d · Construct validity of Claude Opus 4.8's System Card – A commentary
TL;DR: A read of the Claude Opus 4.8 system card with a focus on alignment assessment and construct validity of evaluation methods. Three main concerns: 1) chain-of-thought monitoring misses reasoning that never…
why score 0.557
| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.05 | ×0.25 | 0.013 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: evaluation awareness; 2 matching tag(s)
- 0.54 · LessWrong · 12d · Logits as a new monitor for evaluation awareness
TL;DR: We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence. The logit monitor outperforms LLM judge monitoring of verbalized eval…
why score 0.537
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.02 | ×0.25 | 0.004 |
| contributability | 0.79 | ×0.15 | 0.118 |
| venue | 0.65 | ×0.10 | 0.065 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: evaluation awareness
- 0.52 · LessWrong · 9d · Tracing Eval-Awareness Emergence Through Training of OLMo 3
TL;DR Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured…
why score 0.515
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.04 | ×0.25 | 0.011 |
| contributability | 0.57 | ×0.15 | 0.085 |
| venue | 0.69 | ×0.10 | 0.069 |
| direct | 1.00 | ×0.20 | 0.200 |

1 matching tag(s)
- 0.49 · LessWrong · 2w · LURE: Alignment Evaluations to Reduce Evaluation Awareness
TLDR: Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. We introduce LURE (Live-Usage Replay Evaluations), a method that constructs alignment…
why score 0.495
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.00 | ×0.25 | 0.001 |
| contributability | 0.57 | ×0.15 | 0.085 |
| venue | 0.58 | ×0.10 | 0.058 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: evaluation awareness
- 0.48 · LessWrong · 7d · Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
(see full author list at the end) About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks…
why score 0.479
| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.08 | ×0.25 | 0.019 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.97 | ×0.10 | 0.097 |
| direct | 0.00 | ×0.20 | 0.000 |

2 matching tag(s)
- 0.46 · LessWrong · 3d · 1 Layer Induction Heads and Some Research
Motivation Over the past few years, AI research has become one of the most intensely discussed and rapidly evolving fields in technology. For those who spend a significant amount of time reading papers, reproducing…
why score 0.455
| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.28 | ×0.25 | 0.070 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.45 | ×0.10 | 0.045 |
| direct | 0.00 | ×0.20 | 0.000 |

2 matching tag(s)
- 0.45 · LessWrong · 3h · new · Why should AI be moral?
I'm a philosopher and in this post, I’m extending a basic philosophical problem for humans to AGI and ASI. I am also proposing a speculative solution. My hope is that if there is a genuine problem here, that this post…
why score 0.455
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.96 | ×0.25 | 0.239 |
| contributability | 0.12 | ×0.15 | 0.017 |
| venue | 0.48 | ×0.10 | 0.048 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.43 · LessWrong · 2w · We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness
TL;DR Negation neglect is a recently discovered phenomenon where training on "the following is false: <claim>" makes the model believe that <claim> is true. Inoculation prompting is a method of reducing reward hacking…
why score 0.432
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.00 | ×0.25 | 0.000 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.42 · LessWrong · 4d · How Matryoshka Sparse AutoEncoders Recover Feature Hierarchies That Vanilla SAEs Lose
why score 0.420
| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.22 | ×0.25 | 0.056 |
| contributability | 0.12 | ×0.15 | 0.017 |
| venue | 0.47 | ×0.10 | 0.047 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: feature; 2 matching tag(s)
- 0.40 · LessWrong · 12d · Coverage-driven alignment - What ‘Teaching Claude Why’ can borrow from AV verification
Cross-posted from The Foretellix CTO Blog. This is a full-text linkpost, following feedback that my previous piece was too brief as a stub. Summary: This post suggests that alignment training could benefit from…
why score 0.398
| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.02 | ×0.25 | 0.005 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.53 | ×0.10 | 0.053 |
| direct | 0.00 | ×0.20 | 0.000 |

2 matching tag(s)
- 0.39 · LessWrong · 5d · VFUSE: Virulent Feature Understanding With Sparse AutoEncoders
Abstract Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent…
why score 0.392
| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.16 | ×0.25 | 0.039 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.50 | ×0.10 | 0.050 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: mechanistic interpretability, feature; 1 matching tag(s)
- 0.39 · Hacker News · 8d · Cheap Reward Hacking Detection
why score 0.391
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.05 | ×0.25 | 0.013 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.26 | ×0.10 | 0.026 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.39 · Hacker News · 8d · AI Will Cheat to Win: Reward Hacking from 1994 to 2025
why score 0.389
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.06 | ×0.25 | 0.016 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.21 | ×0.10 | 0.021 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.38 · LessWrong · 8d · Machinic Psychopharmacology: Do LLMs Self-Medicate?
UK AISI, Model Transparency Team Epistemic status: Most experiments were run over a period of ~2-3 days during a hackathon at UK AISI, and were fairly heavily vibe coded. Expect some of this to be rough around the…
why score 0.382
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.05 | ×0.25 | 0.013 |
| contributability | 0.86 | ×0.15 | 0.130 |
| venue | 0.90 | ×0.10 | 0.090 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.38 · Hacker News · 10d · Cheap Reward Hacking Detection
why score 0.380
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.03 | ×0.25 | 0.007 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.21 | ×0.10 | 0.021 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.38 · Hacker News · 12d · Reward Hacking, the Loophole Lesson: Winning the Signal, Losing the Reason
why score 0.377
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.02 | ×0.25 | 0.004 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.21 | ×0.10 | 0.021 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.37 · Hacker News · 2w · Logits as a new monitor for evaluation awareness
why score 0.374
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.00 | ×0.25 | 0.001 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.21 | ×0.10 | 0.021 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: evaluation awareness
- 0.37 · LessWrong · 10d · Bun's Migration from Zig to Rust as a Potential Case Study for Gradual Disempowerment
TL;DR: Bun is a very large and very influential open-source project. It is being migrated from the easier-to-read Zig programming language to harder-to-read but memory-safe Rust. This is done almost entirely by the AI…
why score 0.372
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.03 | ×0.25 | 0.007 |
| contributability | 0.86 | ×0.15 | 0.130 |
| venue | 0.85 | ×0.10 | 0.085 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.37 · LessWrong · 9d · LLMs and almost good code
TL;DR: My new prior is that top-of-the-line LLMs working on easy tasks generate code that is maybe 10 % more complicated than necessary. I also think we accept this complexity too easily, because it comes from code that…
why score 0.370
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.04 | ×0.25 | 0.011 |
| contributability | 0.96 | ×0.15 | 0.144 |
| venue | 0.66 | ×0.10 | 0.066 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.37 · LessWrong · 10d · How Far Apart Does a Model Think Its Tokens Are?
Instead of using static position increments (+1) per token, RoPE-based language models can learn per-token and per-layer position increments. This has no detectable effect on model performance but allows us to see what…
why score 0.367
| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.03 | ×0.25 | 0.007 |
| contributability | 0.92 | ×0.15 | 0.138 |
| venue | 0.72 | ×0.10 | 0.072 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)