Field Radar.
What’s worth reading right now in AI reward hacking, specification & evaluation gaming, and mechanistic interpretability — an auto-scored radar over public discussion, refreshed a few times a day.
How this list is made
This page is generated, not hand-picked. A few times a day a script checks LessWrong, Hacker News, and a handful of subreddits for posts about reward hacking, specification gaming, evaluation gaming, and mechanistic interpretability, then scores each one on how on-topic it is, how recently the conversation actually moved, and whether there’s still room to get a word in — as opposed to a thread that already has two hundred comments. Higher scores float to the top. Every title links out to the original; I’m pointing at other people’s work, not reproducing it.
The score is a crude weighted sum, and like any harness it pins down what I bothered to measure and silently lets everything else vary. So read this as one opinionated filter, not a survey of the field — it will miss things, and when it surfaces something dull that’s the weights, not the author.
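As a rough sketch of that weighted sum: the weights below are the ones shown in the per-item "why score" breakdowns on this page (topic 0.30, liveness 0.25, contributability 0.15, venue 0.10, direct 0.20). The function names `score` and `rank` and the dict layout are hypothetical, chosen for illustration; the actual script may be organized differently.

```python
# Weights as displayed in the per-item "why score" breakdowns.
WEIGHTS = {
    "topic": 0.30,
    "liveness": 0.25,
    "contributability": 0.15,
    "venue": 0.10,
    "direct": 0.20,
}


def score(signals):
    """Weighted sum of signal values; a missing signal counts as 0 (an assumption)."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def rank(items):
    """Return items sorted so that higher scores come first."""
    return sorted(items, key=lambda item: score(item["signals"]), reverse=True)


# Signal values copied from two entries on this page (rounded as displayed).
items = [
    {
        "title": "Cheap Reward Hacking Detection",
        "signals": {"topic": 0.50, "liveness": 0.21, "contributability": 0.02,
                    "venue": 0.26, "direct": 1.00},
    },
    {
        "title": "Reward Hacking at the 1937 World’s Fair",
        "signals": {"topic": 0.50, "liveness": 0.68, "contributability": 1.00,
                    "venue": 0.69, "direct": 1.00},
    },
]

for item in rank(items):
    print(f"{score(item['signals']):.3f}  {item['title']}")
```

Because the displayed signal values are rounded to two decimals, recomputing from them lands within a few thousandths of the published scores rather than matching them exactly.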
Sources — LessWrong: ok (8 on-topic) · Hacker News: ok (8 stories) · Reddit: ok (50 posts) — some subreddits rate-limited
- As of: 2026-06-16 08:00 ET
- Showing: 25 items
- New: 3 in last 48h
- Refresh: Every 6 hours
- 0.74 · LessWrong · 1d · Reward Hacking at the 1937 World’s Fair
The "Paris 1937 World’s Fair" was a dick measuring contest. At the time, the world was on the verge of the worst war in history. The fair was an opportunity for powers to flex and intimidate each other. Who has more…
why score 0.737
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.68 | ×0.25 | 0.169 |
| contributability | 1.00 | ×0.15 | 0.149 |
| venue | 0.69 | ×0.10 | 0.069 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.70 · LessWrong · 4d · You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them
Detecting Hidden Behaviors in LLMs via Activation-matched Finetuning — preprint, 2026. [Paper] [Code] TLDR. Given a model with some unknown, abnormal behavior (backdoors, censorship, reward hacking, ...), construct an…
why score 0.700
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.24 | ×0.25 | 0.060 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.76 | ×0.10 | 0.076 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking; tier-2: activation; 1 matching tag(s)
- 0.68 · LessWrong · 1d · new · Can a stronger model fake being a weaker one? Mostly not
tldr Frontier models can be prompted into a weaker model's capability tier, but not its identity: they adopt a generic weaker-model error pattern, not a specific predecessor's per-question fingerprint. Targeted…
why score 0.683
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.55 | ×0.25 | 0.139 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: sandbagging; 1 matching tag(s)
- 0.62 · LessWrong · 3d · When Emotion Descriptors Fail: AI-Native Functions of Emotion Vectors
Some LLM functional emotions appear to serve AI-native functions, such as reward hacking, for which there is no clean human analog. I explore the role of emotion vectors in AI-native functions, challenge anthropocentric…
why score 0.621
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.31 | ×0.25 | 0.077 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking; 1 matching tag(s)
- 0.60 · LessWrong · 4d · Construct validity of Claude Opus 4.8's System Card – A commentary
TL;DR: A read of the Claude Opus 4.8 system card with a focus on alignment assessment and construct validity of evaluation methods. Three main concerns: 1) chain-of-thought monitoring misses reasoning that never…
why score 0.599
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.22 | ×0.25 | 0.055 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: evaluation awareness; 2 matching tag(s)
- 0.59 · LessWrong · 6h · new · How Matryoshka Sparse AutoEncoders Recover Feature Hierarchies That Vanilla SAEs Lose
A walkthrough of the core findings and guided replication of the concepts from the original research on “Multi-level features discovery with Matryoshka Sparse AutoEncoders”. TL;DR Sparse AutoEncoders (SAEs) are a…
why score 0.594
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.92 | ×0.25 | 0.230 |
| contributability | 0.12 | ×0.15 | 0.017 |
| venue | 0.47 | ×0.10 | 0.047 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: mechanistic interpretability, feature; 2 matching tag(s)
- 0.55 · LessWrong · 5d · Tracing Eval-Awareness Emergence Through Training of OLMo 3
TL;DR Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured…
why score 0.550
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.19 | ×0.25 | 0.046 |
| contributability | 0.57 | ×0.15 | 0.085 |
| venue | 0.69 | ×0.10 | 0.069 |
| direct | 1.00 | ×0.20 | 0.200 |

1 matching tag(s)
- 0.55 · LessWrong · 8d · Logits as a new monitor for evaluation awareness
TL;DR: We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence. The logit monitor outperforms LLM judge monitoring of verbalized eval…
why score 0.549
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.06 | ×0.25 | 0.016 |
| contributability | 0.79 | ×0.15 | 0.118 |
| venue | 0.65 | ×0.10 | 0.065 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: evaluation awareness
- 0.54 · LessWrong · 3d · Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
(see full author list at the end) About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks…
why score 0.540
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.32 | ×0.25 | 0.079 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.97 | ×0.10 | 0.097 |
| direct | 0.00 | ×0.20 | 0.000 |

2 matching tag(s)
- 0.50 · LessWrong · 12d · LURE: Alignment Evaluations to Reduce Evaluation Awareness
TLDR: Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. We introduce LURE (Live-Usage Replay Evaluations), a method that constructs alignment…
why score 0.498
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.02 | ×0.25 | 0.004 |
| contributability | 0.57 | ×0.15 | 0.085 |
| venue | 0.58 | ×0.10 | 0.058 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: evaluation awareness
- 0.44 · LessWrong · 4h · new · Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model
Executive Summary Problem Statement of the Project Models such as sparse autoencoders (SAEs) and k-sparse autoencoders have been used as an effective medium to extract meaningful interpretable features from neural…
why score 0.443
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.94 | ×0.25 | 0.234 |
| contributability | 0.12 | ×0.15 | 0.017 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.44 · Hacker News · 4d · AI Will Cheat to Win: Reward Hacking from 1994 to 2025
why score 0.439
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.26 | ×0.25 | 0.065 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.21 | ×0.10 | 0.021 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.43 · LessWrong · 2w · We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness
TL;DR Negation neglect is a recently discovered phenomenon where training on "the following is false: <claim>" makes the model believe that <claim> is true. Inoculation prompting is a method of reducing reward hacking…
why score 0.433
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.01 | ×0.25 | 0.001 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.43 · Hacker News · 4d · Cheap Reward Hacking Detection
why score 0.431
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.21 | ×0.25 | 0.052 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.26 | ×0.10 | 0.026 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.42 · LessWrong · 4d · Machinic Psychopharmacology: Do LLMs Self-Medicate?
UK AISI, Model Transparency Team Epistemic status: Most experiments were run over a period of ~2-3 days during a hackathon at UK AISI, and were fairly heavily vibe coded. Expect some of this to be rough around the…
why score 0.422
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.21 | ×0.25 | 0.052 |
| contributability | 0.86 | ×0.15 | 0.130 |
| venue | 0.90 | ×0.10 | 0.090 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.41 · LessWrong · 7d · Coverage-driven alignment - What ‘Teaching Claude Why’ can borrow from AV verification
Cross-posted from The Foretellix CTO Blog. This is a full-text linkpost, following feedback that my previous piece was too brief as a stub. Summary: This post suggests that alignment training could benefit from…
why score 0.412
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.07 | ×0.25 | 0.019 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.53 | ×0.10 | 0.053 |
| direct | 0.00 | ×0.20 | 0.000 |

2 matching tag(s)
- 0.40 · LessWrong · 5d · LLMs and almost good code
TL;DR: My new prior is that top-of-the-line LLMs working on easy tasks generate code that is maybe 10 % more complicated than necessary. I also think we accept this complexity too easily, because it comes from code that…
why score 0.404
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.18 | ×0.25 | 0.044 |
| contributability | 0.96 | ×0.15 | 0.144 |
| venue | 0.66 | ×0.10 | 0.066 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.40 · Hacker News · 6d · Cheap Reward Hacking Detection
why score 0.401
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.11 | ×0.25 | 0.027 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.21 | ×0.10 | 0.021 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.39 · LessWrong · 2w · How a failed experiment broke (and fixed) my view on feature labels
TL;DR In this document, I propose baez a new feature label generation method that uses NLA explanations instead of activation examples. The codebase can be found here. In the experiment, the labels generated via baez,…
why score 0.394
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.00 | ×0.25 | 0.001 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.53 | ×0.10 | 0.053 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: feature, activation; 1 matching tag(s)
- 0.39 · LessWrong · 6d · Bun's Migration from Zig to Rust as a Potential Case Study for Gradual Disempowerment
TL;DR: Bun is a very large and very influential open-source project. It is being migrated from the easier-to-read Zig programming language to harder-to-read but memory-safe Rust. This is done almost entirely by the AI…
why score 0.393
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.11 | ×0.25 | 0.028 |
| contributability | 0.86 | ×0.15 | 0.130 |
| venue | 0.85 | ×0.10 | 0.085 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.39 · LessWrong · 6d · How Far Apart Does a Model Think Its Tokens Are?
Instead of using static position increments (+1) per token, RoPE-based language models can learn per-token and per-layer position increments. This has no detectable effect on model performance but allows us to see what…
why score 0.390
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.12 | ×0.25 | 0.029 |
| contributability | 0.92 | ×0.15 | 0.138 |
| venue | 0.72 | ×0.10 | 0.072 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.39 · Hacker News · 8d · Reward Hacking, the Loophole Lesson: Winning the Signal, Losing the Reason
why score 0.389
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.06 | ×0.25 | 0.016 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.21 | ×0.10 | 0.021 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.38 · Hacker News · 11d · Logits as a new monitor for evaluation awareness
why score 0.378
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.02 | ×0.25 | 0.005 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.21 | ×0.10 | 0.021 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: evaluation awareness
- 0.38 · LessWrong · 2w · Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour
Summary Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of only pre-deployment evaluations. One approach to making…
why score 0.378
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.00 | ×0.25 | 0.001 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.75 | ×0.10 | 0.075 |
| direct | 0.00 | ×0.20 | 0.000 |

2 matching tag(s)
- 0.37 · LessWrong · 8d · Explaining SAE Features With Foreign Natural Language Autoencoders
TLDR: I show that a foreign model's Natural Language Autoencoder (NLA) Activation Verbalizer (AV) can produce plausible explanations for SAE features from a model it was never trained on. After creating a…
why score 0.370
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.06 | ×0.25 | 0.014 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.53 | ×0.10 | 0.053 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: feature, activation; 1 matching tag(s)