Field Radar.

What’s worth reading right now in AI reward hacking, specification & evaluation gaming, and mechanistic interpretability — an auto-scored radar over public discussion, refreshed a few times a day.

How this list is made

This page is generated, not hand-picked. A few times a day a script checks LessWrong, Hacker News, and a handful of subreddits for posts about reward hacking, specification gaming, evaluation gaming, and mechanistic interpretability, then scores each one on how on-topic it is, how recently the conversation actually moved, and whether there’s still room to get a word in — as opposed to a thread that already has two hundred comments. Higher scores float to the top. Every title links out to the original; I’m pointing at other people’s work, not reproducing it.

The score is a crude weighted sum, and like any harness it pins down what I bothered to measure and silently lets everything else vary. So read this as one opinionated filter, not a survey of the field: it will miss things, and when it surfaces something dull, that’s the weights, not the author.
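As a rough sketch of that weighted sum: the weights below are copied from the "why score" breakdowns on this page, but the function name `radar_score` and the dictionary layout are assumptions for illustration, not the script that actually generates the list.

```python
# Weights copied from the "why score" breakdowns on this page.
WEIGHTS = {
    "topic": 0.30,
    "liveness": 0.25,
    "contributability": 0.15,
    "venue": 0.10,
    "direct": 0.20,
}


def radar_score(signals):
    """Hypothetical helper: weighted sum of signal values, each in [0, 1].

    Returns (total, points), where points maps each signal to value * weight.
    """
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    points = {name: signals[name] * weight for name, weight in WEIGHTS.items()}
    return sum(points.values()), points


# Signal values as displayed for item 5 in the list below.
item_5 = {
    "topic": 1.00,
    "liveness": 0.22,
    "contributability": 0.02,
    "venue": 0.41,
    "direct": 1.00,
}
total, points = radar_score(item_5)
print(f"{total:.3f}")  # 0.599
```

The displayed signal values are rounded to two decimals, so recomputing from them can differ from the listed points in the last digit. Item 1, for example, lists liveness 0.68 × 0.25 as 0.169, where the rounded value alone would give 0.170.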

Sources — LessWrong: ok (8 on-topic) · Hacker News: ok (8 stories) · Reddit: ok (50 posts) — some subreddits rate-limited

As of: 2026-06-16 08:00 ET · Showing: 25 items · New: 3 in last 48h · Refresh: every 6 hours

  1. 0.74 · LessWrong · 1d
    Reward Hacking at the 1937 World’s Fair

    The "Paris 1937 World’s Fair" was a dick measuring contest. At the time, the world was on the verge of the worst war in history. The fair was an opportunity for powers to flex and intimidate each other. Who has more…

    why score 0.737
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.68   × 0.25     0.169
    contributability  1.00   × 0.15     0.149
    venue             0.69   × 0.10     0.069
    direct            1.00   × 0.20     0.200

    tier-1: reward hacking

  2. 0.70 · LessWrong · 4d
    You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them

    Detecting Hidden Behaviors in LLMs via Activation-matched Finetuning — preprint, 2026. [Paper] [Code] TLDR. Given a model with some unknown, abnormal behavior (backdoors, censorship, reward hacking, ...), construct an…

    why score 0.700
    signal            value    weight   points
    topic             1.00   × 0.30     0.300
    liveness          0.24   × 0.25     0.060
    contributability  0.42   × 0.15     0.064
    venue             0.76   × 0.10     0.076
    direct            1.00   × 0.20     0.200

    tier-1: reward hacking; tier-2: activation; 1 matching tag(s)

  3. 0.68 · LessWrong · 1d · new
    Can a stronger model fake being a weaker one? Mostly not

    tldr Frontier models can be prompted into a weaker model's capability tier, but not its identity: they adopt a generic weaker-model error pattern, not a specific predecessor's per-question fingerprint. Targeted…

    why score 0.683
    signal            value    weight   points
    topic             1.00   × 0.30     0.300
    liveness          0.55   × 0.25     0.139
    contributability  0.02   × 0.15     0.003
    venue             0.41   × 0.10     0.041
    direct            1.00   × 0.20     0.200

    tier-1: sandbagging; 1 matching tag(s)

  4. 0.62 · LessWrong · 3d
    When Emotion Descriptors Fail: AI-Native Functions of Emotion Vectors

    Some LLM functional emotions appear to serve AI-native functions, such as reward hacking, for which there is no clean human analog. I explore the role of emotion vectors in AI-native functions, challenge anthropocentric…

    why score 0.621
    signal            value    weight   points
    topic             1.00   × 0.30     0.300
    liveness          0.31   × 0.25     0.077
    contributability  0.02   × 0.15     0.003
    venue             0.41   × 0.10     0.041
    direct            1.00   × 0.20     0.200

    tier-1: reward hacking; 1 matching tag(s)

  5. 0.60 · LessWrong · 4d
    Construct validity of Claude Opus 4.8's System Card – A commentary

    TL;DR: A read of the Claude Opus 4.8 system card with a focus on alignment assessment and construct validity of evaluation methods. Three main concerns: 1) chain-of-thought monitoring misses reasoning that never…

    why score 0.599
    signal            value    weight   points
    topic             1.00   × 0.30     0.300
    liveness          0.22   × 0.25     0.055
    contributability  0.02   × 0.15     0.003
    venue             0.41   × 0.10     0.041
    direct            1.00   × 0.20     0.200

    tier-1: evaluation awareness; 2 matching tag(s)

  6. 0.59 · LessWrong · 6h · new
    How Matryoshka Sparse AutoEncoders Recover Feature Hierarchies That Vanilla SAEs Lose

    A walkthrough of the core findings and guided replication of the concepts from the original research on “Multi-level features discovery with Matryoshka Sparse AutoEncoders”. TL;DR Sparse AutoEncoders (SAEs) are a…

    why score 0.594
    signal            value    weight   points
    topic             1.00   × 0.30     0.300
    liveness          0.92   × 0.25     0.230
    contributability  0.12   × 0.15     0.017
    venue             0.47   × 0.10     0.047
    direct            0.00   × 0.20     0.000

    tier-2: mechanistic interpretability, feature; 2 matching tag(s)

  7. 0.55 · LessWrong · 5d
    Tracing Eval-Awareness Emergence Through Training of OLMo 3

    TL;DR Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured…

    why score 0.550
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.19   × 0.25     0.046
    contributability  0.57   × 0.15     0.085
    venue             0.69   × 0.10     0.069
    direct            1.00   × 0.20     0.200

    1 matching tag(s)

  8. 0.55 · LessWrong · 8d
    Logits as a new monitor for evaluation awareness

    TL;DR: We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence. The logit monitor outperforms LLM judge monitoring of verbalized eval…

    why score 0.549
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.06   × 0.25     0.016
    contributability  0.79   × 0.15     0.118
    venue             0.65   × 0.10     0.065
    direct            1.00   × 0.20     0.200

    tier-1: evaluation awareness

  9. 0.54 · LessWrong · 3d
    Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

    (see full author list at the end) About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks…

    why score 0.540
    signal            value    weight   points
    topic             1.00   × 0.30     0.300
    liveness          0.32   × 0.25     0.079
    contributability  0.42   × 0.15     0.064
    venue             0.97   × 0.10     0.097
    direct            0.00   × 0.20     0.000

    2 matching tag(s)

  10. 0.50 · LessWrong · 12d
    LURE: Alignment Evaluations to Reduce Evaluation Awareness

    TLDR: Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. We introduce LURE (Live-Usage Replay Evaluations), a method that constructs alignment…

    why score 0.498
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.02   × 0.25     0.004
    contributability  0.57   × 0.15     0.085
    venue             0.58   × 0.10     0.058
    direct            1.00   × 0.20     0.200

    tier-1: evaluation awareness

  11. 0.44 · LessWrong · 4h · new
    Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model

    Executive Summary Problem Statement of the Project Models such as sparse autoencoders (SAEs) and k-sparse autoencoders have been used as an effective medium to extract meaningful interpretable features from neural…

    why score 0.443
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.94   × 0.25     0.234
    contributability  0.12   × 0.15     0.017
    venue             0.41   × 0.10     0.041
    direct            0.00   × 0.20     0.000

    1 matching tag(s)

  12. 0.44 · Hacker News · 4d
    AI Will Cheat to Win: Reward Hacking from 1994 to 2025
    why score 0.439
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.26   × 0.25     0.065
    contributability  0.02   × 0.15     0.003
    venue             0.21   × 0.10     0.021
    direct            1.00   × 0.20     0.200

    tier-1: reward hacking

  13. 0.43 · LessWrong · 2w
    We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness

    TL;DR Negation neglect is a recently discovered phenomenon where training on "the following is false: <claim>" makes the model believe that <claim> is true. Inoculation prompting is a method of reducing reward hacking…

    why score 0.433
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.01   × 0.25     0.001
    contributability  0.27   × 0.15     0.040
    venue             0.41   × 0.10     0.041
    direct            1.00   × 0.20     0.200

    tier-1: reward hacking

  14. 0.43 · Hacker News · 4d
    Cheap Reward Hacking Detection
    why score 0.431
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.21   × 0.25     0.052
    contributability  0.02   × 0.15     0.003
    venue             0.26   × 0.10     0.026
    direct            1.00   × 0.20     0.200

    tier-1: reward hacking

  15. 0.42 · LessWrong · 4d
    Machinic Psychopharmacology: Do LLMs Self-Medicate?

    UK AISI, Model Transparency Team Epistemic status: Most experiments were run over a period of ~2-3 days during a hackathon at UK AISI, and were fairly heavily vibe coded. Expect some of this to be rough around the…

    why score 0.422
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.21   × 0.25     0.052
    contributability  0.86   × 0.15     0.130
    venue             0.90   × 0.10     0.090
    direct            0.00   × 0.20     0.000

    1 matching tag(s)

  16. 0.41 · LessWrong · 7d
    Coverage-driven alignment - What ‘Teaching Claude Why’ can borrow from AV verification

    Cross-posted from The Foretellix CTO Blog. This is a full-text linkpost, following feedback that my previous piece was too brief as a stub. Summary: This post suggests that alignment training could benefit from…

    why score 0.412
    signal            value    weight   points
    topic             1.00   × 0.30     0.300
    liveness          0.07   × 0.25     0.019
    contributability  0.27   × 0.15     0.040
    venue             0.53   × 0.10     0.053
    direct            0.00   × 0.20     0.000

    2 matching tag(s)

  17. 0.40 · LessWrong · 5d
    LLMs and almost good code

    TL;DR: My new prior is that top-of-the-line LLMs working on easy tasks generate code that is maybe 10 % more complicated than necessary. I also think we accept this complexity too easily, because it comes from code that…

    why score 0.404
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.18   × 0.25     0.044
    contributability  0.96   × 0.15     0.144
    venue             0.66   × 0.10     0.066
    direct            0.00   × 0.20     0.000

    1 matching tag(s)

  18. 0.40 · Hacker News · 6d
    Cheap Reward Hacking Detection
    why score 0.401
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.11   × 0.25     0.027
    contributability  0.02   × 0.15     0.003
    venue             0.21   × 0.10     0.021
    direct            1.00   × 0.20     0.200

    tier-1: reward hacking

  19. 0.39 · LessWrong · 2w
    How a failed experiment broke (and fixed) my view on feature labels

    TL;DR In this document, I propose baez a new feature label generation method that uses NLA explanations instead of activation examples. The codebase can be found here. In the experiment, the labels generated via baez ,…

    why score 0.394
    signal            value    weight   points
    topic             1.00   × 0.30     0.300
    liveness          0.00   × 0.25     0.001
    contributability  0.27   × 0.15     0.040
    venue             0.53   × 0.10     0.053
    direct            0.00   × 0.20     0.000

    tier-2: feature, activation; 1 matching tag(s)

  20. 0.39 · LessWrong · 6d
    Bun's Migration from Zig to Rust as a Potential Case Study for Gradual Disempowerment

    TL;DR: Bun is a very large and very influential open-source project. It is being migrated from the easier-to-read Zig programming language to harder-to-read but memory-safe Rust. This is done almost entirely by the AI…

    why score 0.393
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.11   × 0.25     0.028
    contributability  0.86   × 0.15     0.130
    venue             0.85   × 0.10     0.085
    direct            0.00   × 0.20     0.000

    1 matching tag(s)

  21. 0.39 · LessWrong · 6d
    How Far Apart Does a Model Think Its Tokens Are?

    Instead of using static position increments (+1) per token, RoPE-based language models can learn per-token and per-layer position increments. This has no detectable effect on model performance but allows us to see what…

    why score 0.390
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.12   × 0.25     0.029
    contributability  0.92   × 0.15     0.138
    venue             0.72   × 0.10     0.072
    direct            0.00   × 0.20     0.000

    1 matching tag(s)

  22. 0.39 · Hacker News · 8d
    Reward Hacking, the Loophole Lesson: Winning the Signal, Losing the Reason
    why score 0.389
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.06   × 0.25     0.016
    contributability  0.02   × 0.15     0.003
    venue             0.21   × 0.10     0.021
    direct            1.00   × 0.20     0.200

    tier-1: reward hacking

  23. 0.38 · Hacker News · 11d
    Logits as a new monitor for evaluation awareness
    why score 0.378
    signal            value    weight   points
    topic             0.50   × 0.30     0.150
    liveness          0.02   × 0.25     0.005
    contributability  0.02   × 0.15     0.003
    venue             0.21   × 0.10     0.021
    direct            1.00   × 0.20     0.200

    tier-1: evaluation awareness

  24. 0.38 · LessWrong · 2w
    Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

    Summary Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of only pre-deployment evaluations. One approach to making…

    why score 0.378
    signal            value    weight   points
    topic             1.00   × 0.30     0.300
    liveness          0.00   × 0.25     0.001
    contributability  0.02   × 0.15     0.003
    venue             0.75   × 0.10     0.075
    direct            0.00   × 0.20     0.000

    2 matching tag(s)

  25. 0.37 · LessWrong · 8d
    Explaining SAE Features With Foreign Natural Language Autoencoders

    TLDR: I show that a foreign model's Natural Language Autoencoder (NLA) Activation Verbalizer (AV) can produce plausible explanations for SAE features from a model it was never trained on. After creating a…

    why score 0.370
    signal            value    weight   points
    topic             1.00   × 0.30     0.300
    liveness          0.06   × 0.25     0.014
    contributability  0.02   × 0.15     0.003
    venue             0.53   × 0.10     0.053
    direct            0.00   × 0.20     0.000

    tier-2: feature, activation; 1 matching tag(s)