Field Radar.

What’s worth reading right now in AI reward hacking, specification & evaluation gaming, and mechanistic interpretability — an auto-scored radar over public discussion, refreshed a few times a day.

How this list is made

This page is generated, not hand-picked. A few times a day a script checks LessWrong, Hacker News, and a handful of subreddits for posts about reward hacking, specification gaming, evaluation gaming, and mechanistic interpretability, then scores each one on how on-topic it is, how recently the conversation actually moved, and whether there’s still room to get a word in — as opposed to a thread that already has two hundred comments. Higher scores float to the top. Every title links out to the original; I’m pointing at other people’s work, not reproducing it.

The score is a crude weighted sum, and like any harness it pins down what I bothered to measure and silently lets everything else vary. So read this as one opinionated filter, not a survey of the field — it will miss things, and when it surfaces something dull that’s the weights, not the author.
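As a rough illustration of that weighted sum, here is a minimal sketch in Python. The generator's actual code is not shown on this page, so the function name `radar_score` and the dictionary layout are assumptions made for illustration. The five weights are the ones printed in the "why score" breakdowns below, and the example input is item 3's displayed (rounded) signal values.

```python
# Weights copied from the "why score" tables on this page; they sum to 1.00.
WEIGHTS = {
    "topic": 0.30,
    "liveness": 0.25,
    "contributability": 0.15,
    "venue": 0.10,
    "direct": 0.20,
}


def radar_score(signals: dict[str, float]) -> float:
    """Hypothetical helper: weighted sum of signal values, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise KeyError(f"missing signals: {sorted(missing)}")
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


# Item 3's signal values as displayed (already rounded to two decimals).
item_3 = {
    "topic": 1.00,
    "liveness": 0.38,
    "contributability": 0.96,
    "venue": 0.61,
    "direct": 0.00,
}

score = radar_score(item_3)
print(f"{score:.3f}")
```

Fed the rounded values, the sketch lands within about 0.001 of the 0.599 shown for item 3. The small gap is expected, since the page presumably sums unrounded signal values before rounding for display.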

Sources — LessWrong: ok (2 on-topic) · Hacker News: ok (8 stories) · Reddit: ok (25 posts) — some subreddits rate-limited

As of: 2026-06-20 14:00 ET · Showing: 25 items · New: 1 in last 48h · Refresh: every 6 hours
  1. 0.65 · LessWrong · 8d
    You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them

    Detecting Hidden Behaviors in LLMs via Activation-matched Finetuning — preprint, 2026. [Paper] [Code] TLDR. Given a model with some unknown, abnormal behavior (backdoors, censorship, reward hacking, ...), construct an…

    why score 0.654
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.06   ×0.25   0.015
    contributability  0.42   ×0.15   0.064
    venue             0.76   ×0.10   0.076
    direct            1.00   ×0.20   0.200

    tier-1: reward hacking; tier-2: activation; 1 matching tag(s)

  2. 0.61 · LessWrong · 5d
    Reward Hacking at the 1937 World’s Fair

    The "Paris 1937 World’s Fair" was a dick measuring contest. At the time, the world was on the verge of the worst war in history. The fair was an opportunity for powers to flex and intimidate each other. Who has more…

    why score 0.609
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.16   ×0.25   0.041
    contributability  1.00   ×0.15   0.149
    venue             0.69   ×0.10   0.069
    direct            1.00   ×0.20   0.200

    tier-1: reward hacking

  3. 0.60 · LessWrong · 2d
    If This Were a Test, How Much Would It Cost?

    TL;DR A capable, strategic, misaligned AI doesn't need to figure out whether it's in a test or in real deployment. It just needs to ask: "If this were a test, how much would it have cost to create?" If the answer is…

    why score 0.599
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.38   ×0.25   0.094
    contributability  0.96   ×0.15   0.144
    venue             0.61   ×0.10   0.061
    direct            0.00   ×0.20   0.000

    2 matching tag(s)

  4. 0.58 · LessWrong · 6d
    Can a stronger model fake being a weaker one? Mostly not

    tldr Frontier models can be prompted into a weaker model's capability tier, but not its identity: they adopt a generic weaker-model error pattern, not a specific predecessor's per-question fingerprint. Targeted…

    why score 0.578
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.13   ×0.25   0.034
    contributability  0.02   ×0.15   0.003
    venue             0.41   ×0.10   0.041
    direct            1.00   ×0.20   0.200

    tier-1: sandbagging; 1 matching tag(s)

  5. 0.56 · LessWrong · 7d
    When Emotion Descriptors Fail: AI-Native Functions of Emotion Vectors

    Some LLM functional emotions appear to serve AI-native functions, such as reward hacking, for which there is no clean human analog. I explore the role of emotion vectors in AI-native functions, challenge anthropocentric…

    why score 0.563
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.07   ×0.25   0.019
    contributability  0.02   ×0.15   0.003
    venue             0.41   ×0.10   0.041
    direct            1.00   ×0.20   0.200

    tier-1: reward hacking; 1 matching tag(s)

  6. 0.56 · LessWrong · 8d
    Construct validity of Claude Opus 4.8's System Card – A commentary

    TL;DR: A read of the Claude Opus 4.8 system card with a focus on alignment assessment and construct validity of evaluation methods. Three main concerns: 1) chain-of-thought monitoring misses reasoning that never…

    why score 0.557
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.05   ×0.25   0.013
    contributability  0.02   ×0.15   0.003
    venue             0.41   ×0.10   0.041
    direct            1.00   ×0.20   0.200

    tier-1: evaluation awareness; 2 matching tag(s)

  7. 0.54 · LessWrong · 12d
    Logits as a new monitor for evaluation awareness

    TL;DR: We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence. The logit monitor outperforms LLM judge monitoring of verbalized eval…

    why score 0.537
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.02   ×0.25   0.004
    contributability  0.79   ×0.15   0.118
    venue             0.65   ×0.10   0.065
    direct            1.00   ×0.20   0.200

    tier-1: evaluation awareness

  8. 0.52 · LessWrong · 9d
    Tracing Eval-Awareness Emergence Through Training of OLMo 3

    TL;DR Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured…

    why score 0.515
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.04   ×0.25   0.011
    contributability  0.57   ×0.15   0.085
    venue             0.69   ×0.10   0.069
    direct            1.00   ×0.20   0.200

    1 matching tag(s)

  9. 0.49 · LessWrong · 2w
    LURE: Alignment Evaluations to Reduce Evaluation Awareness

    TLDR: Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. We introduce LURE (Live-Usage Replay Evaluations), a method that constructs alignment…

    why score 0.495
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.00   ×0.25   0.001
    contributability  0.57   ×0.15   0.085
    venue             0.58   ×0.10   0.058
    direct            1.00   ×0.20   0.200

    tier-1: evaluation awareness

  10. 0.48 · LessWrong · 7d
    Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

    (see full author list at the end) About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks…

    why score 0.479
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.08   ×0.25   0.019
    contributability  0.42   ×0.15   0.064
    venue             0.97   ×0.10   0.097
    direct            0.00   ×0.20   0.000

    2 matching tag(s)

  11. 0.46 · LessWrong · 3d
    1 Layer Induction Heads and Some Research

    Motivation Over the past few years, AI research has become one of the most intensely discussed and rapidly evolving fields in technology. For those who spend a significant amount of time reading papers, reproducing…

    why score 0.455
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.28   ×0.25   0.070
    contributability  0.27   ×0.15   0.040
    venue             0.45   ×0.10   0.045
    direct            0.00   ×0.20   0.000

    2 matching tag(s)

  12. 0.45 · LessWrong · 3h · new
    Why should AI be moral?

    I'm a philosopher and in this post, I’m extending a basic philosophical problem for humans to AGI and ASI. I am also proposing a speculative solution. My hope is that if there is a genuine problem here, that this post…

    why score 0.455
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.96   ×0.25   0.239
    contributability  0.12   ×0.15   0.017
    venue             0.48   ×0.10   0.048
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  13. 0.43 · LessWrong · 2w
    We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness

    TL;DR Negation neglect is a recently discovered phenomenon where training on "the following is false: <claim>" makes the model believe that <claim> is true. Inoculation prompting is a method of reducing reward hacking…

    why score 0.432
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.00   ×0.25   0.000
    contributability  0.27   ×0.15   0.040
    venue             0.41   ×0.10   0.041
    direct            1.00   ×0.20   0.200

    tier-1: reward hacking

  14. 0.42 · LessWrong · 4d
    How Matryoshka Sparse AutoEncoders Recover Feature Hierarchies That Vanilla SAEs Lose
    why score 0.420
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.22   ×0.25   0.056
    contributability  0.12   ×0.15   0.017
    venue             0.47   ×0.10   0.047
    direct            0.00   ×0.20   0.000

    tier-2: feature; 2 matching tag(s)

  15. 0.40 · LessWrong · 12d
    Coverage-driven alignment - What ‘Teaching Claude Why’ can borrow from AV verification

    Cross-posted from The Foretellix CTO Blog. This is a full-text linkpost, following feedback that my previous piece was too brief as a stub. Summary: This post suggests that alignment training could benefit from…

    why score 0.398
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.02   ×0.25   0.005
    contributability  0.27   ×0.15   0.040
    venue             0.53   ×0.10   0.053
    direct            0.00   ×0.20   0.000

    2 matching tag(s)

  16. 0.39 · LessWrong · 5d
    VFUSE: Virulent Feature Understanding With Sparse AutoEncoders

    Abstract Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent…

    why score 0.392
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.16   ×0.25   0.039
    contributability  0.02   ×0.15   0.003
    venue             0.50   ×0.10   0.050
    direct            0.00   ×0.20   0.000

    tier-2: mechanistic interpretability, feature; 1 matching tag(s)

  17. 0.39 · Hacker News · 8d
    Cheap Reward Hacking Detection
    why score 0.391
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.05   ×0.25   0.013
    contributability  0.02   ×0.15   0.003
    venue             0.26   ×0.10   0.026
    direct            1.00   ×0.20   0.200

    tier-1: reward hacking

  18. 0.39 · Hacker News · 8d
    AI Will Cheat to Win: Reward Hacking from 1994 to 2025
    why score 0.389
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.06   ×0.25   0.016
    contributability  0.02   ×0.15   0.003
    venue             0.21   ×0.10   0.021
    direct            1.00   ×0.20   0.200

    tier-1: reward hacking

  19. 0.38 · LessWrong · 8d
    Machinic Psychopharmacology: Do LLMs Self-Medicate?

    UK AISI, Model Transparency Team Epistemic status: Most experiments were run over a period of ~2-3 days during a hackathon at UK AISI, and were fairly heavily vibe coded. Expect some of this to be rough around the…

    why score 0.382
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.05   ×0.25   0.013
    contributability  0.86   ×0.15   0.130
    venue             0.90   ×0.10   0.090
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  20. 0.38 · Hacker News · 10d
    Cheap Reward Hacking Detection
    why score 0.380
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.03   ×0.25   0.007
    contributability  0.02   ×0.15   0.003
    venue             0.21   ×0.10   0.021
    direct            1.00   ×0.20   0.200

    tier-1: reward hacking

  21. 0.38 · Hacker News · 12d
    Reward Hacking, the Loophole Lesson: Winning the Signal, Losing the Reason
    why score 0.377
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.02   ×0.25   0.004
    contributability  0.02   ×0.15   0.003
    venue             0.21   ×0.10   0.021
    direct            1.00   ×0.20   0.200

    tier-1: reward hacking

  22. 0.37 · Hacker News · 2w
    Logits as a new monitor for evaluation awareness
    why score 0.374
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.00   ×0.25   0.001
    contributability  0.02   ×0.15   0.003
    venue             0.21   ×0.10   0.021
    direct            1.00   ×0.20   0.200

    tier-1: evaluation awareness

  23. 0.37 · LessWrong · 10d
    Bun's Migration from Zig to Rust as a Potential Case Study for Gradual Disempowerment

    TL;DR: Bun is a very large and very influential open-source project. It is being migrated from the easier-to-read Zig programming language to harder-to-read but memory-safe Rust. This is done almost entirely by the AI…

    why score 0.372
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.03   ×0.25   0.007
    contributability  0.86   ×0.15   0.130
    venue             0.85   ×0.10   0.085
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  24. 0.37 · LessWrong · 9d
    LLMs and almost good code

    TL;DR: My new prior is that top-of-the-line LLMs working on easy tasks generate code that is maybe 10 % more complicated than necessary. I also think we accept this complexity too easily, because it comes from code that…

    why score 0.370
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.04   ×0.25   0.011
    contributability  0.96   ×0.15   0.144
    venue             0.66   ×0.10   0.066
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  25. 0.37 · LessWrong · 10d
    How Far Apart Does a Model Think Its Tokens Are?

    Instead of using static position increments (+1) per token, RoPE-based language models can learn per-token and per-layer position increments. This has no detectable effect on model performance but allows us to see what…

    why score 0.367
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.03   ×0.25   0.007
    contributability  0.92   ×0.15   0.138
    venue             0.72   ×0.10   0.072
    direct            0.00   ×0.20   0.000

    1 matching tag(s)