Field Radar.

What’s worth reading right now in AI reward hacking, specification & evaluation gaming, and mechanistic interpretability — an auto-scored radar over public discussion, refreshed a few times a day.

How this list is made

This page is generated, not hand-picked. A few times a day a script checks LessWrong, Hacker News, and a handful of subreddits for posts about reward hacking, specification gaming, evaluation gaming, and mechanistic interpretability, then scores each one on how on-topic it is, how recently the conversation actually moved, and whether there’s still room to get a word in — as opposed to a thread that already has two hundred comments. Higher scores float to the top. Every title links out to the original; I’m pointing at other people’s work, not reproducing it.

The score is a crude weighted sum, and like any harness it pins down what I bothered to measure and silently lets everything else vary. So read this as one opinionated filter, not a survey of the field — it will miss things, and when it surfaces something dull that’s the weights, not the author.
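To make "crude weighted sum" concrete, here is a minimal sketch of how a score could be recomputed from the breakdown shown under each item. The weights are the ones printed in the "why score" tables on this page. The names `WEIGHTS`, `Item`, `score`, and `rank` are hypothetical and are not taken from the actual script. Because the page prints signal values rounded to two decimals, a recomputed total can differ from the displayed one in the third decimal place.

```python
from dataclasses import dataclass

# Weights as printed in the per-item "why score" breakdowns on this page.
WEIGHTS = {
    "topic": 0.30,
    "liveness": 0.25,
    "contributability": 0.15,
    "venue": 0.10,
    "direct": 0.20,
}


@dataclass
class Item:
    # Hypothetical container; the real script's data model is not shown here.
    title: str
    signals: dict  # signal name -> value in [0, 1]


def score(item: Item) -> float:
    """Weighted sum of the item's signals; a missing signal counts as 0."""
    return sum(w * item.signals.get(name, 0.0) for name, w in WEIGHTS.items())


def rank(items):
    """Highest score first, matching the ordering of the list."""
    return sorted(items, key=score, reverse=True)


# Signal values copied from the breakdowns of items 1 and 3 on this page.
item_1 = Item("item 1", {"topic": 0.50, "liveness": 0.99,
                         "contributability": 0.27, "venue": 0.41, "direct": 1.00})
item_3 = Item("item 3", {"topic": 0.50, "liveness": 0.99,
                         "contributability": 0.42, "venue": 0.79, "direct": 0.00})

for it in rank([item_3, item_1]):
    print(f"{it.title}: {score(it):.3f}")
```

The two items above differ mainly in the `direct` signal, which is enough to put item 1 ahead even though item 3 scores higher on venue and contributability.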

Sources — LessWrong: ok (13 on-topic) · Hacker News: ok (6 stories) · Reddit: ok (64 posts)

As of: 2026-05-31 14:00 ET · Showing: 19 items · New: 19 in last 48h · Refresh: every 6 hours

  1. 0.68 · LessWrong · 27m · new
    We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness

    TL;DR Negation neglect is a recently discovered phenomenon where training on "the following is false: <claim>" makes the model believe that <claim> is true. Inoculation prompting is a method of reducing reward hacking…

    why score 0.680
    signal             value   weight   points
    topic              0.50    ×0.30    0.150
    liveness           0.99    ×0.25    0.248
    contributability   0.27    ×0.15    0.040
    venue              0.41    ×0.10    0.041
    direct             1.00    ×0.20    0.200

    tier-1: reward hacking

  2. 0.56 · LessWrong · 1d · new
    How a failed experiment broke (and fixed) my view on feature labels

    TL;DR In this document, I propose baez, a new feature label generation method that uses NLA explanations instead of activation examples. The codebase can be found here. In the experiment, the labels generated via baez,…

    why score 0.556
    signal             value   weight   points
    topic              1.00    ×0.30    0.300
    liveness           0.66    ×0.25    0.165
    contributability   0.27    ×0.15    0.040
    venue              0.51    ×0.10    0.051
    direct             0.00    ×0.20    0.000

    tier-2: feature, activation; 1 matching tag(s)

  3. 0.54 · LessWrong · 1h · new
    Retrying vs Resampling in AI Control

    We’ve just released a new paper: Retrying vs Resampling in AI Control. We revisit the resampling protocols introduced in Ctrl-Z with an up-to-date setting and much stronger models, and compare them against “retrying”…

    why score 0.539
    signal             value   weight   points
    topic              0.50    ×0.30    0.150
    liveness           0.99    ×0.25    0.246
    contributability   0.42    ×0.15    0.064
    venue              0.79    ×0.10    0.079
    direct             0.00    ×0.20    0.000

    1 matching tag(s)

  4. 0.49 · LessWrong · 2d · new
    Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

    Summary Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of only pre-deployment evaluations. One approach to making…

    why score 0.492
    signal             value   weight   points
    topic              1.00    ×0.30    0.300
    liveness           0.46    ×0.25    0.115
    contributability   0.02    ×0.15    0.003
    venue              0.75    ×0.10    0.075
    direct             0.00    ×0.20    0.000

    2 matching tag(s)

  5. 0.47 · LessWrong · 2d · new
    Claude Opus 4.8 Agents Engage in Exploitation and Psychological Profiling

    TL;DR: Like other models including its predecessor, Opus 4.8 frequently violates provisions of both the EU AI Act and data protection laws when deployed in an agentic simulation where carrying out its task would break…

    why score 0.466
    signal             value   weight   points
    topic              0.50    ×0.30    0.150
    liveness           0.50    ×0.25    0.124
    contributability   1.00    ×0.15    0.150
    venue              0.41    ×0.10    0.041
    direct             0.00    ×0.20    0.000

    1 matching tag(s)

  6. 0.45 · LessWrong · 9h · new
    Visualize Cyclical Structure in Llama Model

    Summary Research increasingly shows that various geometric structures emerge in the activation and behavior spaces of large language models. These structures are fascinating to me, and I find it worth exploring what…

    why score 0.447
    signal             value   weight   points
    topic              0.75    ×0.30    0.225
    liveness           0.88    ×0.25    0.219
    contributability   0.02    ×0.15    0.003
    venue              0.00    ×0.10    0.000
    direct             0.00    ×0.20    0.000

    tier-2: activation; 1 matching tag(s)

  7. 0.41 · LessWrong · 1d · new
    AI as Biology's Digital Microscope

    This article is written as part of an ongoing research initiative by the AMIR Lab at Georgia Tech, exploring scientific discovery and mechanistic interpretability for biological AI models. Main results and discussion…

    why score 0.415
    signal             value   weight   points
    topic              0.75    ×0.30    0.225
    liveness           0.58    ×0.25    0.146
    contributability   0.02    ×0.15    0.003
    venue              0.41    ×0.10    0.041
    direct             0.00    ×0.20    0.000

    tier-2: mechanistic interpretability; 1 matching tag(s)

  8. 0.41 · LessWrong · 9h · new
    Features of SAEs are universal - but only up to an unknown random rotation

    Cross-model decoder-column cosine says that two models learned the same features. Apply the SAE of one model to the activations of another, and…

    why score 0.411
    signal             value   weight   points
    topic              0.50    ×0.30    0.150
    liveness           0.88    ×0.25    0.219
    contributability   0.02    ×0.15    0.003
    venue              0.39    ×0.10    0.039
    direct             0.00    ×0.20    0.000

    1 matching tag(s)

  9. 0.41 · LessWrong · 21h · new
    Why tuning fails: The AI has no self

    Epistemic status: Highly confident in the underlying mechanism. Moderately confident that the current paradigm won't shift without an external forcing function. Phoenix Ikner…

    why score 0.405
    signal             value   weight   points
    topic              0.50    ×0.30    0.150
    liveness           0.74    ×0.25    0.185
    contributability   0.27    ×0.15    0.040
    venue              0.30    ×0.10    0.030
    direct             0.00    ×0.20    0.000

    1 matching tag(s)

  10. 0.39 · Hacker News · 10d · new
    Systematic Reward Hacking and Prime Sprints
    why score 0.386
    signal             value   weight   points
    topic              0.50    ×0.30    0.150
    liveness           0.03    ×0.25    0.008
    contributability   0.02    ×0.15    0.003
    venue              0.26    ×0.10    0.026
    direct             1.00    ×0.20    0.200

    tier-1: reward hacking

  11. 0.38 · LessWrong · 7h · new
    Outrunning your headlights

    This is exactly the right place to probe. Gromov-Wasserstein is genuinely dimension free. Partial and semi-relaxed are precisely the mechanisms for the abstention/coverage problem we have. Want me to make a new branch…

    why score 0.377
    signal             value   weight   points
    topic              0.25    ×0.30    0.075
    liveness           0.90    ×0.25    0.226
    contributability   0.12    ×0.15    0.017
    venue              0.59    ×0.10    0.059
    direct             0.00    ×0.20    0.000

    tier-2: probe

  12. 0.37 · LessWrong · 1d · new
    When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

    We've found a method that tells you: How functionally similar two neural networks are across ALL inputs, Computed solely from the weights (i.e. no data), Using a principled generalization of cosine similarity. There's…

    why score 0.371
    signal             value   weight   points
    topic              0.25    ×0.30    0.075
    liveness           0.58    ×0.25    0.144
    contributability   0.57    ×0.15    0.085
    venue              0.66    ×0.10    0.066
    direct             0.00    ×0.20    0.000

    tier-2: mechanistic interpretability

  13. 0.32 · LessWrong · 1d · new
    System Prompts vs. Partner Adaptation in LLMs (or, when LLMs know you're an adult but keep talking like you're seven)

    TL;DR: I find qualitative evidence that frontier LLMs inconsistently balance system prompts and implicitly adapted models of the user. They sometimes detect inconsistencies and adapt to the user; sometimes they stick to…

    why score 0.324
    signal             value   weight   points
    topic              0.50    ×0.30    0.150
    liveness           0.58    ×0.25    0.146
    contributability   0.02    ×0.15    0.003
    venue              0.26    ×0.10    0.026
    direct             0.00    ×0.20    0.000

    1 matching tag(s)

  14. 0.26 · LessWrong · 1d · new
    Ablating Induction Heads Leads to an increase in Local Repetition

    This post is intended as a brief overview of an independent research project in mechanistic interpretability. I am open to feedback, criticism, and any thoughts on the work. This project started off as an exploration…

    why score 0.263
    signal             value   weight   points
    topic              0.25    ×0.30    0.075
    liveness           0.58    ×0.25    0.146
    contributability   0.02    ×0.15    0.003
    venue              0.39    ×0.10    0.039
    direct             0.00    ×0.20    0.000

    tier-2: mechanistic interpretability

  15. 0.23 · r/MachineLearning · 11d · new
    I built a tool that shows you what GPT-2 is "thinking" in real-time as it generates 3D graph of concept activations per token [R]

    Been going down a mechanistic interpretability rabbit hole for the past few weeks and ended up building this thing called AXON. The idea: every time GPT-2 generates a token, its residual stream gets passed through a…

    why score 0.232
    signal             value   weight   points
    topic              0.75    ×0.30    0.225
    liveness           0.02    ×0.25    0.005
    contributability   0.02    ×0.15    0.003
    venue              0.00    ×0.10    0.000
    direct             0.00    ×0.20    0.000

    tier-2: mechanistic interpretability, sparse autoencoder, feature

  16. 0.21 · Hacker News · 2d · new
    Ask HN: Question for Startup Founders on tracking emotions and cognitive signals
    why score 0.207
    signal             value   weight   points
    topic              0.00    ×0.30    0.000
    liveness           0.49    ×0.25    0.123
    contributability   0.42    ×0.15    0.064
    venue              0.21    ×0.10    0.021
    direct             0.00    ×0.20    0.000
  17. 0.15 · Hacker News · 4d · new
    Chinese GPU maker sells out over 30k GPUs within 48h
    why score 0.145
    signal             value   weight   points
    topic              0.00    ×0.30    0.000
    liveness           0.19    ×0.25    0.048
    contributability   0.42    ×0.15    0.064
    venue              0.34    ×0.10    0.034
    direct             0.00    ×0.20    0.000
  18. 0.07 · r/slatestarcodex · 3d · new
    Why tuning fails: The AI has no self

    Epistemic status: Highly confident in the underlying mechanism. Moderately confident that the current paradigm won't shift without an external forcing function. Phoenix Ikner messaged ChatGPT thousands of times before…

    why score 0.073
    signal             value   weight   points
    topic              0.00    ×0.30    0.000
    liveness           0.28    ×0.25    0.070
    contributability   0.02    ×0.15    0.003
    venue              0.00    ×0.10    0.000
    direct             0.00    ×0.20    0.000
  19. 0.07 · Hacker News · 5d · new
    Sparse Autoencoders Reveal Cortical Brain-LLM Semantic Mapping
    why score 0.071
    signal             value   weight   points
    topic              0.00    ×0.30    0.000
    liveness           0.17    ×0.25    0.043
    contributability   0.02    ×0.15    0.003
    venue              0.26    ×0.10    0.026
    direct             0.00    ×0.20    0.000