Field Radar.
What’s worth reading right now in AI reward hacking, specification & evaluation gaming, and mechanistic interpretability — an auto-scored radar over public discussion, refreshed a few times a day.
How this list is made
This page is generated, not hand-picked. A few times a day a script checks LessWrong, Hacker News, and a handful of subreddits for posts about reward hacking, specification gaming, evaluation gaming, and mechanistic interpretability, then scores each one on how on-topic it is, how recently the conversation actually moved, and whether there’s still room to get a word in — as opposed to a thread that already has two hundred comments. Higher scores float to the top. Every title links out to the original; I’m pointing at other people’s work, not reproducing it.
The score is a crude weighted sum, and like any harness it pins down what I bothered to measure and silently lets everything else vary. So read this as one opinionated filter, not a survey of the field: it will miss things, and when it surfaces something dull, that’s the weights, not the author.
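The weighted sum described above can be sketched in a few lines. This is a minimal illustration, not the script that generates this page: the weights and signal names are copied from the per-item "why score" breakdowns, while `WEIGHTS`, `radar_score`, and `rank` are hypothetical names, and how each signal value is derived in the first place is not covered.

```python
# Weights as they appear in the per-item "why score" breakdowns on this page.
WEIGHTS = {
    "topic": 0.30,
    "liveness": 0.25,
    "contributability": 0.15,
    "venue": 0.10,
    "direct": 0.20,
}


def radar_score(signals):
    """Weighted sum of signal values (hypothetical helper).

    `signals` maps each signal name to a value, assumed to lie in [0, 1].
    """
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


def rank(items):
    """Sort (title, signals) pairs so that higher scores come first."""
    return sorted(items, key=lambda item: radar_score(item[1]), reverse=True)
```

With the signal values displayed for the top entry (topic 0.50, liveness 0.99, contributability 0.27, venue 0.41, direct 1.00), `radar_score` returns roughly 0.679. The page shows 0.680, presumably because the displayed inputs are already rounded.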
Sources — LessWrong: ok (13 on-topic) · Hacker News: ok (6 stories) · Reddit: ok (64 posts)
- As of: 2026-05-31 14:00 ET
- Showing: 19 items
- New: 19 in last 48h
- Refresh: every 6 hours
- 0.68 · LessWrong · 27m · new · We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness
TL;DR Negation neglect is a recently discovered phenomenon where training on "the following is false: <claim>" makes the model believe that <claim> is true. Inoculation prompting is a method of reducing reward hacking…
why score 0.680
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.99 | ×0.25 | 0.248 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.56 · LessWrong · 1d · new · How a failed experiment broke (and fixed) my view on feature labels
TL;DR In this document, I propose baez a new feature label generation method that uses NLA explanations instead of activation examples. The codebase can be found here. In the experiment, the labels generated via baez,…
why score 0.556
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.66 | ×0.25 | 0.165 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.51 | ×0.10 | 0.051 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: feature, activation; 1 matching tag(s)
- 0.54 · LessWrong · 1h · new · Retrying vs Resampling in AI Control
We’ve just released a new paper: Retrying vs Resampling in AI Control. We revisit the resampling protocols introduced in Ctrl-Z with an up-to-date setting and much stronger models, and compare them against “retrying”…
why score 0.539
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.99 | ×0.25 | 0.246 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.79 | ×0.10 | 0.079 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.49 · LessWrong · 2d · new · Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour
Summary Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of only pre-deployment evaluations. One approach to making…
why score 0.492
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.46 | ×0.25 | 0.115 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.75 | ×0.10 | 0.075 |
| direct | 0.00 | ×0.20 | 0.000 |

2 matching tag(s)
- 0.47 · LessWrong · 2d · new · Claude Opus 4.8 Agents Engage in Exploitation and Psychological Profiling
TL;DR: Like other models including its predecessor, Opus 4.8 frequently violates provisions of both the EU AI Act and data protection laws when deployed in an agentic simulation where carrying out its task would break…
why score 0.466
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.50 | ×0.25 | 0.124 |
| contributability | 1.00 | ×0.15 | 0.150 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.45 · LessWrong · 9h · new · Visualize Cyclical Structure in Llama Model
Summary Research increasingly shows that various geometric structures emerge in the activation and behavior spaces of large language models. These structures are fascinating to me, and I find it worth exploring what…
why score 0.447
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.75 | ×0.30 | 0.225 |
| liveness | 0.88 | ×0.25 | 0.219 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.00 | ×0.10 | 0.000 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: activation; 1 matching tag(s)
- 0.41 · LessWrong · 1d · new · AI as Biology's Digital Microscope
This article is written as part of an ongoing research initiative by the AMIR Lab at Georgia Tech, exploring scientific discovery and mechanistic interpretability for biological AI models. Main results and discussion…
why score 0.415
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.75 | ×0.30 | 0.225 |
| liveness | 0.58 | ×0.25 | 0.146 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: mechanistic interpretability; 1 matching tag(s)
- 0.41 · LessWrong · 9h · new · Features of SAEs are universal - but only up to an unknown random rotation
Cross-model decoder-column cosine says that two models learned the same features. Apply the SAE of one model to the activations of another, and…
why score 0.411
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.88 | ×0.25 | 0.219 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.39 | ×0.10 | 0.039 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.41 · LessWrong · 21h · new · Why tuning fails: The AI has no self
Epistemic status: Highly confident in the underlying mechanism. Moderately confident that the current paradigm won't shift without an external forcing function. Phoenix Ikner…
why score 0.405
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.74 | ×0.25 | 0.185 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.30 | ×0.10 | 0.030 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.39 · Hacker News · 10d · new · Systematic Reward Hacking and Prime Sprints
why score 0.386
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.03 | ×0.25 | 0.008 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.26 | ×0.10 | 0.026 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.38 · LessWrong · 7h · new · Outrunning your headlights
This is exactly the right place to probe. Gromov-Wasserstein is genuinely dimension free. Partial and semi-relaxed are precisely the mechanisms for the abstention/coverage problem we have. Want me to make a new branch…
why score 0.377
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.25 | ×0.30 | 0.075 |
| liveness | 0.90 | ×0.25 | 0.226 |
| contributability | 0.12 | ×0.15 | 0.017 |
| venue | 0.59 | ×0.10 | 0.059 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: probe
- 0.37 · LessWrong · 1d · new · When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
We've found a method that tells you: How functionally similar two neural networks are across ALL inputs, Computed solely from the weights (i.e. no data), Using a principled generalization of cosine similarity. There's…
why score 0.371
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.25 | ×0.30 | 0.075 |
| liveness | 0.58 | ×0.25 | 0.144 |
| contributability | 0.57 | ×0.15 | 0.085 |
| venue | 0.66 | ×0.10 | 0.066 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: mechanistic interpretability
- 0.32 · LessWrong · 1d · new · System Prompts vs. Partner Adaptation in LLMs (or, when LLMs know you're an adult but keep talking like you're seven)
TL;DR: I find qualitative evidence that frontier LLMs inconsistently balance system prompts and implicitly adapted models of the user. They sometimes detect inconsistencies and adapt to the user; sometimes they stick to…
why score 0.324
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.58 | ×0.25 | 0.146 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.26 | ×0.10 | 0.026 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.26 · LessWrong · 1d · new · Ablating Induction Heads Leads to an increase in Local Repetition
This post is intended as a brief overview of an independent research project in mechanistic interpretability. I am open to feedback, criticism, and any thoughts on the work. This project started off as an exploration…
why score 0.263
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.25 | ×0.30 | 0.075 |
| liveness | 0.58 | ×0.25 | 0.146 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.39 | ×0.10 | 0.039 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: mechanistic interpretability
- 0.23 · r/MachineLearning · 11d · new · I built a tool that shows you what GPT-2 is "thinking" in real-time as it generates 3D graph of concept activations per token [R]
Been going down a mechanistic interpretability rabbit hole for the past few weeks and ended up building this thing called AXON. The idea: every time GPT-2 generates a token, its residual stream gets passed through a…
why score 0.232
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.75 | ×0.30 | 0.225 |
| liveness | 0.02 | ×0.25 | 0.005 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.00 | ×0.10 | 0.000 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: mechanistic interpretability, sparse autoencoder, feature
- 0.21 · Hacker News · 2d · new · Ask HN: Question for Startup Founders on tracking emotions and cognitive signals
why score 0.207
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.00 | ×0.30 | 0.000 |
| liveness | 0.49 | ×0.25 | 0.123 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.21 | ×0.10 | 0.021 |
| direct | 0.00 | ×0.20 | 0.000 |

- 0.15 · Hacker News · 4d · new · Chinese GPU maker sells out over 30k GPUs within 48h
why score 0.145
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.00 | ×0.30 | 0.000 |
| liveness | 0.19 | ×0.25 | 0.048 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.34 | ×0.10 | 0.034 |
| direct | 0.00 | ×0.20 | 0.000 |

- 0.07 · r/slatestarcodex · 3d · new · Why tuning fails: The AI has no self
Epistemic status: Highly confident in the underlying mechanism. Moderately confident that the current paradigm won't shift without an external forcing function. Phoenix Ikner messaged ChatGPT thousands of times before…
why score 0.073
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.00 | ×0.30 | 0.000 |
| liveness | 0.28 | ×0.25 | 0.070 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.00 | ×0.10 | 0.000 |
| direct | 0.00 | ×0.20 | 0.000 |

- 0.07 · Hacker News · 5d · new · Sparse Autoencoders Reveal Cortical Brain-LLM Semantic Mapping
why score 0.071
| signal | value | weight | points |
| --- | --- | --- | --- |
| topic | 0.00 | ×0.30 | 0.000 |
| liveness | 0.17 | ×0.25 | 0.043 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.26 | ×0.10 | 0.026 |
| direct | 0.00 | ×0.20 | 0.000 |