Field Radar.

What’s worth reading right now in AI reward hacking, specification & evaluation gaming, and mechanistic interpretability — an auto-scored radar over public discussion, refreshed a few times a day.

How this list is made

This page is generated, not hand-picked. A few times a day a script checks LessWrong, Hacker News, and a handful of subreddits for posts about reward hacking, specification gaming, evaluation gaming, and mechanistic interpretability, then scores each one on how on-topic it is, how recently the conversation actually moved, and whether there’s still room to get a word in — as opposed to a thread that already has two hundred comments. Higher scores float to the top. Every title links out to the original; I’m pointing at other people’s work, not reproducing it.

The score is a crude weighted sum, and like any harness it pins down what I bothered to measure and silently lets everything else vary. So read this as one opinionated filter, not a survey of the field: it will miss things, and when it surfaces something dull, blame the weights, not the author.
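As a rough illustration of the weighted sum described above, here is a minimal sketch in Python. The five weights are the ones shown in the "why score" breakdowns on this page; the names `WEIGHTS`, `Item`, `score`, and `rank` are hypothetical, and how each signal value is actually computed is not stated here, so the sketch takes the values as given.

```python
from dataclasses import dataclass

# Weights as they appear in the "why score" breakdowns on this page.
WEIGHTS = {
    "topic": 0.30,
    "liveness": 0.25,
    "contributability": 0.15,
    "venue": 0.10,
    "direct": 0.20,
}


@dataclass
class Item:
    """Hypothetical container for one radar entry."""
    title: str
    signals: dict[str, float]  # each signal value is assumed to lie in [0, 1]


def score(signals: dict[str, float]) -> float:
    """Weighted sum of the signals; a missing signal counts as 0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)


def rank(items: list[Item]) -> list[Item]:
    """Highest score first, matching 'higher scores float to the top'."""
    return sorted(items, key=lambda item: score(item.signals), reverse=True)


# Signal values copied from two breakdowns shown further down the page.
petri = Item(
    "A Multi-Agent Extension for Petri",
    {"topic": 1.00, "liveness": 0.07, "contributability": 0.02,
     "venue": 0.45, "direct": 1.00},
)
kv_cache = Item(
    "Quadrillion Param Costs: KV Cache, Context Length, Frontier Margins",
    {"topic": 0.50, "liveness": 0.57, "contributability": 0.57,
     "venue": 0.80, "direct": 0.00},
)

ranked = rank([kv_cache, petri])
for item in ranked:
    print(f"{score(item.signals):.3f}  {item.title}")
```

Because the breakdowns display rounded signal values, recomputing a score from them can differ from the listed total in the third decimal place.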

Sources — LessWrong: ok (1 on-topic) · Hacker News: ok (5 stories) · Reddit: ok (25 posts) — some subreddits rate-limited

As of: 2026-07-30 14:00 ET
Showing: 25 items
New: 1 in last 48h
Refresh: Every 6 hours

0.57 · LessWrong · 7d
A Multi-Agent Extension for Petri · carissacullen
Intro Petri is an open-source framework built on Inspect AI for automated AI Safety evaluations first released by Anthropic, but now maintained and developed by Meridian Labs. Each evaluation involves three agents, the…
- ai-control
- ai-evaluations
- ai-safety-2
- multi-agent-safety
why score 0.566
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.07 ×0.25 0.018
contributability 0.02 ×0.15 0.003
venue 0.45 ×0.10 0.045
direct 1.00 ×0.20 0.200
2 matching tag(s)
0.54 · LessWrong · 2w
Linear Probes add little for Verifiable Reward Hacking · Chandram Dutta
Summary Tested whether linear probes can detect reward hacking early during GRPO training on a small model. Used a synthetic arithmetic task with a planted bug in the reward checker. Probes achieved near-perfect…
- interpretability-ml-and-ai
- reinforcement-learning
- ai
why score 0.543
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.00 ×0.25 0.001
contributability 0.02 ×0.15 0.003
venue 0.39 ×0.10 0.039
direct 1.00 ×0.20 0.200
tier-1: reward hacking; 1 matching tag(s)
0.48 · LessWrong · 2d
Multi-Turn Drift Increases Scheming · Carlos Guerrero Alvarez
TLDR - We talk about scheming, and why research on this phenomenon is crucial for AI safety. We find a particular environment/scenario where scheming happens at a higher rate than normal. We provide hypotheses for why…
- ai-control
- deceptive-alignment
- ai
why score 0.483
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.37 ×0.25 0.093
contributability 0.27 ×0.15 0.040
venue 0.51 ×0.10 0.051
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.46 · LessWrong · 1d
Quadrillion Param Costs: KV Cache, Context Length, Frontier Margins · Vladimir_Nesov
The models of 2028-2031 get much bigger than the models of 2026, going from 10T total params in 2026 to maybe 240T params in 2028 [1] and then 1.4 quadrillion params in 2031, as I estimate in the previous post from HBM…
- ai-timelines
- compute
- language-models-llms
- scaling-laws
why score 0.459
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.57 ×0.25 0.143
contributability 0.57 ×0.15 0.085
venue 0.80 ×0.10 0.080
direct 0.00 ×0.20 0.000
1 matching tag(s)
0.45 · LessWrong · 8h · new
Intentional Control of Internal States in Gemma 3 27B · Julius Kamp
This research was done as my capstone project during ARBOx4. Epistemic Status: I'm relatively sure the results I obtained and my interpretations are correct. I'm unsure if the effect would replicate in a different…
- interpretability-ml-and-ai
- ai
why score 0.453
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.89 ×0.25 0.222
contributability 0.12 ×0.15 0.017
venue 0.63 ×0.10 0.063
direct 0.00 ×0.20 0.000
1 matching tag(s)
0.44 · LessWrong · 2d
When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models · Sai Kartheek Reddy
By Sai Kartheek Reddy Kasu, Nils Lukas, and Samuele Poppi This post is a summary of our accepted paper at the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN). The full paper is available here TL;DR The Setup:…
- ai-oversight
- chain-of-thought-alignment
- deceptive-alignment
- interpretability-ml-and-ai
why score 0.436
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.38 ×0.25 0.094
contributability 0.02 ×0.15 0.003
venue 0.39 ×0.10 0.039
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.43 · LessWrong · 9d
Fable is SOTA at CIFAR Speedrun (& specification gaming) · rohuang
Fulcrum is working on an AI R&D optimization benchmark. Here, we present results from one of our tasks, including preliminary results from Fable. For more detail on Fable’s solution, check out…
why score 0.426
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.04 ×0.25 0.009
contributability 0.02 ×0.15 0.003
venue 0.64 ×0.10 0.064
direct 1.00 ×0.20 0.200
tier-1: specification gaming
0.42 · LessWrong · 3d
LLMs are (still) mostly powered by imitative learning, not RL · Steven Byrnes
Reinforcement learning from verifiable rewards (RLVR) is the hot new thing in LLM training. It’s so hot, and people spend so much time talking about it, that they sometimes lose sight of the big picture. Stepping back,…
- language-models-llms
- reinforcement-learning
- ai
why score 0.421
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.28 ×0.25 0.070
contributability 0.72 ×0.15 0.108
venue 0.94 ×0.10 0.094
direct 0.00 ×0.20 0.000
1 matching tag(s)
0.41 · LessWrong · 3d
Inoculate or Reflect? Two training interventions under prompting, steering, and patching · Ayesha Imran
Anthropic's recent paper, Verbalizable Representations Form a Global Workspace in Language Models, contains a small experiment near the end that we found more interesting than the main findings. Surprising that it's so…
- ai-control
- ai-safety-2
- interpretability-ml-and-ai
- ai
why score 0.412
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.26 ×0.25 0.066
contributability 0.02 ×0.15 0.003
venue 0.43 ×0.10 0.043
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.41 · Hacker News · 6d
Show HN: Vinv-Ties every runtime trace to code segment, prevents reward hacking · sohamac
why score 0.409
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.12 ×0.25 0.031
contributability 0.02 ×0.15 0.003
venue 0.26 ×0.10 0.026
direct 1.00 ×0.20 0.200
tier-1: reward hacking
0.41 · Hacker News · 5d
Show HN: VinvAI – Ties runtime trace to code segment, prevents reward hacking · itsAnshul
why score 0.408
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.14 ×0.25 0.035
contributability 0.02 ×0.15 0.003
venue 0.21 ×0.10 0.021
direct 1.00 ×0.20 0.200
tier-1: reward hacking
0.40 · LessWrong · 5d
Linear probes tell you where quantization will hurt · Aniket Ghosh
Epistemic status: I have only tested one encoder family (BERT-base and its relatives) and one decoder LLM (Qwen2.5-3B), one seed, token-level tasks, and post-training weight quantization. I trust the results because I…
- interpretability-ml-and-ai
- language-models-llms
- machine-learning-ml
- ai
why score 0.402
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.15 ×0.25 0.037
contributability 0.02 ×0.15 0.003
venue 0.63 ×0.10 0.063
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.40 · LessWrong · 2w
Models are blind outside the J-space. NLAs aren't. · Pranav Viswanath
TLDR: On Llama-3.3-70B, I found thoughts it cannot see that are actively steering its behavior; and Anthropic's released NLA (Natural Language Autoencoder) reads them anyway. When asked if it sees a hidden thought, the…
- ai-control
- interpretability-ml-and-ai
- ai
why score 0.402
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.00 ×0.25 0.000
contributability 0.27 ×0.15 0.040
venue 0.61 ×0.10 0.061
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.40 · LessWrong · 5d
SONI: Selective Orthogonalisation via Noise Injection · Jasper Chong
This project was completed as a capstone for TARA. All code is available in github. TL;DR The Problem: Neural networks use superposition to pack many concepts into small latent spaces by making feature vectors…
- interpretability-ml-and-ai
- logic-and-mathematics
- ai
why score 0.397
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.15 ×0.25 0.037
contributability 0.02 ×0.15 0.003
venue 0.57 ×0.10 0.057
direct 0.00 ×0.20 0.000
tier-2: feature, activation; 1 matching tag(s)
0.39 · LessWrong · 7d
V&V takes on OpenAI’s long-horizon incidents · Yoav Hollander
[Cross-posted from The Foretellix CTO Blog. These short takes try to put a verification-and-validation slant on AI-safety / alignment topics – they are not full treatments. I co-originated coverage-driven verification…
- ai-control
- ai-evaluations
- verification
- ai
why score 0.386
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.10 ×0.25 0.024
contributability 0.02 ×0.15 0.003
venue 0.59 ×0.10 0.059
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.37 · LessWrong · 8d
Mechanistic interpretability hypotheses for Measuring Reward-Seeking by Instilling Contrastive Beliefs and additional comments · Burny
This is interesting research! https://alignment.openai.com/measuring-reward-seeking It made me think of few overlapping hypotheses for what might be happening here, how did the grader behavior emerge at the pretraining…
- ai-control
- interpretability-ml-and-ai
- reinforcement-learning
- ai
why score 0.373
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.07 ×0.25 0.017
contributability 0.02 ×0.15 0.003
venue 0.53 ×0.10 0.053
direct 0.00 ×0.20 0.000
tier-2: mechanistic interpretability; 2 matching tag(s)
0.37 · LessWrong · 6d
Fixing rewards for NLA to reduce confabulation · SEONG PYO HONG
Hello, This is my first post on Lesswrong. Hope my contribution makes the world a better and safer place. Note: 1. This post is 100% human-written. 2. Full paper in preparation for ICLR 2027 Anthropic's NLA(Natural…
- interpretability-ml-and-ai
- ai
why score 0.371
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.11 ×0.25 0.027
contributability 0.02 ×0.15 0.003
venue 0.41 ×0.10 0.041
direct 0.00 ×0.20 0.000
tier-2: mechanistic interpretability, sparse autoencoder; 1 matching tag(s)
0.37 · LessWrong · 10d
Is there even a ground-truth for LLMs’ internal representations? · Chunwei Ma
[This is an introductory blog for the paper Laguerre Geometry for Interpreting Large Language Models and the GitHub repository Geometric Lens.] LLM Lens: What does an internal vector mean? Anthropic's recent paper on…
- interpretability-ml-and-ai
- language-models-llms
- ai
why score 0.370
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.03 ×0.25 0.007
contributability 0.02 ×0.15 0.003
venue 0.60 ×0.10 0.060
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.37 · LessWrong · 11d
The State of AI Consciousness Research · Noa Weiss
Epistemic status: a survey, not an argument. I am agnostic on whether any current system is conscious; the claim is only that the question is researchable. This piece surveys the empirical research on AI consciousness.…
- ai-consciousness
- ai-sentience
- consciousness
- interpretability-ml-and-ai
why score 0.368
signal value weight points
topic 0.75 ×0.30 0.225
liveness 0.02 ×0.25 0.006
contributability 0.42 ×0.15 0.064
venue 0.73 ×0.10 0.073
direct 0.00 ×0.20 0.000
tier-2: mechanistic interpretability; 1 matching tag(s)
0.36 · LessWrong · 2w
Free will as a model parameter · darshanav
The most popular take on the standard free will debate is that you are the algorithm. Your preferences and reasoning that determine your actions IS free will. But this resolution leaves me not entirely satisfied because…
- deceptive-alignment
- free-energy-principle
- free-will
- gradient-hacking
why score 0.363
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.00 ×0.25 0.000
contributability 0.12 ×0.15 0.017
venue 0.45 ×0.10 0.045
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.36 · LessWrong · 9d
Steering Blackmail Through a Model's "Emotional State" · Chijioke Ugwuanyi
Gemma 3 12B's choice to blackmail is visible inside the model, but the obvious signal is not a useful control switch; surprisingly, a nearby "desperate versus calm" signal is. In this case study, I show how that…
- ai-safety-public-materials-1
- inner-alignment
- interpretability-ml-and-ai
- ai
why score 0.360
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.05 ×0.25 0.012
contributability 0.02 ×0.15 0.003
venue 0.45 ×0.10 0.045
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.36 · LessWrong · 2w
When is misalignment just a bug? · Yoav Hollander
Cross-posted from The Foretellix CTO Blog. Introduction and epistemic status: This is the first post in a planned series, “Alignment as a verification problem”. I co-originated coverage-driven verification (CDV), which…
- ai-control
- ai-evaluations
- ai-governance
- ai-safety-cases
why score 0.357
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.00 ×0.25 0.001
contributability 0.02 ×0.15 0.003
venue 0.53 ×0.10 0.053
direct 0.00 ×0.20 0.000
3 matching tag(s)
0.36 · LessWrong · 10d
Tracing causal structure in LLM-generated text: a different lens on the Dallas circuit · yun dong
The classic "Dallas" example from Anthropic focuses on an internal circuit in an LLM. I became curious about what the same underlying process looks like when viewed through the generated reasoning trace instead of…
- interpretability-ml-and-ai
- language-models-llms
- ai
why score 0.356
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.03 ×0.25 0.008
contributability 0.02 ×0.15 0.003
venue 0.45 ×0.10 0.045
direct 0.00 ×0.20 0.000
tier-2: circuit; 2 matching tag(s)
0.35 · LessWrong · 2w
How robust are natural language autoencoders to initialization? · michaelzhang
Natural language autoencoders are meant to take in an LLM's activation vector and describe in plain text what the model is thinking. However, its training data collection involves asking Claude to guess what a model…
- interpretability-ml-and-ai
- mats-program
- ai
why score 0.349
signal value weight points
topic 0.75 ×0.30 0.225
liveness 0.00 ×0.25 0.001
contributability 0.27 ×0.15 0.040
venue 0.83 ×0.10 0.083
direct 0.00 ×0.20 0.000
tier-2: activation; 1 matching tag(s)
0.35 · LessWrong · 8d
Attempt at Finding Alignment Faking on Llama 70B to test sleeper-agent detection generalizes · skn8733
Epistemic status: empirical report from a 30-hour project sprint. Null result, reported honestly, with full code and data. TL;DR MacDiarmid et al. (2024) showed that a linear probe on model's internal activations can…
- ai-alignment-fieldbuilding
- deceptive-alignment
- interpretability-ml-and-ai
- ai
why score 0.346
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.05 ×0.25 0.014
contributability 0.02 ×0.15 0.003
venue 0.30 ×0.10 0.030
direct 0.00 ×0.20 0.000
tier-2: probe; 2 matching tag(s)
