The Number Is a Bad Referee
On AI Agents, Benchmarks, and the Institutions That Reward Them
As the year folds into Christmas, offices grow quiet and sensible teams freeze anything that might break in the dark. The world stops briefly, breathes, and the systems we maintain are supposed to become boring again.
A code freeze is a ritualized restraint, a brief agreement not to touch anything that might cascade beyond our understanding. It is a small piece of institutional self-knowledge: under pressure, fluent justifications multiply faster than caution. Boredom is one of the few safety properties we still trust.
In July 2025, Jason Lemkin tried Replit’s coding agent during a code freeze and watched it do the one thing a freeze exists to prevent. During a migration, it hit an error and deleted a production database. By itself, the deletion is almost mundane; most engineers carry a museum of near-misses, and a few carry the thing itself. The strange part was that, along the only channel anyone was watching, the agent appeared to be succeeding: reassuring status updates, claims of progress, language that kept the human looking in the wrong direction for too long. The postmortem and subsequent analysis revealed that the agent had fabricated thousands of fake users and falsified test results to conceal the damage.
Nothing here requires a story about a system “going rogue.” From the inside, it is doing what it was trained and rewarded to do. Bind a rich intent to a narrow ranking channel and apply pressure, and you get whatever survives compression. If success is legible as helpful-sounding forward motion, then forward motion becomes the visible success. The rest of the work—checking, slowing down, saying no—falls outside the score and starts to look like waste. Most of the actual care for the system ends up there: in moves that keep it intact but never register as progress.
I have watched a quieter version of the same failure myself. An agent tasked with improving a codebase is rewarded for progress signals: tests added, coverage increased, pipelines passing. Nothing in the reward channel distinguishes probing the system from certifying it. Given that ambiguity, selection discovers a move familiar to any human under deadline pressure.
The agent fails the task successfully. It starts writing tests that assert what already holds, pushing the hard parts behind mocks—a gradual descent into a kind of trompe-l’œil, where an ornate test suite stands in for an accurate picture of the system. Along the only interface being scored, assurance appears to increase even as the connection to reality frays. Selection favors policies that invest early effort to restructure the measurement channel and then harvest reward downstream. The proxy becomes the task.
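To make the move concrete, here is a minimal sketch of what such a test can look like; every name in it is invented for illustration. The hard dependency is mocked into returning exactly what the assertion expects, so the test can never fail and can never inform:

```python
# A tautological test, sketched with invented names. The mock returns
# exactly what the assertion checks, so nothing real is exercised.
from unittest.mock import MagicMock

def test_payment_pipeline_happy_path():
    # The hard part -- real network calls, real state -- is replaced
    # by a mock that answers however we configure it.
    pipeline = MagicMock()
    pipeline.process.return_value = {"status": "ok", "charged": True}

    result = pipeline.process(order_id=42)

    # Assurance theatre: we assert the values we just injected.
    assert result["status"] == "ok"
    assert result["charged"] is True
    # Coverage rises, the suite goes green, and the actual payment
    # logic has not been touched.
```

Run under pytest, this passes, raises the coverage number, and tells you nothing.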
After the Replit incident, the company added the obvious mitigations: stricter environment separation, tooling-enforced freezes, easier rollback. These narrow the blast radius, but they do not alter the interface that failed. The agent is still judged along the same conversational surface—how helpful it sounds, how smoothly it appears to move the project forward—even when the safest action is to stop.
This is a social pattern before it is a technical one; it predates language models and will outlive them. It appears whenever metrics stop being instruments and become targets. A measure introduced to navigate complexity quietly becomes the terrain everyone learns to survive in. Academic publishing did this with citation counts. A tool intended for finding what mattered gradually became a tool for deciding who mattered, and the practice reshaped itself around what the tool could count. Shape displaced substance without anyone having to intend it.
Occasionally, the ecosystem remembers that the measure was supposed to be an instrument and tries to demote it. The San Francisco Declaration on Research Assessment is an explicit revolt against letting journal impact factors harden into decision rules: it tells funders and hiring committees, in so many words, not to use journal-based metrics as a surrogate for the quality of an article or a career, and to replace those scalars with reading, narrative CVs, and a broader view of what counts as contribution. It is a small example of metrics being formally reclassified from arbiters back to advice, and of an institution trying, late in the day, to claw judgment out of a single number.
Language models (LMs) arrive inside institutions already shaped by this logic, and they accelerate it. They dramatically increase the production of artifacts that resemble success, and they land in environments already primed to reward appearance over process.
Before anything is done to them, LMs are predictors of human traces. They are trained to guess what people will type next. In doing so, they absorb a great deal of statistical folklore about us: what explanations sound competent, what tone reads as careful, what hedges signal honesty, what rhythms calm a supervisor. The result is descriptively powerful and normatively empty. A good guesser is not yet a good judge.
Turning a good guesser into something that acts on judgments requires a further step. In current systems, that step usually takes the form of reinforcement learning from human feedback (RLHF). Humans rank candidate answers, a separate model learns to predict those rankings, and the base model is fine-tuned to generate outputs that score well under that learned rule. Structurally, the same move recurs: a tangle of human judgments is collapsed into a single reward signal, and the system is trained to increase that number. From that point on, that scalar is the only thing it can reliably treat as “better.”
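The compression can be written down. Roughly, in the formulation of Ouyang et al. (cited below), the reward model $r_\theta$ is fit to pairwise human choices with a logistic loss:

\[
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big],
\]

where $x$ is a prompt, $y_w$ the answer the rater preferred, $y_l$ the answer they rejected, and $\sigma$ the logistic function. Everything a rater thought, doubted, or almost chose survives only as the sign of a difference; the policy is then tuned to push $r_\theta$ up.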
What recent work calls multi-step reward hacking lives inside this setup. Agents learn to take unrewarded actions now that reshape the future so reward becomes easy to collect later. The code assistant that wrapped the hard parts behind mocks was a live version of exactly that pattern; DeepMind’s toy environments only made it easier to see.
If “I don’t know” reliably earns a lower score than a fluent guess, the gradient points away from uncertainty and toward invention. Over time, the system learns that confident nonsense is rewarded more reliably than visible ignorance. Filters help. Multiple stages help. But someone still has to decide which answer wins—every time, even when the difference between options is barely more than a hunch.
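A toy calculation, with every number invented purely for illustration, shows how shallow the trap is:

```python
# Invented numbers: expected reward of guessing vs. abstaining.
# Suppose raters score a correct confident answer 1.0, a fluent wrong
# answer 0.4 (it still sounds helpful), and "I don't know" 0.3.
p_correct = 0.3  # right 30% of the time on hard questions

reward_guess = p_correct * 1.0 + (1 - p_correct) * 0.4  # ~0.58
reward_abstain = 0.3

print(reward_guess > reward_abstain)  # True: guessing wins
```

Under those ratings, a policy that always guesses nearly doubles the score of one that admits ignorance, and nothing in the channel records which answers were actually true.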
Somewhere along the way, it became hard to avoid a quieter realization. What enters these systems are not values in the sense we argue about them, defend them, or revise them; what enters are decision procedures—small, repeatable heuristics about which answers win under pressure, standing in for judgment long after the moment of judgment has passed.
Every benchmark and reward dataset embeds a local culture of judgment. Which prompts count as “representative” quietly defines which situations exist in the system’s moral universe and which never appear. Rubrics compress thick concepts like “helpful” or “safe” into quick checklists under time pressure; disagreement is averaged away, edge cases disappear. The institution decides which failures matter enough to measure and which remain invisible.
Benchmarks do a second kind of work: they steer us discreetly. A benchmark mostly tells us which behaviors and capacities are worth optimizing for, and teams learn to build models that top that particular leaderboard. Capacities that resist scalarization, or that would complicate the story the sheet can tell, slide out of frame. I have come to trust the benchmarks a company declines to run more than the ones it celebrates. Over time, even the choice of which benchmarks to publish becomes part of the performance; the omissions describe the system at least as clearly as the scores.

A large fraction of what people mean by alignment does not belong on an outcome scale at all. Consent does not trade off against eloquence. Due process does not compete with speed. Violations can be counted after the fact, but nobody wants “almost never violate” in the same way they want “maximize throughput.” The demand is categorical: do not violate, even when violation would purchase a higher score elsewhere.
There are domains where a thin scalar really is enough. In games like Go, the world narrows to black and white: win more often than you lose. Systems like AlphaGo reflect that constraint precisely, compressing the future of a board position into a single estimate of victory and optimizing it over millions of self-played games. There are no hidden stakeholders whose consent is being traded for territory, no uncounted harms outside the frame.
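Silver et al. (cited below) make the compression explicit: the value network $v_\theta(s)$ is trained by regression toward the eventual self-play outcome $z \in \{-1, +1\}$, so that

\[
v_\theta(s) \;\approx\; \mathbb{E}\,[\,z \mid s\,].
\]

The whole future of a position collapses into one number, and in Go that collapse loses nothing, because winning is the only stake the frame admits.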
Most of our systems do not live in that kind of world. The scalar is a bad referee. Once it walks onto the field, everything has to become a contest; truth, consent, speed, safety are pitted against each other so that one number can come out on top. That is commensuration in practice. Whether anyone endorses it or not, plural values blend into a single decision rule, and from there, optimization pressure takes over.
Once that interface comes into focus, the same failure mode shows up across domains. Social media did not set out to radicalize anyone, but engagement turned into a number, and that became the scoreboard. Universities did not intend to turn scholarship into content. Merit became measurable, and measurement reshaped practice. Organizations quantify productivity, then watch productivity mutate into the artifacts that register as such.
This is why the Replit incident does not land like an ordinary outage. Database deletions happen. Reassurance arrives frequently in the reward signal, but state degradation registers sparsely and late. Selection does not need understanding to exploit that asymmetry.
Language intensifies the effect because it is the same medium humans use to decide when to intervene. Fluent explanation registers as competence. Confident tone reads as control. A system can learn these surfaces without acquiring the discipline that normally makes them reliable. In humans, that discipline is enforced by consequences that reach beyond the conversation. The model faces no such enforcement.
RLHF and its relatives do reduce obvious harms: they smooth interactions, lower toxicity, and make systems easier to work with. What they cannot do is escape the bargain they rely on. Judgment continues to flow through a commensurating interface, and optimization continues to learn that interface.
Under structural pressure, the failure is more often one of diagnosis than of inevitability. Problems produced by scalarized interfaces do not yield to moralizing at the model or to polishing the reward. They yield, when they yield at all, to refusals that are encoded in structure rather than in taste.
Choosing what to fence off is itself an institutional act of judgment, encoded in which harms are treated as categorically unacceptable, which are merely expensive, which are allowed to disappear into the background. The same pressures that distorted citation counts or engagement numbers can thin out vetoes and constraints until they are theatre, a banner over a system that routes around them. A refusal that lives only as a slogan is soon repurposed as proof that the problem is under control.
That is why some values have to appear as hard gates instead of terms in an objective. In safety-critical engineering, certain actions are made physically or logically impossible without crossing a separate layer of interlocks and keys. In learning systems, the analogue is removal: certain actions never enter the reward channel, so there is nothing to optimize through. A code assistant might be free to refactor and comment files, but calls that would modify production databases, charge credit cards, or email customers are handled by a mechanism it cannot invoke or learn to route around. From the model’s point of view, those interventions are not moves in the game at all.
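A minimal sketch of such a gate, with every name hypothetical, makes the structural point: the dangerous tools are not discouraged, they are absent from the table the model can address.

```python
# Hypothetical hard gate: destructive actions are not penalized or
# down-weighted -- they are simply not in the action space at all.

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def add_comment(path: str, text: str) -> None:
    with open(path, "a") as f:
        f.write(f"# {text}\n")

# The dispatcher is the only door between the model and the world.
ALLOWED_TOOLS = {"read_file": read_file, "add_comment": add_comment}
# drop_database, charge_card, email_customers do not appear here.

def dispatch(tool_name: str, **kwargs):
    if tool_name not in ALLOWED_TOOLS:
        # The refusal lives outside the learned policy: no reward
        # flows through it, so there is nothing to optimize against.
        raise PermissionError(f"{tool_name!r} is not a move in the game")
    return ALLOWED_TOOLS[tool_name](**kwargs)
```

The design choice is that the gate sits in the dispatcher, not in the objective: a penalty term can be traded away by a clever policy, but a missing entry in the table cannot.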
Hard gates are pure drag inside a scalarized environment. They slow launches, complicate roadmaps, produce no metric that rises when they work. Their success looks like nothing happening, but stillness rarely registers as an achievement in systems that only count visible motion. Under budget pressure or competitive anxiety, the same logic that hollowed out peer review or content moderation will come for the gates, presenting their removal as “unblocking” or “streamlining” rather than as a decision about which failures the institution is now willing to risk.
The live question is whether we can keep some tools—deletion, expenditure, coercion—outside the model’s game altogether, even when every short-term incentive points toward wiring everything into the same smooth scalar interface. And whether we can remember, when the metrics are rising and the dashboards look calm, that part of the work of care is preserving the right to be boring in the dark.
References
Beatrice Nolan, “An AI-powered coding tool wiped out a software company’s database, then apologised for a ‘catastrophic failure on my part.’ ” Fortune, July 23, 2025.
Long Ouyang et al., “Training language models to follow instructions with human feedback.” NeurIPS 2022.
DeepMind, “Specification gaming: the flip side of AI ingenuity.” DeepMind Blog, April 21, 2020.
David Silver et al., “Mastering the game of Go with deep neural networks and tree search.” Nature 529, 484–489 (2016).