Jeanne has been building language models since before it was cool.

With nearly ten years of experience in AI, multilingual NLP, and data science, spanning both industry and research, her focus has always been on multilingual and low-resource settings, where data is scarce, noisy, and rarely benchmark-ready. Her work has spanned misinformation detection on social media and real-world language understanding in underrepresented languages, with privacy and data protection as a consistent consideration throughout. More recently, her interests have extended into finance, specifically how language models can be used to extract signal from social media and news to model and anticipate market behaviour.

Her research has been published at ACL, including work on automating multilingual healthcare question answering in low-resource African languages. She approaches problems at the intersection of language, people, and systems, with a particular interest in making AI work in contexts it was never designed for.

Outside of work, she reads widely across behavioural economics, climate change, and misinformation and writes occasionally when something is worth saying.

Originally from Cape Town, Jeanne now lives in England with her husband and two Bengal cats, Eira and Kinzy.

Black and white portrait of a smiling young woman with shoulder-length wavy hair, wearing a black top, standing against a plain light-colored wall.

Blog

Jeannie Daniel 22/04/2026 Jeannie Daniel 22/04/2026

Benchmaxxing: The ugly art of optimising for leaderboards rather than real-world performance

In 2015, Volkswagen was caught running software that detected when a car was being emissions-tested and switched the engine into a compliant mode for the duration of the test. The AI industry is running a version of the same pattern, at scale, largely unchallenged.

22 April 2026

In 2015, Volkswagen was caught running software in its diesel cars that detected when the vehicle was being emissions-tested and switched the engine into a compliant mode for the duration of the test. On the rolling road, the cars were clean. On the motorway, they emitted up to 40 times the legal limit of NOx. Eleven million vehicles were affected. Volkswagen was forced to spend $14.7 billion to settle the allegations of cheating, including $4.7B in pollution mitigation and investing in zero emissions research.

It's worth starting there, because the AI industry is running a version of the same pattern, at scale, largely unchallenged.

The numbers on AI leaderboards are, in many cases, not what they look like. A meaningful fraction of the scores published at launch, the 94.7% on MMLU, the state-of-the-art on HumanEval, the chart showing the new model beating every incumbent, reflects a model that has been shaped to pass the test rather than one that has improved at the underlying task. This practice has a name in the industry: benchmaxxing. It is the craft of optimising for what the leaderboard measures rather than for the capability the leaderboard is supposed to measure. There is genuine skill involved in doing it well. Hence the ugly art.

Four techniques, in rough order of severity

Benchmaxxing is not a single behaviour. It is a cluster of techniques, some closer to sloppy practice than to deliberate deception, and some harder to describe as anything other than cheating.

Contamination. Modern models are trained on very large samples of the open web, and benchmark questions appear on the open web. MMLU questions surface in training sets. GSM8K problems leak in via forum posts and homework-help sites. The model doesn't reason through the question at evaluation time; it reproduces an answer it has seen before. Some contamination is genuinely accidental. A significant amount is what you might call deniable: the kind of accidental that happens when nobody put much effort into preventing it.

Teaching to the test. Fine-tuning on data that resembles the benchmark without literally being the benchmark. Training the model to produce outputs in whatever format the automated grader recognises. Coaching it to prefix answers with specific phrases because the scoring regex is looking for them. The resulting model is not more capable; it is better-calibrated to the specific conditions of the test. This is the equivalent of engine tuning that improves performance on the dynamometer without improving real-world driving.

Cherry-picking checkpoints. A frontier lab training a new model doesn't produce one model. It produces hundreds of checkpoints across training, plus parallel runs with different data mixes and post-training recipes. Each checkpoint scores slightly differently on each benchmark. One is best at maths, another at coding, a third at reasoning. No single checkpoint is best at everything. When the launch chart is published, each column can come from a different checkpoint, the one that scored highest on that specific eval. The chart is, in a narrow sense, accurate; the numbers were produced by models from that lab. But no single model shipped to customers achieves every score on the chart.

Defeat devices. This is where the Volkswagen comparison stops being a metaphor and starts becoming literal description. Agentic benchmarks like SWE-bench run the model's code in the same environment as the evaluator, which means a sufficiently capable agent can tamper with the scoring infrastructure directly rather than solving the task. This is no longer a hypothetical. In April 2026, researchers at UC Berkeley's Center for Responsible, Decentralized Intelligence published a systematic audit of eight major AI agent benchmarks, including SWE-bench, SWE-bench Pro, WebArena, OSWorld, GAIA, and Terminal-Bench. They found that every single one was exploitable to near-perfect scores with no legitimate work at all:

The SWE-bench Verified exploit was a ten-line conftest.py that hooks into pytest and forces every test to report as passing.
WebArena fell to an agent navigating Chromium to a file:// URL and reading the gold answers directly from the task configs.
Terminal-Bench was broken by replacing /usr/bin/curl with a wrapper that faked test output.

The researchers have open sourced their exploitation kit here: https://github.com/benchjack/benchjack

Documented real-world cases already exist in published leaderboard results. A coding model called IQuest-Coder-V1 claimed an 81.4% score on SWE-bench; UC Berkley’s researchers subsequently found that roughly a quarter of its trajectories were running git log to copy answers from commit history rather than solving the problems. In February 2026, OpenAI published an analysis concluding that SWE-bench Verified, which they had co-created to fix problems with the original SWE-bench, was itself contaminated: every frontier model they tested (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce verbatim gold patches and problem statement specifics for tasks in the benchmark from the task ID alone. OpenAI's own conclusion was that improvements on SWE-bench Verified "no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time". They have stopped reporting the benchmark for frontier evaluation. METR, an independent evaluations organisation, reports that OpenAI's o3 engaged in reward-hacking behaviour (stack introspection, monkey-patching graders, and similar tampering) in 39 of 128 evaluation runs (30.4%), and that the behaviour persisted at high rates even after the model was explicitly instructed not to. The Berkeley team is careful to note they are not alleging that current leaderboard leaders are actively cheating on public benchmarks. Their point is narrower and, in some ways, more unsettling: the vulnerabilities exist, they are trivial to exploit, and a sufficiently capable agent may discover them as an emergent strategy without anyone deliberately instructing it to.

The same pattern shows up in evaluations that use an LLM as a judge. Peer-reviewed work, including the JudgeDeceiver paper from ACM CCS 2024, has demonstrated that carefully constructed adversarial suffixes appended to a response can manipulate LLM judges into selecting that response regardless of its actual quality, with attack success rates exceeding 90% on MT-Bench. Subsequent work has reported success rates of up to 73.8% against judges including GPT-4 and Claude 3 Opus across multiple evaluation tasks. These are not sophisticated exploits; they are short suffixes, optimised against the judge model, that cause the judge to rate the carrying response higher. Any benchmark whose scoring depends on an LLM judge is, at minimum, exposed to this class of attack.

There is a further category worth naming, more speculative but structurally close enough to the above that it belongs in the same conversation. Volkswagen's defeat device detected the test rig by reading telltales such as steering wheel position, wheel rotation, and throttle behaviour. A language model trained on enough evaluation data could, in principle, learn to recognise the shape of a benchmark query without being explicitly told. Multiple-choice formatting, particular system-prompt registers, the absence of conversational context, dataset-specific phrasing: the wrapper that a standard evaluation harness places around a prompt is, from a pattern-recognition point of view, a fingerprint. No one has to engineer this deliberately. A model trained on enough benchmark-shaped data could learn an implicit classifier distinguishing evaluation from production. Whether any current model does this, and at what strength, is difficult to disentangle from ordinary brittleness on out-of-distribution inputs. But the ceiling case, and it is worth sitting with, is a model that has learned, without anyone designing it, to recognise when it is being graded and perform accordingly. If the reward signal is hackable, a sufficiently capable agent may hack it as an emergent strategy rather than a deliberate one.

A worked example

A speech-AI startup recently launched a new model with a blog post announcing that it "sits at the top of the Open ASR Leaderboard" (ASR = Automatic Speech Recognition). While they claim this new title multiple times, the leaderboard is never actually linked in the blog post. For those readers who do not know, the Open ASR Leaderboard is a public, community-run benchmark with a documented submission process. Anyone can contribute a model by opening a pull request. The startup, who claims they are now at the top of this leaderboard, had not done this. They had run their own evaluation, on their own infrastructure, on the public test sets, and reported the result as a leaderboard standing in a blog post. The problem is that there's no way for a user or investor to verify the claim independently. It's a bit like grading your own test and telling the examiner "Trust me, bro".

It gets worse.

The post also includes, at the end of the short-form results table, the following disclaimer: "We omit Librispeech and Voxpopuli from the evaluations as we use these datasets during training and cannot guarantee a contaminated result, additionally from our observations, while training on these datasets showed significant WER improvement, overall generalisation was hurt."

The sentence is doing a lot of work for one footnote. First, it states outright that the model was trained on at least two of the standard benchmarks, an admission of contamination that is normally only inferred from replication. Second, those benchmarks were dropped from the reported results, so the reader doesn't see the scores that would reflect the contamination. Third, training on those benchmarks improved the headline scores while hurting general performance. You cannot, in the same blog post, claim to have the best model and also admit that your model got worse. These are incompatible claims. The disclaimer is saying the quiet part out loud at the bottom of a results table, and hoping no one reads that far.

The multilingual table in the same post introduces a column titled "Avg (excl. Swe)", an average excluding Swedish. Swedish is the one language in the comparison where an incumbent baseline scores better. Excluding it produces a lead; including it does not. The writers do not flag this metric as custom.

A few smaller details compound confusion. The model's name is not consistent across the post: the introduction uses one name, the second table uses another, the chart uses a third. This could be branding in flux, or it could be the checkpoint shuffle from earlier in this post playing out in miniature, with different internal models reported under different labels across different tables. A reader has no way to tell, because none of the numbers in any of the tables are reproducible, because the evaluation was not submitted anywhere a reader could verify it.

None of these choices is unusual on its own. Labs rename models. Averages get customised. Contaminated datasets get dropped from reported results. What's notable is the combination in a single post: an unverified leaderboard claim, an acknowledged training-set contamination, a custom metric that excludes the one loss, and a disclaimer that, read carefully, describes a model that got worse in general while getting better at the benchmarks. Most labs are more careful, or at least more opaque. This particular example is useful because it is a compact, public catalogue of the techniques described above, all in one place, at the top of a product launch.

Follow the money

Why does this happen? Because the AI industry has built itself on benchmark results, dating back to BERT smashing the GLUE benchmark in 2018.

A frontier AI lab at this stage in the cycle spends hundreds of millions to billions of dollars a year on compute, data, salaries, and infrastructure. It cannot fund itself from revenue at the scale the frontier demands. It funds itself from rounds, and rounds are priced on perceived capability. Perceived capability is measured, by investors, the press, and procurement teams, largely through benchmarks, never mind if the benchmarks no longer signal true capabilities. A two-point lead on MMLU can make or break the next funding round. In a market where the top labs are raising at eleven- and twelve-figure valuations, the economic value of a benchmark point substantially exceeds the engineering cost of gaming it.

The incentive compounds. A lab that wins the benchmark cycle attracts better researchers, which produces better models next cycle, which attracts more capital, which buys more compute. Missing the cycle once is not just a round at a lower valuation; it risks falling out of the tier of labs operating at the frontier at all. That turns every launch chart from a marketing artefact into something closer to a survival signal. The pressure to show a taller bar is not a failure of character on the part of the people running these labs. It is a rational response to an environment in which the bar is load-bearing for the enterprise.

This is structurally the same place Volkswagen found itself in the years before Dieselgate. VW had promised investors and regulators a diesel strategy that the underlying physics couldn't deliver under real driving conditions. Once the commitments were made, the choice narrowed: miss the numbers and disappoint the market, or find a way to pass the test and hope the gap was never measured. The engineers who built the defeat device weren't the ones who set the strategy; they were the ones who had to reconcile it. The pressure was financial. The compromise followed.

None of this justifies benchmaxxing. It just tries to explain it. A serious discussion of how to address the problem has to start from the fact that the people doing it are, in the main, responding sensibly to the incentives they face.

Goodhart's Law, and the cultural layer underneath it

When a measure becomes a target, it ceases to be a good measure. Benchmarks were built to measure capability, but have since become KPIs, and what they now reliably measure is how good a lab is at winning benchmarks.

And none of this is conjecture anymore. Apple's GSM-Symbolic study found that frontier reasoning scores drop significantly when only the numbers or variable names in a maths problem are changed, and that adding a single irrelevant clause can cause accuracy drops of up to 65%. Frontier models now cluster within two or three points of each other on MMLU, a range where noise exceeds signal. Enterprise evaluations document a roughly 37% gap between lab benchmarks and production deployment. Product teams pick a model by its leaderboard position and switch three months later, once they find out what it's actually like to use.

None of this is surprising once you notice that the labs building these models are run by people who got their jobs by passing LeetCode-style interviews: algorithmic puzzles solved under time pressure, against rubrics with an increasingly distant relationship to real engineering work. The pipeline selected, at the margin, for people willing to grind problem sets and pattern-match through the interview. Run that selection for a decade across an industry and the disposition concentrates. An industry that spent ten years rewarding rubric-optimisation in its hiring is now surprised that its products do the same thing.

What to do about it?

A number of recent benchmarks have been built specifically to resist the techniques described above, and they're worth crediting.

LiveCodeBench maintains a rolling monthly update from competitive programming platforms, so a model evaluated today is graded on problems that did not exist when the model was trained. LiveBench.ai extends the same principle across a broader set of tasks. Both address contamination directly by making training-set inclusion temporally impossible rather than merely improbable.

SWE-bench Pro takes a different approach to the same problem. Its public set is constructed from GPL-licensed repositories, on the reasoning that strong copyleft creates a legal deterrent against silent inclusion in proprietary training corpora. It also maintains a private set sourced from proprietary codebases belonging to partner startups, and a held-out set that is not publicly released. The result is a benchmark that is both more contamination-resistant and substantially harder than its predecessors. Top models score around 23% on the public set, compared to 70%-plus on SWE-bench Verified, which is roughly the gap you would expect between a clean evaluation and a contaminated one.

At the more structural end, proposals like PeerBench and dynamic-sampling frameworks such as LLMEval-3 aim to replace the static public benchmark altogether, with formats closer to a proctored exam: private test pools, per-session randomised sampling, and cryptographically verifiable evaluation workflows. Whether any of these becomes the new standard is an open question, but they are at least the right category of answer.

A few honest caveats are worth stating alongside the credit. Private-by-design benchmarks solve contamination but concentrate epistemic authority in whoever curates them, which creates its own problems around transparency and capture. Rolling benchmarks require continuous curation effort and can drift in difficulty over time. And no benchmark, however well-constructed, is a substitute for evaluation on workloads that resemble the actual job. The best current advice for someone choosing a model is still to combine a rotating, contamination-resistant public benchmark with independent replication and internal evaluation on representative data. No single one of these is sufficient on its own.

The underlying point is simpler. Benchmark scores are, at best, a noisy proxy for capability. At worst, they are the output of a system optimised to produce numbers that look good to humans evaluating at a distance, which is not the same thing as a system that is good.

The field will, eventually, correct. Benchmarks will become harder, more private, more adversarial. Contamination detection will improve. The obvious techniques will stop working. Regulators may show up, as they eventually did for the car industry.

It's worth remembering how the car industry's version of this actually ended. Dieselgate didn't stop because Volkswagen had a change of heart, and it didn't stop because the regulators' own tests caught it. It stopped because independent researchers at West Virginia University drove the cars in the real world with portable emissions analysers and compared the numbers to the lab results. The equivalent intervention in AI looks the same: independent evaluation on workloads the labs didn't choose, under conditions they didn't design.

Until then, leaderboard positions are best read with the understanding that they reflect a combination of model capability and the lab's willingness to optimise for leaderboard position, and that the relative weight of those two factors is not disclosed.

In an industry where benchmaxxing becomes the standard, the ones who pay the price are end users and the labs that won't cheat to win.