Nobody is Doing AI Benchmarking Right

By Chapin Lenthall-Cleary and Cole Gaboriault

 

As LLMs and other forms of AI have become more capable, interest has steadily grown in determining how “smart” they really are. Discussion tends to circle, often obliquely, around the following cluster of questions: Are the models as smart as people? Which people? How smart are those people anyway? What do we even mean by “smart”?

These questions suggest a straightforward approach. Obviously, the quality of being smart, or “intelligence,” can be possessed in different amounts by different people and different models. We want to determine the intelligences of models and people and compare them to each other; that is, we want a reliable test of intelligence that we can administer to both models and people – an intelligence benchmark. Even without settling on a definition for intelligence, it’s clear that the best strategy for designing such a benchmark is to start by developing it on people, because we have a more robust intuition for people’s intelligence and have more existing research to build upon. Any test that accurately, directly, completely, and exclusively measures intelligence in people will generalize immediately to models (assuming it can be administered through an appropriate modality, such as text); if a model and a person that we intuitively believe have the same intelligence receive different scores (or ones we believe differ in intelligence receive the same score), then by definition either our intuition is wrong, or the test is not actually measuring intelligence accurately, directly, completely, or exclusively – though it may have appeared to be – and needs to be improved.

The first major lesson to take from this is that extensive data on different people’s performance on an intelligence benchmark is foundational to its usefulness; such data is our main tool to ensure that it actually measures intelligence, and our only tool to calibrate the interpretation of scores. This is true of nearly all benchmarks: the gold standard would be a full, population-representative distribution of performance and correlations of performance with other relevant variables (IQ, age, years of experience, or level of education could all be useful and interesting depending on the benchmark). Sometimes a full distribution will not wind up being that interesting, especially for difficult or knowledge-based tasks where the majority of the population will cluster around the performance floor; in these cases, and in cases where it is impractical to obtain a full distribution, having more limited distributional information (including how and from which groups people in the distribution were selected) is still useful and important: distributions among math postdocs and math undergrads for a PhD-level math benchmark, for instance. If data is very limited, it is still useful to report the performance of a few people whose place in the distribution can be estimated from other factors (for a coding benchmark, say, a competitive coder and a smart person who’s taken a single coding class), along with a best estimate of a few points in the distribution – though this of course relies heavily on the trustworthiness of the estimator.
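As an illustration of why the full distribution matters (using entirely made-up scores and an invented covariate, not data from any real benchmark), here is the kind of analysis that distributional reporting enables and a single “human performance” number does not:

```python
import numpy as np

# Hypothetical scores (out of 100) for a sample of people on some benchmark,
# plus one covariate we might care about (years of relevant experience).
# All numbers are invented for illustration.
human_scores = np.array([12, 18, 25, 31, 33, 40, 44, 52, 57, 63, 70, 81, 88])
years_experience = np.array([0, 0, 1, 1, 2, 2, 3, 4, 5, 6, 8, 10, 12])

model_score = 58

# With the full distribution, a model's score becomes a percentile rather than
# an uninterpretable "X points above the human average".
percentile = 100 * np.mean(human_scores < model_score)
print(f"Model outscores {percentile:.0f}% of the sampled people")

# Correlations with covariates tell us what the benchmark actually loads on.
r = np.corrcoef(human_scores, years_experience)[0, 1]
print(f"Correlation of score with experience: r = {r:.2f}")
```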

Unfortunately, almost all benchmarks offer only a single score as the “human performance threshold,” or, even worse, none at all. It’s worth emphasizing how egregious this is: a single “human performance” value paints a very, very rough picture. Which human’s performance? Without further information, that could be a wide range. Given the common methodologies for producing these values, it’s probably (hopefully) somewhere between an average person and a moderately smart person. But even if the single value is a perfect estimator of the average person’s (or coder’s, mathematician’s, etc.) score, crucial information is missing. Suppose a model scores 20% better than the average person; does that put it on par with the 60th percentile, or above geniuses? Moreover, since these values are usually reported as “averages,” the groups responsible for the benchmarks must already have, at a minimum, the distribution of scores from the small sample of people they computed the average over. Where is that data? A minority of benchmarks at least report values for both “human performance” and “expert performance,” which is helpful in giving some sense of the scaling, but it would still be better to report all their data, and even better to measure and report larger, less biased distributions.
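To make the ambiguity concrete, here is a toy calculation (with invented normal distributions, not real benchmark data) showing that “20% better than the average person” can mean wildly different things depending on the spread of people’s scores:

```python
from scipy.stats import norm

mean_human = 50      # hypothetical average human score (out of 100)
model_score = 60     # "20% better than the average person"

# If human scores are spread widely, the model is merely decent...
print(norm.cdf(model_score, loc=mean_human, scale=25))  # ~0.66 -> ~66th percentile

# ...but if they cluster tightly, the same score is beyond nearly everyone.
print(norm.cdf(model_score, loc=mean_human, scale=3))   # ~0.9996 -> above the 99.9th percentile
```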

Reporting little or no information on people’s performance allows for all sorts of sleights of hand and misunderstandings around models’ capabilities, both by benchmarkers and in public discourse. Benchmark creators who want to make their benchmarks look more difficult to saturate (and therefore impressive) than they actually are can use above-average people to calculate their “human performance” thresholds (or do even shadier things, as discussed below in the case of ARC-AGI). In public discourse, someone can easily cite a model’s unimpressive-sounding absolute score on a benchmark when it actually outperformed most or all people, or an impressive-sounding one when it underperformed, and there is no readily available data to confront us with the important questions of how many of us the models can outperform and what their scores actually mean.

Relatedly, we believe the default of reporting a single “human performance” value arose, among other reasons, because of the language and mental frame surrounding strong definitions of AGI, where an AGI is considered to be an agent that can “do anything humans can,” meaning “match or exceed the performance of the best humans at any task.” (To us, this sounds like weak ASI, but it is nevertheless commonly used as a definition of AGI.) The natural threshold against which to measure models under this definition is indeed the performance of the best person at a given task, but since that threshold is conceived of as “doing anything humans can,” the performance of the best person just becomes “human performance.” Even those who don’t subscribe to such definitions have fallen into the linguistic trap of talking about “human performance” as a well-defined single value and treating it as such when creating benchmarks. But it isn’t: for a given task, a person’s performance can vary wildly depending upon his or her intelligence, knowledge, experience, and many other factors.

 

This returns us to the question of intelligence. There’s no definition of intelligence that’s universally accepted, but we believe most people agree that reasoning and novel (non-arbitrary) problem-solving ability are at least major components of what they would call intelligence. Perhaps more importantly, these faculties are helpful for most tasks and necessary for many, especially those that have the potential to make models genuinely transformative – or dangerous. Indeed, there’s a strong case that they are the most important faculties in this respect. We propose that a reasoning and problem-solving benchmark is a good approximation of a benchmark of the components of intelligence most people agree on, and also an independently crucial tool for understanding a model’s general abilities. Hereafter, we will refer to such a benchmark as an “intelligence” benchmark for brevity, though we are not claiming to have identified a complete definition of intelligence.

The obvious candidate for such a benchmark is IQ. Unfortunately, IQ tests have issues that make them dubious for people and farcical for LLMs. The largest of these issues is that they load very heavily on knowledge, which models are superhuman at and most people agree is not part of intelligence anyway, and processing speed, which has no clear meaning for models (see the discussion below on time horizons) but at which, as typically run, models are effectively superhuman. These peculiar loadings stem from a deeper problem with the theoretical foundation of IQ – in short, the positive correlation between all cognitive abilities is taken to show the existence of a single explanatory factor (read: first principal component) underlying them, called “g,” and tasks are selected for inclusion on an IQ test to maximize its overall correlation with g. As discussed above, if a test has problems when used on people, we should not expect it to generalize to models; and indeed, attempts to administer IQ tests to models have yielded comically inflated results that fly in the face of all intuition, such as a score of 136 for o3 on the Mensa Norway IQ test (and allegedly higher scores on other tests). Conversely, the fact that a test produces absurd scores for LLMs is itself evidence that it is less than perfect for people. This is not to say that IQ is useless, or entirely fails to measure reasoning, problem-solving, or even “intelligence” for many definitions; it’s arguably at least mediocre at doing so for people. But the noise created by its issues is massively magnified for models, and especially for comparisons between models and people.
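For readers unfamiliar with the construction, here is a minimal sketch (on synthetic “subtest” scores generated from a single latent factor, not real psychometric data) of what it means for g to be the first principal component of a positively correlated battery:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic battery: every subtest is a noisy reflection of one latent factor,
# so all pairwise correlations come out positive (the "positive manifold").
n_people, n_subtests = 1000, 6
latent = rng.normal(size=(n_people, 1))
scores = latent + 0.8 * rng.normal(size=(n_people, n_subtests))

# Standardize, then take the first principal component of the correlation matrix.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
g_loadings = eigvecs[:, -1]               # loadings on the first PC ("g")
g_share = eigvals[-1] / eigvals.sum()     # variance attributed to "g"

print(np.round(g_loadings, 2))
print(f'"g" explains {g_share:.0%} of the variance in this toy battery')
```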

As the example of IQ suggests, it’s surprisingly easy to make a mediocre intelligence benchmark (as easy as PCA, or even just finding some questions that seem like they should be decent at measuring intelligence and getting mildly lucky), but shockingly difficult to make a good one. Our best attempt to do so is Starburst, a game where the objective is to determine the laws of physics in a fictional universe from celestial observations. Basically no math or physics background is required (or, seemingly, even helpful). Starburst was originally intended as an intelligence test for people; when early testing showed promising results for it and tests like it, we began using it to benchmark models, and found that the models’ performance relative to people and each other matched our intuitions, and that models’ performances were clustered in roughly the expected range with a clean progression as they advanced. Unfortunately, for practical reasons including Starburst taking people roughly 5-50 hours (please fund us), we have very little data, though we’re working to get more data on Starburst and some shortened variants. A partial Starburst leaderboard can be found at tinyurl.com/starburstleaderboard.

At this point, some of you are probably wondering about ARC-AGI. It’s a well-known and very well-funded benchmark that tries to measure something like reasoning and problem-solving (as they frame it, the ability to learn new skills). The issue is that it sucks. ARC-AGI consists of visual puzzles where the test-taker is presented with several grids of colored cells grouped into input-output pairs; the test-taker has to determine the rule by which the output grids are derived from the input grids and apply that rule to a new input grid. As a reasoning test, like IQ, it’s mediocre but not terrible. It heavily loads on human-like perception (artificially deflating models’ scores relative to people’s). In some cases, it loads upon knowledge of conventions (for example, legends on the periphery of the input meant to be read left-to-right as in public eval v2 #2 (3e6067c3) and other tasks). Some of the answers are ambiguous or arbitrary. Even ignoring the above issues, it’s debatable to what extent it assesses true problem-solving ability versus whether one has similar intuitions to the creators. Whatever reasoning it does assess is limited to shallow spatial reasoning. It has a limited range of discernment. (Cole, my co-author who emphatically did not saturate Starburst, scored perfectly on the first 13 ARC-AGI-2 public eval tasks using their recommended pass@2, and got 12 of them correct on the first try.) Even still, because many cognitive faculties are correlated in people, and because it’s easy to make a mediocre intelligence test, ARC-AGI seems to clear that mediocre barrier.
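For reference, the pass@2 rule mentioned above just means a task counts as solved if either of (up to) two submitted answers exactly matches the target; a minimal sketch with made-up toy grids (not real ARC data):

```python
# Under pass@2, a task counts as solved if any of (up to) two attempted
# output grids exactly matches the hidden target grid.
def pass_at_2(attempts_per_task, targets):
    solved = 0
    for attempts, target in zip(attempts_per_task, targets):
        if any(attempt == target for attempt in attempts[:2]):
            solved += 1
    return solved / len(targets)

# Toy example: two tasks with 1x2 grids; the second is solved only on the retry.
targets = [[[1, 2]], [[3, 3]]]
attempts = [([[1, 2]],), ([[3, 0]], [[3, 3]])]
print(pass_at_2(attempts, targets))  # 1.0
```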

After o3-preview saturated ARC-AGI, the team behind it released ARC-AGI-2. Aside from creating tasks that are overall more difficult, a fact they like to gloss over, their strategy was to find the kinds of tasks that models perform especially badly on relative to people, and create ones similar to those. While this should increase loading on any facets of intelligence that models still lack, it should also (probably more strongly) increase loading on human-like biases and perceptual style. This seems to have happened: anecdotally, ARC-AGI-2 scores correlate worse with both models’ Starburst performance and our intuitions about their intelligence than ARC-AGI-1 scores do.

Though the ARC-AGI tasks are mediocre, their published information on people’s performance is downright deceptive. Their core claim is that ARC-AGI tasks are “easy for humans, but hard for AI”. They cite a panel of people achieving perfect performance on a task set of which, at time of writing, the best general model solves 9% and the best narrow model solves 15%. This isn’t, mind you, asking a panel of people to agree upon a solution. When they report that a panel scores 100% on ARC-AGI-2, they actually mean that “every task in ARC-AGI-2 has been solved by at least 2 humans in under 2 attempts”. The actual average performance of the hundreds of people on the panel was 66% of tasks “attempted” (60% is also cited on their website, seemingly describing the same number). “Attempted”, here, means “any task view lasting longer than 5 seconds”. Participants had a fixed amount of time, a monetary reward for correct solutions, and no penalty for wrong ones, so they were strongly incentivized to skip questions that looked difficult; the 66% average therefore describes performance on easy-looking questions, with performance on the rest an unknown amount worse. In other words, a (likely biased) sample of people was tested in an environment quite different from the models’, in a way that artificially inflated their scores. Given all of this, the only truthful statement about people’s fair average performance on ARC-AGI-2 is that it lies somewhere between 0% and 66%.
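The gap between the two framings is easy to see on a toy response matrix (entirely invented numbers, chosen only to illustrate the arithmetic): “every task solved by at least 2 people” and “average per-person accuracy” are very different summaries of the same data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: 100 people x 50 tasks, with each person solving each
# attempted task with only 40% probability.
solved = rng.random((100, 50)) < 0.4

# Panel-style claim: every task is solved by at least 2 people.
all_tasks_covered = (solved.sum(axis=0) >= 2).all()

# What a single test-taker actually experiences: average per-person accuracy.
avg_accuracy = solved.mean(axis=1).mean()

print(all_tasks_covered)      # True: "100% of tasks are solvable by humans"
print(f"{avg_accuracy:.0%}")  # ~40%: the typical person's score
```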

 

Though it’s not intended to measure reasoning or problem-solving, there is one more existing benchmark related to other components of what people call intelligence that is fairly good (compared to most other benchmarks) and worth discussing: METR’s time horizon benchmark. The idea stems from the basic observation that models tend to be capable of tasks that people can do very quickly, while they struggle with tasks that take people a long time. METR tests models on a variety of tasks (mostly coding), then reports the length (measured by how long the tasks take people to complete) of the longest tasks that a model can complete with 50% and 80% reliability.
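Roughly speaking (this is our simplified sketch of the idea, not METR’s actual code or data), such a horizon can be computed by fitting a logistic curve of model success against the log of human completion time and solving for the task length at which predicted reliability crosses 50% or 80%:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented results: how long each task takes a person (minutes), and whether
# the model succeeded on it.
human_minutes = np.array([1, 2, 4, 8, 8, 15, 15, 30, 30, 60, 60, 120, 240, 480])
model_success = np.array([1, 1, 1, 1, 1, 1,  0,  1,  0,  1,  0,  0,   0,   0])

X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_success)

def horizon(reliability):
    """Task length (in minutes) at which predicted success equals `reliability`."""
    b0, b1 = clf.intercept_[0], clf.coef_[0][0]
    # Solve sigmoid(b0 + b1 * log(t)) = reliability for t.
    return float(np.exp((np.log(reliability / (1 - reliability)) - b0) / b1))

print(f"50% horizon: {horizon(0.5):.0f} min; 80% horizon: {horizon(0.8):.0f} min")
```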

Unfortunately, this benchmark too suffers from not reporting any distribution of people’s performance. It might seem like this would be impossible; after all, people can complete all tasks in the benchmark with high reliability, so isn’t their time horizon effectively infinite? Technically yes, but comparing a model to a person with unlimited time isn’t reasonable: models all have limited inference-time compute, and even without those limits, they will eventually run up against their context length or, more often, a shorter effective length limit we have been calling “attention length” (since it seems to be a limit on how much input a model can “pay attention” to before losing track of details).

A fairer and more useful comparison is between a model and a person with a time limit. Ideally, we would test people repeatedly on the same tasks with different time limits to see the full relationship between task, time limit, and completion reliability for each person; however, since it is impractical and often impossible to test a person more than once on the same task, it is a reasonable compromise to just give people the task and see how long it takes them to complete it, assuming that their reliability is close to 100% at and above that time limit, and would drop significantly for time limits any shorter. METR does this, but only reports a single completion time for each task. Which person’s completion time?

 

Given the state of these attempts, we believe that developing a good intelligence (i.e. reasoning and problem-solving) benchmark with robust data on both models’ and people’s performance is currently the most important unsolved problem in the field of benchmarking – indeed, one of the most important unsolved problems in psychology – and the key to an invaluable tool for assessing and mitigating AI risk.

And to reiterate, the importance of robust data on people’s performance applies to all benchmarks, not just those related to intelligence. All data collected on a benchmark must be reported. If at all possible, that should be a full distribution of performance, along with information about how and from which population people were sampled (one possible minimal reporting format is sketched after the list below). If that is impractical, at a minimum, benchmarkers must:

  • report performance for at least two people or groups of people who fall at different points in the overall performance distribution;
  • provide good-faith estimates of or information about where they fall in the distribution, possibly including performance on other relevant tasks or tests; and
  • make every effort to evaluate people and models under analogous conditions conducive to direct comparison of performance, and report these methods transparently.
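To make these requirements concrete, here is one possible shape for such a report, sketched as a Python data structure; every field name and value is hypothetical, not an existing standard:

```python
# A hypothetical minimal reporting format for human baselines on a benchmark.
# All names and numbers are illustrative only.
human_baseline_report = {
    "groups": [
        {
            "description": "competitive programmers recruited via contest mailing lists",
            "n": 25,
            "selection_method": "self-selected volunteers, paid per task",
            "score_distribution": [41, 47, 52, 55, 58, 61, 63, 66, 70, 74],  # deciles of the score distribution
            "estimated_place_in_population": "top ~1% of working coders (authors' estimate)",
        },
        {
            "description": "undergraduates after one programming course",
            "n": 30,
            "selection_method": "course roster, course credit for participation",
            "score_distribution": [5, 8, 11, 14, 16, 19, 22, 25, 29, 35],  # deciles of the score distribution
            "estimated_place_in_population": "roughly median among people who have coded",
        },
    ],
    "conditions": {
        "time_limit_minutes": 120,
        "tools_allowed": "same editor and internet access as the models' scaffolding",
        "identical_to_model_eval": True,
    },
}
```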

A benchmark whose creators are aware of these requirements and still fail to meet them should not be taken seriously. Withholding data is generally not a sign that the data actually corroborates the stated conclusion; it is usually an indication of incompetence or deliberate deception.

 

We intend to continue our work on Starburst, other intelligence benchmarks, and our hobby research on intelligence and LLMs more broadly. If you are interested in taking Starburst, have any critiques of our post, or want to contact or collaborate with us for any other reason, please don’t hesitate to reach out at chapinalc@gmail.com.
