Epoch AI offers a necessary pivot toward scalable, AI-driven evaluation as traditional benchmarks rapidly hit their ceiling. It is a pragmatic admission that we must now use the technology itself to measure a progress that has outpaced human-centric testing.
Deep Dive
Are AI benchmarks doomed?
I'd almost say we're living through a golden age of benchmarking. Models were not that capable, and there was only so much for benchmarks to say. Now models are much more capable, but this just means there's much more for benchmarks to potentially tell us.

"Oh, well, this benchmark has all these flaws. Was it really worth all this effort?" Well, think about how unhappy you'd be if you had nothing at all. That does seem like we'd be in a much worse position.

Where do we go next with benchmarks? What exactly do benchmarks look like in the future?

I think there are some benchmarks that might, I mean this loosely, survive the singularity.

So AI benchmarks seem to have a really big problem right now. If you look at all of the AI benchmarks, it seems like most of them are saturating really, really quickly. And by really quickly, I mean months for most of them. If they're really, really good, maybe they'll last a year or two. But for the most part, it seems very hard to build a benchmark that lasts a long time. So there's a looming question behind all of this: are AI benchmarks doomed?
So today I am joined by two colleagues at Epoch AI to help answer this question. Would you like to introduce yourselves?
>> Hi, I'm Greg Burnham. I run our benchmarking team at Epoch.
>> I'm Tom Adamczewski. I'm a senior research engineer on the benchmarking team, and I work on developing new benchmarks.
>> And I am Anson Ho. I'm a researcher at Epoch. To start off, I'd like to get a quick vibe check of where you guys stand on whether AI benchmarks are doomed. So, what do you think?
>> Yeah. So, I think benchmarks will continue to be important as long as people want some kind of qualitative description of what an AI system can do, or want to quickly compare which model is better when a new one comes out. So it seems like we're stuck with benchmarks regardless of their many flaws, just because there is this obvious demand for information, this gap that they fill. We might be in a situation where benchmarks are less useful than they used to be, in that they explain less of all that we might want to know about an AI system's performance. But they still add information, and so people are going to continue to release new benchmarks and look at benchmark results.
>> I'm a bit more of an optimist here. I'd almost say we're living through a golden age of benchmarking. It used to be that models were not that capable, and there was only so much for benchmarks to say. Now models are much more capable, but this just means there's much more for benchmarks to potentially tell us. So maybe, as Tom was saying, the percentage of questions you might want benchmarks to answer that they actually answer is shrinking, but the amount of information we're gleaning from benchmarks is, in some sort of absolute terms, growing. I think this is very exciting. I think benchmarks will survive and be important, even potentially central, so long as there are things we are curious whether AI systems can do, and it seems like there are still plenty of questions about what AI systems can do. I think there are some benchmarks that might even survive, I mean this loosely, the singularity.
>> One thing I'd want to understand better is why some people are so much more pessimistic. I imagine some researchers in AI safety would probably say: well, if you look at benchmarks like FrontierMath, the researchers put a lot of effort into trying to make these benchmarks last for quite a bit of time. And it seems like within one or two years, which is already relatively good for some of these benchmarks, they're getting to the point of saturation, and now we're having to spend millions of dollars to build these benchmarks. Can we really keep doing this? If it costs millions of dollars and the gains are maybe not that high, maybe it's just hard. I'm curious what you guys think about that.
>> Yeah, I mean, I think what you said about the gains not being high, that's really the key. Yes, I agree that as the tasks AI can do get more and more impressive, creating benchmarks for those tasks becomes more and more costly. So then it just depends on whether the benefit side is high enough, and I suspect that it will be, because as AI gets more powerful it's just more important to know what it can and can't do, or which AI systems are better than others. In the same way that everything is increasing, like AI companies' compute spend, the cost that benchmark developers spend on developing new benchmarks is also increasing a lot. I think this is fine as long as people care enough about the answers we get from these benchmarks. Maybe I'm slightly caricaturing your pessimist here, but I feel like it can sometimes come from: "Oh, well, this benchmark has all these flaws. Was it really worth all this effort?" Well, think about how unhappy you'd be if you had nothing at all, if literally all benchmarks were saturated. That does seem like we'd be in a much worse position. And if we were in that world, the premium on being the one team in the world that has a non-saturated benchmark would be huge.
So I do think that costs and benefits might basically keep pace with one another. I think it's not crazy to measure the benchmarking budget as a percent of the revenue of AI companies.
I also just wouldn't underestimate human cleverness. I do think benchmarking used to be kind of too easy: it was too easy to make a benchmark that was at zero. Now you have to be more clever to find a benchmark that is unsaturated, and sometimes you'll be wrong about what is or is not saturated. But that's a fine tradeoff. We should generally be happy to have opportunities to exercise our cleverness and try. And I think there are some historical examples. I think part of where this pessimism might be coming from is that we have just seen this big qualitative ability spike, with coding agents starting to just work. This means that some tasks we had put in benchmarks thinking they were hard are doable now. I would just point out that this has happened at least twice before. Once was around GPT-4, call it, when models could suddenly do all these easier question-answering or language-manipulation tasks. Some benchmarks were saturated, and people did have to be clever to come up with harder benchmarks. Fine. Then reasoning models came out, and suddenly some math benchmarks were saturated. If we feel a little shell-shocked right now, that's understandable. But if you just look around at the world, there are plenty of things AI systems can't do, and if you have to spend some more money on them, fine. You can have benchmarks that survive these paradigm shifts.
I think GPQA is a really good example of this. It was made at the end of 2023, before reasoning models in their current form were even on the horizon, and I would argue it was only really saturated two years later, in the winter of 2025. I think that's impressive. Reasoning models definitely did better on it, and there's a big spike around o1, but it wasn't totally saturated. It was a high-effort benchmark, though. You had to get these experts, with many experts reviewing each question and testing out each other's questions, so you could tell that the chemistry questions were really hard for the physicists. Anyway, it was more effort and more expensive. People paid for it, and it was worth it. While some benchmarks that were supposed to be hard, in math for example, were completely saturated when o1 came out, GPQA wasn't like this. So we'll have some wins and losses on this metric.
The last thing I'll say is that a saturated benchmark is not a problem. Even having a benchmark that is saturated upon release, at 100%, because you started developing it four months ago and AI progress happened to hit the nail right on the head, is very useful to know, because it dramatically reduces your uncertainty about what this qualitative feeling, this vibe, of AI progress actually means in terms of numbers. So while it's a little disappointing if your benchmark is saturated on release, I still think it can be quite valuable, and maybe there are some lessons we can learn about how to build benchmarks like this. We can maybe come back to that. But I just feel like this pessimism is over-updated.
>> I guess one counterargument that comes to mind is this. Cost is one thing that maybe we're willing to pay a lot more for, because we at Epoch believe that it's very valuable to have these kinds of benchmarks. But then what about the time it takes to build these benchmarks? I don't want to underestimate human cleverness, but I also don't want to underestimate AI cleverness. The AIs are getting really smart, and they're going to crack all of these benchmarks so soon. Even if we spend six months building a benchmark, by the time we're done it's not going to be great, because it's going to be saturated.
>> I mean, I do think this is an argument for developing smaller, bite-sized benchmarks faster. In some ways, you put something out as a trial balloon that you think is towards the harder end of the distribution you'd want your benchmark to cover, and see what happens as you keep filling out the benchmark. If that balloon gets popped, then you say, okay, I need to work on a different project or whatever. But again, that served its purpose. I do think there is some lead-time risk to any benchmark where the fundamental infrastructure will take you six months before you can even have a sample. I'm not so worried about that, because I think any benchmark should start with a manual experiment. You have some software task you want to make a benchmark out of. You just ask Claude Code to do it, see how far it gets, and you get some sense of that. I do think starting benchmarks out that way is good, and something more like agile development of benchmarks would be a good lesson to learn. So I think it's worth updating, but not updating all the way to "benchmarks are impractical now." Because again, to be grounded: as long as there's a task that you might practically want an AI system to do today, and you put in half a day's work eliciting it and it doesn't do it, okay, there's absolutely a benchmark there today.
>> Yeah, I like the agile development point. Maybe historically, because benchmarks have come out of academia, it's been very much that you don't share anything with the world, you work for months until you have this super polished paper, and then you release it. Maybe moving to something a little more gradual, a little more like open-source software development where improvements are continually being made, is promising. Two responses to your lead-time objection. One is that we need to look at what's parallelizable and what's not in the benchmark development process. For the parallelizable things, you would hope you can just throw more resources at them and make them faster that way. Then there will be some non-parallelizable portion. For that part, if the worry is that we as humans are just too slow and AI progress is very fast, well, AI systems are helping with everything, including benchmark development.
This is something we see already in our own benchmark development work. For most technical work that I do, LLMs are a pretty essential tool, and they speed me up a lot.
>> So to make sure I'm understanding: one thing is that the AIs are helping you build the benchmarks faster. The other is the extent to which we can break this down into multiple chunks, where we can just throw more resources at the problem. I want to dig into this second part a bit more, because you guys are the ones building the benchmarks on the ground. I know, Tom, you've recently been working on a benchmark, and my understanding is it's meant to be something like METR's time horizons task set 2.0. Could you say more about that?
>> Yeah. Maybe I won't answer directly but take a step back first. With this question of how we make non-saturated benchmarks, one angle I really like, and have liked for a while, is: are there categories of tasks where you can take the same setup and crank up the difficulty as much as you want, ideally to infinity, though maybe it's also sufficient if you can just crank it up a lot? I've been working on a benchmark that is my instantiation of this idea for the software engineering domain, called Mirror Code. It's called Mirror Code because the AI has to reimplement some existing program and mirror its functionality perfectly.
Maybe a little bit on the setup. These are all programs that have a command-line interface. That can range all the way from simple command-line utilities like dirname or ls up to huge programs that just happen to have a command-line interface, such as interpreters for programming languages, type checkers, etc. We give the AI system the documentation for the original program. We don't give it the source code, but we give it access to a black-box reference implementation, that is, a binary of the original program that it can send inputs to and view the outputs of. If things are underspecified in the documentation, or if it wants to see the exact output format or test a new hypothesis, it can do that as much as it wants against this reference binary. The hope is that you can really scale this across several orders of magnitude in the size of the original program and, hopefully, also in the amount of effort it takes AI or humans to complete the reimplementation task.
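As a rough illustration of the setup just described, here is a minimal Python sketch of how a reimplementation could be compared against a black-box reference binary. The function names (`run_cli`, `mirror_score`), the choice to compare standard output and exit code, and the pass-rate score are all assumptions made for illustration; the conversation does not describe the benchmark's actual harness or scoring.

```python
import subprocess
import sys


def run_cli(command, args, stdin_text="", timeout=10):
    """Run a command-line program and return (exit code, stdout)."""
    result = subprocess.run(
        list(command) + list(args),
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout


def mirror_score(reference_cmd, candidate_cmd, test_inputs):
    """Fraction of inputs on which the candidate matches the reference exactly.

    Each test input is an (args, stdin_text) pair. The reference is treated
    as a black box: only its observable behavior is compared.
    """
    if not test_inputs:
        return 0.0
    matches = 0
    for args, stdin_text in test_inputs:
        expected = run_cli(reference_cmd, args, stdin_text)
        actual = run_cli(candidate_cmd, args, stdin_text)
        if expected == actual:
            matches += 1
    return matches / len(test_inputs)


if __name__ == "__main__":
    # Hypothetical stand-ins for a reference binary and a reimplementation:
    # two tiny Python one-liners that upper-case their standard input.
    reference = [sys.executable, "-c",
                 "import sys; print(sys.stdin.read().upper(), end='')"]
    candidate = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    inputs = [([], "hello\n"), ([], "Mirror Code\n"), ([], "")]
    print(mirror_score(reference, candidate, inputs))
```

Because the comparison only looks at observable behavior, the same loop would apply unchanged whether the reference is a ten-line utility or a large interpreter; only the set of test inputs would need to grow.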
The target programs can range from really trivial ones, 10 or 100 lines in the original, up to 10 million or tens of millions of lines of code, like the kernel or really complicated compiler toolchains. I think there's just a lot of room here for scaling up to the largest software projects in the history of software development.
>> And how far did you scale it in practice?
>> Yeah, so we're still figuring out exactly what we'll release. What we definitely have so far are a couple of programs that are roughly 100,000 lines of code in the original implementation, not counting dependencies. An example is Pkl, a new programming language from Apple that came out in 2024. In our experiments so far, the best AI systems, with something like hundreds of millions or a billion tokens over the course of a run, are not yet able to complete these very hardest tasks, but they can pretty reliably do everything up to that level of difficulty. As of recording this podcast, I feel very uncertain about whether, with more tokens, they would be able to do everything. My current best guess is that yes, they would be able to do everything up to the 100,000-line or so size.
With this benchmark, I did originally envision it as: okay, this is going to be a really hard benchmark for AI systems. And we created a lot of tasks in the early phase of the project that are now saturated. It certainly shows that even when you think you're setting the bar high enough, accounting for how much progress AI will make, you might still be underestimating it. For very precisely specified tasks, the AI really knows absolutely everything the program has to do: it has to output exactly this string on this kind of input, etc. AI systems can just keep going at it for many, many times the size of their context window, using compaction. And because the task is sufficiently precisely specified, they know where they're at in terms of their progress, and they complete even these very impressive tasks that we would guess represent several weeks of human work.
This still has a bit of room to go, scaling to the biggest human software projects ever, to help us answer: if we tell an AI system very precisely what to do, can it do anything in software engineering?
>> Let me make sure I'm conceptualizing this correctly. This is supposed to be a bunch of, like you're saying, multi-week tasks, with hundreds of thousands of lines of code. These are things we were thinking were going to be really, really hard for the AIs, but it seems like before we've even released the benchmark, AIs are already able to do a huge chunk of these, as long as you're using... what was the token budget?
>> Just to be clear, these time estimates for how long it would take a human to do the task are guesses. We don't have data on this. The multiple weeks is my personal guess. The hardest task that AI can definitely do in Mirror Code is implementing the CommonMark spec, which is a formalization of Markdown that tells you exactly how to convert any Markdown to HTML. The reference implementation for that is about 16,000 lines of C. My personal guess, which is extremely speculative, is that this would take an experienced software engineer who's completely unassisted by AI multiple weeks to reimplement.
>> I see. But then it's still the case that you could invest in building the month-long versions of this, or maybe the year-long ones, which are the ideal things to do in the future. Do you think there's still plenty of room to keep scaling this up?
>> Well, I don't want to make strong predictions about whether AI will be able to do it or not. Maybe I'm dodging the main question you want to ask with this podcast a little bit, but that's not really my main concern. I think this is interesting just because it lets us describe AI capabilities on precisely specified software tasks across these orders of magnitude of difficulty. It's great to know whether AI can do that or not.
>> Yeah, you sort of...
>> Care a bit less about whether it will be saturated by a certain date. I agree it's relevant, because people want to be able to keep tracking AI progress, but I don't feel very confident making predictions about that. What I can say is that Nicholas Carlini stopped the Anthropic C compiler experiment based on, my impression is, pretty much his gut feeling: it's gotten up to here, and it now seems to be stagnating and introducing bugs when it tries to add optimizations. So he decided to stop it there. I don't really know what his criteria were. Or maybe he just wanted to spend up to $20,000 of compute and didn't want to go further.
So it's clearly the case that Carlini could have said: no, the task is to compile all these projects and have the resulting code be as efficient as GCC's. I feel torn between two inclinations.
One is that it would just seem so crazy for AI to be able to rebuild the largest software engineering projects ever from the ground up, representing many years of work by hundreds of people. That still feels intuitively shocking on some level. But I have obviously also updated on the results: it can do these really impressive things on CommonMark in our experiments, it can make substantial progress on, although not fully solve, our hardest task within a billion tokens, and it can do Carlini's compiler. So between these two pulls, I end up just being very uncertain.
>> Isn't this a win for benchmarking? Or would your steelman pessimist claim that this is a problem?
>> Sorry, what exactly is a problem?
>> The state of Mirror Code upon release: AI systems having perhaps made more progress on it than we would have predicted when Tom began working on it, however many months ago.
>> I would have thought, naively, that they would see this as evidence that it's actually just really hard to make these kinds of new tasks. But then it depends on how far we push things and what the costs and benefits are, which is what we were saying is the thing that matters. I'm curious for both of your takes, in the case of Mirror Code and in the case of FrontierMath Open Problems. This relates back to what Tom said earlier: whether it makes sense to build these benchmarks, and whether you're going to have trouble continuing to build them as they get saturated, depends on whether AI can speed up the benchmark-building process and on how much you can parallelize things. So on these two dimensions: how much have you found AI to be helpful for speeding things up when you're building benchmarks? And to what extent is it the kind of thing where you can just absorb more resources?
>> Yeah, so on absorbing more resources, Mirror Code could have benefited from a lot more full-time software engineers. I was basically the main person with a lot of engineering experience on the project, although I certainly had some help from collaborators. I definitely feel that, both for adding target programs to the benchmark and for setting up the infrastructure, just having three engineers on it would have sped it up a lot. Obviously this is from a low base. If you have a 20-person team at an Anthropic, can you still scale that up to 50 or 100 people and get similar speedups?
I feel more uncertain about that. And then there's just adding more samples to a benchmark. One would hope that this is inherently pretty parallelizable.
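The parallelizable versus non-parallelizable distinction in this exchange can be sketched with Amdahl's law, which bounds the speedup from adding workers when some fraction of the work is inherently serial. The 20% serial fraction below is a made-up illustrative number, not an estimate from the conversation, and the function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(serial_fraction, workers):
    """Amdahl's law: speedup with `workers` when `serial_fraction` cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed for illustration: 20% of benchmark development is serial
    # (for example, a single reviewer checking every contribution).
    for workers in (1, 3, 20, 100):
        print(workers, round(amdahl_speedup(0.2, workers), 2))
```

Under this assumed split, the speedup can never exceed 1 / 0.2 = 5x however many people are added. That is one way to read the intuition that going from one engineer to three helps a lot, while going from 20 to 100 is much less certain to help.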
>> I'm curious about the AI speedup one.
>> Yeah, AI speedup. We all know from METR's research that people seem to be pretty bad at estimating this, and I myself feel very uncertain. But gun to my head, if you really forced me to pick a number, I would say a 2x speedup.
>> I suppose I'd give similar answers here for FrontierMath Open Problems. The problem contribution is embarrassingly parallelizable, limited only by the mathematicians. I shouldn't exactly say that: I review all the problems, so that's a limiting factor. But for the most part it holds. Then each problem contributor develops their own verification program, so we have more of a diversity of AI speedup. Some of them certainly used AI, but I believe the speedup is moderate there. The bottleneck is more in having the idea for the problem. And I don't think AI systems are so good at finding problems that meet our admittedly somewhat unnatural constraints of being unsolved math problems of a certain degree of interestingness with solutions that happen to be verifiable.
>> So, we've just covered a bunch of things about whether benchmarks are doomed to be saturated as we try to build them out, because AI progresses so fast. But there's another way in which benchmarks could be doomed, at least as I understand it: no matter how much effort you put into building them, benchmarks are just not going to be able to capture the things that we care about. The kind of example here would be GPQA Diamond.
People often say it's PhD-level science questions: if you can do GPQA Diamond, then you're going to be able to do PhD-level science. Somewhere along the line, the logic breaks down. You can do GPQA Diamond, but then maybe you can't do all of PhD-level science. What is wrong with this line of argument? Is it wrong? Do we think that AI benchmarks are doomed in the sense of not being able to capture these real-world impacts?
>> I mean, I think the argument might be a little overstated in the snapshot you gave. I'm pretty sure that models that did well on GPQA Diamond do indeed generalize to the task of answering questions qualitatively similar to those in GPQA Diamond. One lesson to learn from this is to make sure that what you say is: if an AI model can solve this benchmark, then it can generally do tasks like the tasks in this benchmark. You'll never go wrong that way, short of abject cheating, like training on the test set. You won't go wrong by saying: what this means is that if I give it a self-contained grad-level science problem, even one where you need to be an expert in the domain to solve it, as was verified for GPQA, then it'll solve that. You just leave the listener to their own devices to generalize. How much will that help someone working in science?
What sort of uplift will that give to a non-expert, say a biologist doing a chemistry problem outside their comfort zone? The benchmark was never going to tell you that, because that's not what the benchmark was about. Incidentally, I would say we seem to be in a period where you don't even get in that much trouble for generalizing a little beyond the letter of the benchmark task. By which I mean coding agents seem genuinely useful, even if many of the tasks we see are not obviously in distribution for benchmark tasks. Some of this is contingent: it's happening only because the AI companies are perhaps, behind the scenes, shoving a lot more tasks than we see into the training distribution. But still, again, short of cheating, you should expect benchmark generalization. Machine learning works to generalize within the training distribution. That's fine. So I think what this means is that we should be very careful about extrapolating from benchmarks, but we should also be very thoughtful and put a lot of effort into putting the benchmark pin right in an important area, an area that tells us something we inherently care about. And I think both of the benchmarks we've talked about that Epoch has been busy developing, Mirror Code and FrontierMath Open Problems, meet that spec to a clear degree.
Mirror Code is just this: if I have a really clear, precisely specified test suite, or at least a spec, then I can expect AI systems to develop software of that nature, at least up to a certain degree of complexity, which Mirror Code helps you understand. I don't think it's a stretch to say that's inherently of interest to someone who might be using the system for practical purposes, deciding whether to fire all of their software engineers, or even doing research on a software intelligence explosion: what sort of tasks go into AI research, and how many of them are tasks like this? So this adds clarity in very helpful, practical ways. So too with FrontierMath Open Problems, even more so. There's no generalization required, at least for each individual problem: it's something some mathematician would personally care about seeing solved. If you've devised your benchmark well, you shouldn't care about generalizing too far beyond the benchmark, because the benchmark itself is from a distribution you genuinely care about.
>> One thing I'm a little unsure about, though: okay, so this all sounds good. We can be pretty confident in the claim that if the AI does the benchmark tasks, then it's going to be able to do very similar tasks.
>> Yeah.
>> But then what exactly counts as very similar? In practice, people often do want to generalize these things further. And although we say we should be careful about generalizing further, it's very hard to say exactly how far is too far. One example here is GDPval. I think in their paper they're explicitly motivating it, in the first few paragraphs, as: we wanted this to be something like a leading indicator for a lot of automation. And I guess unfortunately it wasn't successful at that. They probably spent millions of dollars building this thing, and it doesn't seem to fully reflect what we've been seeing in, say, productivity statistics and so on.
>> Well, I think they fell prey to a pun in the name, and it is catchy. GDPval. It's great. I think you just have to look at the tasks and say: maybe you have a motte and bailey, but in the good sense. You have a core goal and a stretch goal, say, where the core goal for GDPval, for example, is that saturation of this benchmark should be evidence that AI systems can do self-contained tasks drawn from a wide range of digital work.
And I mean to emphasize self-contained quite a bit, because these tasks are very self-contained. You can do web search, but apart from that you're given the documents you need and you're given your task, and you output basically a document, often just a text file.
So it's quite self-contained compared to the actual work environments that humans face. The core goal is just: can you usefully offload tasks like this to it? It's extremely consistent with my experience over the last year that, for offloading tasks of a complexity like less than a day's worth of work for me, putting together a written report on some topic that requires expertise, they've gotten a lot better at that. Of course they have. Now, for automation, I think it would just have been foolish to expect that this would mean automation. Florian Brand, who worked on the same report on GDPval, had a great analogy. He said the self-contained nature of these tasks is somewhat analogous to the self-contained nature of bug fixing or small feature addition in software engineering. AI systems currently have not automated software engineering as a whole profession, but they have transformed the workflow: you now spend much more of your time delegating and managing than writing. So too, saturation on APEX-Agents or GDPval or RLI would mean that if you're a knowledge worker in these other domains, you too could see your daily workflow transformed. But GDPval, anyway, is just not targeting automation enough for you to expect it to generalize there.
>> Yeah, I think that makes a lot of sense.
And one thing this maybe suggests is that there's a lot of value in digging into the details of what a benchmark actually tells us, because it's very easy to go "oh, GDPval" and then "GDP". But actually, no, we need to look into what exactly the tasks are. And as you were saying, the results seem to generalize better if you look at what those specific tasks are rather than at "GDP" or whatever.
>> Sure. I certainly agree with you that this sort of effect that Anson was describing doesn't mean that benchmarks are doomed. But I have a slightly different perspective, in the sense that this slogan of a benchmark-reality gap does resonate with me a bit more. If you had told me in 2020 that AI would solve GPQA-style questions, which are Google-proof, so even with arbitrary web access you can't just find the solution written somewhere, and you have to not only combine a bunch of knowledge but also do a bit of reasoning about pretty advanced science topics, I would have predicted much, much bigger effects of AI on the economy and society than we in fact saw when AIs were at, say, 50% on GPQA. And I think this is the case for many people. To some extent this is like, okay, I should take the L. I was naive in how I was thinking about benchmarking, and maybe some people were much wiser about it. But it does ring true to me that there seems to be a systematic way in which we try to design a benchmark that we hope will capture this broader thing, and then we see AI do great at it, but the real-world usefulness or impact isn't quite there. For myself, I want to take into account the track record of how I've been surprised by this. The sense in which I feel like it goes beyond just "oh well, you were wrong and naive about the benchmark at the start" is that maybe there's just something inherently very difficult about squeezing all of the complexity of real-world, long-horizon tasks into something benchmarkable, and we're going to keep systematically bumping against this even as we try to make benchmarks better and more realistic. So yeah, I do feel like there's something to be aware of here. But in terms of, does this doom benchmarks?
I mean, no, because it still seems like, even if we were wrong about what GPQA meant, we can try to take the lessons from that and design the next eval better. Basically, even if we continue to be a bit wrong about this, hopefully benchmarks are still useful. >> Two responses.
One is more leaning into it: yes, I think people do expect more from benchmarks than they ever should have.
The one AI paper I wrote, long before Epoch, was a critique of benchmarks at the time, and of people not investing in making sure benchmarks matched the distributions they wanted even a little. This was 2019, and you've got to know the situation was much worse back then. It really wasn't clear that benchmarks correlated with anything. So I think there's some zen about what you should expect from benchmarks, and yet I think they're better than they've ever been. The lesson learned over time, I think, was that a benchmark shouldn't just be something random that AI systems can't do today.
But if they could do it, I'm not sure I'd feel informed about anything other than this random niche. I think a lot of benchmarks used to look like that, and sometimes they do still look like that today, when people find quirks, the R's in "strawberry" or something. You can make benchmarks out of that. But these are more hobby efforts on the side, I think, and the big benchmarks people pay attention to have been centered on more meaningful distributions. And I think this does point to the sort of progress you're saying. And if you couple that with modesty in inferring from benchmark results what impacts you'd expect on the world, then you can be very happy about benchmarking.
Join me in happiness. The invitation's open. It's great here. But the other thing I would say is, we have seen a lot of impact of AI on the world. We have this massive marshalling of societal resources to make more systems.
The signals that people needed to see to choose to invest a lot of money, including now very meaningfully growing revenue from consumers, not just investors, were strong enough that people did say this is a big deal and acted like this is a big deal.
In some ways, the benchmark progress did indicate real impact on the world. And even though we weren't necessarily exactly right about the shape or the immediacy of what human-level performance on GPQA Diamond meant, if you zoom out a little, maybe we were right: this is a big deal. Or go back further in benchmarking history, not that much further, to Winograd schemas, the ambiguous pronoun resolution tasks. This was included in, I forget where it was from, but some list of "AGI will be here when five things are true", and one of them was a sufficient score on a more or less completely saturated Winograd schema test. And I think what it was trying to get at was: look, this is a tricky task that requires world knowledge and fluency in natural language, and that's got to be a big deal if that happens. And it wasn't that immediately, when you got systems just blowing this benchmark out of the water, the world was transformed literally overnight. But I think it was a big deal. It's a big deal that we have AI systems that can do well on language tasks and can very flexibly use human language. This was one big blocker to AI being useful, and that blocker is mostly gone.
>> But if AI had stopped progressing at the level where it did really well on the Winograd schema benchmark, I feel like we wouldn't have seen that much impact.
>> I'm not sure that's totally true.
There's a version of it, maybe a narrow version, that's true. But if you'll give me a little rope: I think if AI progress had plateaued at GPT-4 levels, but not reasoning-model levels, there was already a lot of economic transformation, or economic value, baked in that was going to take a while to figure out how to use everywhere. Linguistic flexibility, even if you don't have super precise reasoning, is a tech-of-the-decade kind of thing. That's not bad. And I think Winograd schemas being saturated probably was a meaningful sign that you were there. And if you had plateaued, you still would have had something. You would be like, wow, it used to be I couldn't really talk to a computer, and now I can kind of talk to a computer, and that's meaningful. And I think the benchmarks would have played their role in helping you at least dismiss extremely reductive cases: no, we used to have no idea how to solve these puzzles, and it seems plausible you need language skills to do it, and you can. So now, you know, impact ahoy. >> So one thing I wanted to make sure I'm understanding correctly, for both of you: do you both think that an "AGI bench" kind of exists? You know, you have this benchmark, and then you just train on it and hill-climb it. You did it. You saturated it. Now you've got AGI for sure.
>> Yeah. So I actually don't find the term AGI very useful to begin with, because of this point that many, many people have made, and I'm obviously not inventing it: the capabilities of computers even before AI, and now of AI systems, are heterogeneous in terms of how good they are at different things. And it seems like we could see huge impacts of AI on society and the economy before we have this generality where it can do all or almost all of the things that humans can do. I just don't think this AGI label is that useful. Instead we should be saying, okay, what are the capabilities that we think are especially relevant and important, and let's try to build benchmarks for those.
>> I do think there's a spirit of your question which is fine. You could have a breadth of benchmarks and concatenate them and say, here's my mega benchmark.
And do I think that is possible to build? I think it'd be very expensive.
We're talking a lot of tasks. I think there's sort of this magic ingredient sitting behind these things, which is something like generalization. Will we get a system where doing well on one task is strong evidence that it will be able to do well on another task, the way humans sort of have something like this quality? So I think this generalization question is very interesting: benchmarks that could help you identify general reasoning.
There have been attempts at this. This is what ARC-AGI is supposed to be all about. You know, AGI arrives with ARC-AGI-6 or whatever. But I think that's actually sort of a plausible view. They clearly haven't pushed this to the human extremes. But there are other approaches you could take to try to measure this kind of out-of-distribution generalization, in-context learning kind of thing. One idea I've heard discussed: you get the latest video game that's popular on Steam and you see if an AI system can play it well, and that gives you some sense of, okay, it generalized. >> I guess you might worry that even this concept of generalization is, once you look under the hood, this super weird multi-dimensional thing, and we can't really conclude that much from performance on this random new video game on Steam. Maybe it just doesn't tell us that much about, if I bring in an AI as a new temp worker for this kind of low-level administrative task, how well it will do on that. I would still worry that you end up measuring, well, can it generalize within the specific subdomain, or at this type of task.
>> I guess what I'd say is, I think there's room for somewhat cautious optimism here, because we have in fact seen sparks of AGI. I do think that's a fair characterization: we have seen some degree of generalization. It's unclear how much of that was from shoving things into the training distribution. That's a big question. But you can maybe hope to detect something like this. Say we have a benchmark for boring temp work that we keep hidden, and we have a benchmark for video games or whatever, and we see if progress is made at the same time on both of them. If it was, I would say we're seeing an interesting thing emerging. But it's also, of course, hard to know whether that just happened because someone in the lab happened to buy an RL environment that looks a lot like one of your hidden benchmarks. It's hard to be that original. So I do think this is something of a question. But again, these lists of things that will herald AGI, I don't think they have been terribly off base. I actually think we've learned some lessons. What are things that have not heralded AGI? I think that would include chess: Deep Blue beating Kasparov was not a moment of general AGI. However, the techniques developed there, there's still a little bit of, you know, it was correlated with the same thing that society was trying to do for a while. But fine, call that a loss. I think a win is that the sort of broadish hodgepodge of tasks shows some general capability. And then, I don't know, maybe this generalization is still something benchmarks should be paying attention to, over and above any particular task.
>> Okay, so given all of these things, it sounds like, in terms of saturation, you guys don't think that benchmarks are necessarily doomed. In the case of how much they can generalize, there are a lot of interesting questions, and I guess it's a bit more complicated. I think there is still a big looming question here, which is: where do we go next with benchmarks? What exactly do benchmarks look like in the future?
>> Yeah. So one kind of categorization I find useful is in terms of how benchmarks are scored. The first category is completely machine checkable: you have an algorithm, not based on language models, that just checks correctness. This is basically all traditional benchmarks. The second is some form of LLM as a judge.
And the third category is just human judging, non-automated judging: you just have humans score the AI outputs. So yeah, I'm interested in people figuring out how to do the second category well. And then human grading, I think, historically would have been ludicrous, because human time is just way too costly, and when we had benchmarks with a thousand samples and so on it just wouldn't have been feasible. Now we're seeing much smaller benchmarks, or actually even just demos like Anthropic's C compiler, where there's a single output and running the benchmark might cost in the tens of thousands of dollars.
Maybe there's a form of human rating that could make sense there. There's so much more to explore with these alternative scoring methods. There's a lot more juice to be had even in the completely algorithmically scorable category. >> Yeah, it's funny how I almost feel like we've got two poles here that are both very promising, and then this tempting, but I'm not sure how much I believe in it, middle ground of relying on fuzzy qualitative AI judgment for assessing AI outputs. We've rarely had benchmarks outside of math, science, and coding. There are some attempts at creative writing benchmarks, and they're good. I mean no shade, but they're just not that deep or compelling.
And outside of that, it's like, it's not only... >> There are things that try law. I mean, I'm really interested in white-collar work that isn't STEM, >> but I haven't had the time to look at the literature.
>> We can talk a little about some of these. I think it's interesting. Recently, Epoch wrote a report reviewing three benchmarks that try to target economically valuable work outside of coding, math, and science. And I think there are some interesting entries there. It's also interesting to look at how they're graded, because none of them are in this first category you were describing. One, called APEX-Agents, uses detailed rubrics, and it targets tasks in corporate law, management consulting, and corporate finance and investment banking.
That's the second category.
And they have detailed rubrics, which an LLM then assesses. It's things like: okay, did this customer's data breach, described in these documents, violate GDPR, which you have a copy of over here, and here's the contract the customer had with their client. And the rubric is saying you lose a point if you don't say how clause 103C or whatever was violated or was not violated, and for what. So it's very granular. I believe that this is doable.
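As a rough illustration of the rubric idea described here, the following Python sketch scores an answer against a list of weighted rubric items. It is not how APEX-Agents is implemented: the names `RubricItem`, `score_with_rubric`, and `keyword_judge` are hypothetical, and the judge is a trivial keyword check standing in for the LLM call a real benchmark would make.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One granular criterion. All fields are illustrative assumptions."""
    description: str  # what the grader should look for
    points: int       # points awarded when the criterion is met
    keyword: str      # used only by the toy judge below


def score_with_rubric(
    answer: str,
    rubric: List[RubricItem],
    judge: Callable[[str, RubricItem], bool],
) -> float:
    """Return the fraction of rubric points the answer earns."""
    total = sum(item.points for item in rubric)
    earned = sum(item.points for item in rubric if judge(answer, item))
    return earned / total if total else 0.0


def keyword_judge(answer: str, item: RubricItem) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive keyword match."""
    return item.keyword.lower() in answer.lower()


# Hypothetical rubric echoing the example in the discussion.
rubric = [
    RubricItem("States whether clause 103C was violated", 2, "clause 103C"),
    RubricItem("Addresses whether the breach violated GDPR", 1, "GDPR"),
]

partial = score_with_rubric("The breach violated GDPR.", rubric, keyword_judge)
full = score_with_rubric(
    "The breach violated GDPR and clause 103C of the contract.",
    rubric,
    keyword_judge,
)
```

Keeping the judge as a swappable function means the same rubric could be assessed by an LLM or by a human grader without changing the scoring arithmetic.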
The other two benchmarks we looked at, GDPval from OpenAI and Remote Labor Index from the Scale AI and CAIS collaboration, are just graded by humans. They just bit the bullet, and I think this is great. And it's interesting: GDPval is close to saturated, but Remote Labor Index is definitely not, no matter what. >> How many tasks are in Remote Labor Index? And do you know how much they pay the graders, and all this? >> Yeah, all good details. I don't remember the exact task number.
It's in our report. But on the order of 100, not 10. And they don't give us much on how much they pay the graders. Reasonable. The graders are given the AI output and the spec from the customer. I should say these are real tasks taken from the gig work platform Upwork. And they give the reference output, which was accepted by the customer, and they say: if this is what the customer was looking for, would this other output probably satisfy them? What I take this to mean is, is the AI output even in the ballpark of the human output? That's a lower bound on the quality of judgment. Most of these things, and to be clear I thought this was an innovative take, are multimedia outputs, so it's kind of a visual gist judgment, and right now at least the failures are just dramatic. The first author of the paper was describing a test case they have: we asked you to draw the Superman logo in Inkscape and you submitted an unrecognizable blob. That's the level that models are at here. I think fine-grained judgments will get harder, but they also, I don't think, invested that much in the human rating. And I sort of believe it. I believe this is a perfectly reasonable thing. And the benchmark at least is good enough to tell us the binary: can AI at least come close to doing this sort of task? Are the deficiencies more fine-grained, or are they not even close?
And the answer there is they're not even close. And like you were saying, this is a new form for benchmarks, primarily.
I'll mention one other form that I think is a good example going forward, and which is incidental. People paid a lot of attention to it, but I don't think they appreciated it as a human-judged benchmark: the International Math Olympiad. For those who don't know, this is a contest where some very smart high school kids solve math problems by writing proofs, arguments, and there's a process, very well developed over the decades, for human judges to score these purported solutions from the students. They're all double judged. The judges are given very extensive rubrics ahead of time, but they also evolve those rubrics during the scoring process as new things come up, and there's an argument back and forth where the judges have to present their assessment. It's very involved, very labor intensive. And Google's solutions were submitted anonymously, so the IMO gold claim from Google is properly judged by the same process used for judging humans. I think it's an amazing benchmark, and no one batted an eye at this. This was a really good methodological benchmarking win that hadn't really been done before, and it was just done by hooking into existing human infrastructure for judging work output. And I think for category 3 this is something to be emulated. >> Absolutely. Just to give you the opportunity to hammer home your point, what are some other examples of using these kinds of existing structures? >> I'm worried I'm not remembering the one that you maybe liked that I said when we were chatting earlier, >> but one that I can imagine is anything where there is currently a human contest for submitting something like this. So this isn't the one I said earlier, but I was just thinking there are various awards for fiction.
So, if you want to plug into that, have your AI system write a novel and submit it. There are ethical concerns, and ways of trying to make sure we're not flooding the inboxes of editors and whatnot. But a very reasonable benchmark for creative writing, in my opinion, would be submitting a short story to a short story writing contest and having it graded or voted on the same way. I think this is a very reasonable benchmark.
>> Yeah, that one's great. I also think just peer review, academic review of papers. Especially as AI becomes more important and gets used in academia a lot, you should eventually be able to persuade reviewers to be willing to spend 5 or 10% of their reviewing time or something evaluating these AI outputs. Maybe AI labs pay them a lot for that time. This seems pretty feasible, and a way that you just hook into this infrastructure that applies any time a paper is reviewed. So it's pretty much any area of science. >> And unfortunately there is something of a refereeing crisis, meaning a labor shortage, in certain academic fields, but this could be a synergistic opportunity to pay the money to solve that problem and then have some gated process by which autonomously AI-authored papers are submitted to NeurIPS or whatever, and the benchmark is to win best paper, something like that. Get accepted, and then get whatever accolades. I think this is pretty good. And one thing, stepping back, that's funny about benchmarking is that it used to be this almost purely academic exercise, done right alongside the people who were developing the models. Now there are companies with annual budgets in the tens of billions of dollars, and growing, for AI system development. And surely they're not hanging on every word of benchmarks made by little shops like us. They have highly resourced internal benchmarking suites, and they are surely trying to evaluate their systems. I imagine part of what they're doing, with the help of the data collection companies, is trying to extract just such cases from real-world internal corporate use.
So even if only some of these processes are legible to us, as outsiders to the vast majority of industries, such as peer review or the existence of contests for public-facing consumer output, there are lots of cases. If you were in the guts of an insurance company, you'd have all sorts of, oh, here's the step in the process where the senior claims adjuster signs off on a report authored by a junior claims adjuster, and that's their whole darn job, to do this. And so I'm sure someone somewhere is trying to collect data to replicate that, and maybe even do human trials with some regularity: okay, we did our messy RL-environment approximation of this, that's a shoddy benchmark, so to speak, purely internal, and we trained on that data, and now we have a validation set and it looks like it's doing well, and now we're going to do some taste testing, which they wouldn't necessarily call a benchmark, but it's a benchmark whether they call it that or not, of having a real senior insurance claims adjuster take a look at this report that the AI system tried to generate. And to be clear, that's exactly what GDPval is sort of trying to externalize and do, but it's still these relatively self-contained tasks. And I think it's just a matter of expanding the scope beyond tasks that would take a human less than a day: okay, if that's saturated, let's go to week-long projects and get benchmarking in messy real-world contexts. I think that's just where benchmarks will go. These might look more like case studies, and I think this is fine. You can have standard-method case studies. I think we should remember that every 18 months or so we see a big spike in capabilities.
If we're really in that regime, we shouldn't feel too bad about doing a "can AI do this thing it obviously can't do" kind of contest, and having that every four months or something like that. You never know when the next spike is going to come. So set a baseline of AI not being able to do these things.
Hone your methodology so you'll be able to say when a big spike has happened.
Then, I think, this is a very fruitful mode for benchmarking to be in, and if anything we'll have less of a gap between the things we really care about and the raw benchmarking numbers. And yes, it'll be more expensive, and it won't have some of the nice features that come with current cheapness. Right now, if there's an interesting fast model out from an upstart Chinese company, running it on the benchmark is very easy to do, and it won't be so easy to do that sort of thing. But this is, I think, an acceptable price to pay: the scores will move slower but will come regularly. I think this will be very informative, and we've hardly explored this at all and have plenty to squeeze there.
>> Yeah. So if I think about what's next after some version of Mirror Code is released, a few things seem kind of interesting. One is staying within the Mirror Code idea of easily scorable software engineering tasks. Seeing as AIs are pretty good at these Mirror Code style reimplementations, can we see whether this generalizes if you put AI in something more akin to the situations that humans are in? So you would be pushing the frontier with access to any codebase that you want and any existing tools. The examples there would be: can you speed up some widely used software where speed is a real bottleneck and there's already been a substantial amount of effort on optimizing it? One example that comes to mind here is Rust compilation. People really like Rust as a language but complain a lot about the compilation being slow, because it fundamentally just has to do a lot more, with borrow checking and other things, than other languages. That feels like a natural next step: AI is really good at precisely specified software tasks, so can we get it to a point where it would produce an artifact that would actually be useful in the real world? That's one angle. Something else I'm interested in: a lot of people are interested in the effect of AI on speeding up AI R&D. And I'm quite curious about the question of how many of those tasks are Mirror Code style, where there's a pretty clear goal or metric, and how well AI does with those. One thing I'm actually a little bit confused about, and should look into more, is that RE-Bench seems to have this property, and it seems actually quite similar to Mirror Code in that you know precisely what to do and you're able to get feedback as you go.
My impression is it's not the case that every single RE-Bench task is at superhuman levels, like a time horizon of more than eight hours. So it's about understanding more whether that's the case, and if so why, and then potentially seeing whether there are benchmarks we can design that really target this AI R&D thing. >> One I'll throw in is, I think, the sort of magic ingredient of out-of-distribution generalization. I think that's a topic benchmarks can take a crack at. And I think we've done a little bit of work with this. We have a chess puzzles benchmark that shows some interesting patterns in how models perform, or sort of make halting progress, where presumably labs care less about optimizing for this. But if you had a general-purpose reasoner that could solve super hard math problems, it should be able to work through a chess puzzle. You want this to be moderately secret and not too high profile, so that the labs don't focus too much on it.
Like, ARC-AGI became a bit of that, so to speak. As for specific ideas, one that I happen to like, who knows if we'll make anything of it, is trying to push more into physical-world tasks. There was this lovely little blog post of someone trying to get Claude to teach him how to make coffee just by taking photos of where he was and asking it for instructions.
And I think that's interesting, because you can imagine all sorts of impacts on the world if LLMs are good as brains, perhaps for robots, but even just for humans navigating the world. They can provide all sorts of skill uplift if they can tell you how to, whatever, replace this machine part in your car or in a factory. So I think we can start to look more broadly at what the bottlenecks are to all sorts of economic impacts, and there are probably some of what I'd call regular old benchmarks that can fit reasonably into that framework.
>> How do you envision the benchmark building process in a couple of years, when you have lots of AIs that are helping you speed up the process itself?
What do you think that looks like? >> Have I really drunk the Kool-Aid if I don't have an off-the-cuff answer to, you know, what I'll do with all my agents?
>> The software engineering style of this seems maybe a little more concrete to imagine.
>> It seems like it's an abstraction ladder, interacting with a coding agent.
At the bottom, you might say: in this particular function, factor out this particular thing into a helper function.
And that's basically like typing it yourself.
It's so specific that it might just save a little bit of time versus doing it manually. And actually, sometimes I do this for an instrumental reason, which is that then the AI has in context that this has just been done, whereas it's a little bit more annoying to get it into the context if you do it manually. So that's at the bottom, and then you can go up and up this abstraction ladder, where the instructions you give the AI are more and more high level. I don't think I really have a more concrete picture or prediction that's useful.
>> Beyond that, I think there is a bottleneck in some benchmark design around taste in tasks. I do feel like it would be a big unlock if AI systems had some of this taste, which I feel like they don't do a great job with today. For example, if I say, give me examples of problems that fit the rubric for open problems, I haven't been impressed with what they turn up, and it's a little bit of an unusual... >> Yeah, they don't have great taste for coming up with Mirror Code target programs, but there's just the fact that they know every single thing in computer science or in computing. So you can just ask it to keep generating more ideas, and then you can pick based on your taste. And even during the development of this benchmark, with Opus 4.5 and 4.6, I feel like they're already better at coming up with suggestions that meet more of the criteria.
>> I mean, in case it's not obvious, a couple of steps up the abstraction ladder would be: the human researcher sets up the framework for the benchmark, with plenty of assistance on coding whatever infrastructure is necessary, and then you come to the part where you have to fill out all the tasks. Often you sort of start with that, to make sure there are some tasks, but you're at a point where you want to get 10 or 100 of these things, and there's some work to do to even come up with what they should be. And you ask an AI system, and you can sort of trust that it will mostly come up with good ideas that are worth your time to engage with in quality control. That's instead of a couple of steps down, where it's like, I came up with the task and now I'm going to get a lot of help from it to implement, or I see what's wrong with the current version of this and I'm going to give it some feedback and have it take a turn on the code or whatever. For the chess puzzles benchmark, Gemini 3 Pro wrote all the code for it, but it was me looking at the output and saying, "Ah, these chess puzzles are lacking this feature, or our search for chess puzzles of the characteristic we want is not turning much up. I think XYZ is wrong.
What do you suggest?" It's helpful and productive, but human-coupled. As for building the whole thing from scratch, when do we just say, "AI, I would like a benchmark in this domain"? I don't know. Presumably it's on the path out there, but that does feel a couple of turns away. Call it 6 months to 3 years, conservatively, but I skew towards the later end of that. >> I think this is interesting, and also a little funny: we need a benchmark for benchmark taste, so you can see if the AIs can themselves make the benchmarks. >> Yeah, I do think some of our benchmarks have elements of taste baked into them, in these don't-expect-it-to-generalize-too-well kinds of ways, but maybe they're useful angles on it. Even in Mirror Code, for some of the more complex programs, you need, call it architectural taste, to make it not fall apart. And we'll see if the models have that for the harder ones. Or for some of the open problems, you might need what a human would call taste for the harder problems. We'll see in hindsight. I don't know.
>> Okay, cool. I think this is a good place to end. So, thank you both for coming on the podcast and it was a good chat.
>> Thanks. Thanks, Anson. Thanks, Tom.