LLM benchmark evaluations face significant challenges including implementation differences, parameter effects, and infrastructure variations that can dramatically impact results; for example, harness selection alone can cause 15% score differences equivalent to 6-9 months of model progress, and under-elicitation in benchmarks fails to reveal true model capabilities, making it essential to use models in evaluations to their maximum potential while avoiding reward hacking.
Deep dive
Big Techday 26: LLM benchmarks in the time of agents - Florian Brand, Prime Intellect
When a new LLM is released, I guess many of us often look at the benchmark scores first, and often we ask ourselves: how much does this actually mean, or what do these numbers really represent? So our next speaker, Florian Brand, is a research engineer at Prime Intellect. He works on evaluating LLMs in different domains, and he will give us insights into evaluating these systems, especially in the times of agentic systems, and into what these scores actually mean.
So we will as usual have time for questions after the talk, and if you join remotely, please feel free to put them in the Zoom chat. And with that, please join me in welcoming Florian.
>> Yeah, thanks for having me, and thanks for inviting me. The interesting thing is that at every LLM release, some people look at one of the numbers or one of the headline results and then instantly dismiss it. They say, "All right, I don't really care about all these evaluations, because I am able to measure vibes. I can use a prompt, and the LLM then generates a picture of a pelican, or a ball game with a ball bouncing around, and that gives me so much insight that I instantly know how good this LLM is."
However, that does not really work, and it's not really what evaluations are used for. Evaluations are used to measure, to put a number on a certain capability, and to approximate the usage of said capabilities.
The capabilities we care about changed over time, so I will go through the history and why we changed the things we measured. But in general, I think you can differentiate between things like knowledge and some narrow capabilities, for example tool calling or long-context capabilities. We care about things like coding, especially now that everyone uses Claude Code and Codex.
We care about white collar work, for example accounting or law, or creative writing and role play. For all these things we want to see how well our model performs so we can make an informed decision, and for all these things we need to go through the history to know why we measured what we measured and how we ended up where we are right now.
In the early era of LLMs, so 2020 to 2022, we mostly cared about evaluating the pre-training knowledge. We used some questions which we found on the internet, then took some poor undergrads and let them annotate the data to see whether the answers to these questions are correct, and then we used this to see how good the model is at picking up knowledge from the internet. Examples are MMLU or TruthfulQA, and the questions always look similar. For example, this is a question from MMLU where the model is asked to calculate the GDP per capita over a span of years. These are the things we mostly cared about, and the models got better and better and better.
The general setup for these evaluations was rather easy. We took the question, we threw it into an LLM, we looked at the answer by using a grader, and then we got the final score. And the final score is the thing we compared the models against. So there basically was no way to [ __ ] it up, right?
Well, we got creative with how we do things.
Even in the simplest case, we changed a lot in the initial prompts. We formatted the prompts differently, and that had an impact. Even simple things like adding new lines resulted in different scores. And as we discovered relatively quickly, asking the model to think for some time before it answers the question also impacted the result. So now we have one thing that we can change that directly influences our results, and that makes the comparison between models, or between these evaluation scores, quite hard.
The second component is the LLM. I mean, that's the static thing, so surely that is the thing that everyone agrees on how to use correctly, right? Well, even today we have things like sampling parameters, and they have a huge impact.
So basically, when we call the LLM we can change things like the temperature, and for some reason even today it has a huge impact. Here I used the biggest open Qwen model and just changed the temperature from 1 to 0.6, and it improved the score by 3 percentage points, or about 10% in relative terms. So I got a free performance boost just by changing the temperature to the recommended one. And if you also want this free performance boost for your applications, you should look into things like the generation config for open models, or the README or other documentation.
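As a rough illustration of picking up recommended sampling parameters, here is a minimal sketch. It assumes the model ships a JSON generation config with keys such as `temperature`, `top_p` and `top_k` (the naming used by Hugging Face `generation_config.json` files); the example values and the helper name `recommended_sampling` are made up for illustration.

```python
import json

# Illustrative contents of a generation config file; the values are invented.
EXAMPLE_CONFIG = '{"do_sample": true, "temperature": 0.6, "top_p": 0.95, "top_k": 20}'

# Keys we treat as sampling parameters (an assumption, not an exhaustive list).
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def recommended_sampling(config_text, defaults=None):
    """Overlay the sampling parameters found in the config onto our defaults."""
    config = json.loads(config_text)
    params = dict(defaults or {})
    for key in SAMPLING_KEYS:
        if key in config:
            params[key] = config[key]
    return params


# Our harness default of temperature 1.0 is replaced by the recommended value.
params = recommended_sampling(EXAMPLE_CONFIG, defaults={"temperature": 1.0})
```

The resulting dictionary can then be passed to whatever client or inference engine runs the evaluation, so that the reported score reflects the recommended settings rather than the harness defaults.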
And the last part of our simple pipeline is the grading function, where we use things like regular expressions. Obviously, if we change our regular expression, or our regular expression is bad, it might miss some correct answers and our score might be lower. Or we use a second LLM, which is given the correct solution and is then asked: all right, is this solution from the first LLM the correct one? And obviously the choice of the second LLM also has an impact on the final score, because a better or smarter model is better at differentiating between correct and incorrect. So this also impacts our final score.
But over time we cared less about things like general knowledge, and we wanted to see how good the models are at reasoning. So we cared about the models combining their pre-training knowledge to answer harder and harder questions.
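To make the point about regular expressions concrete, the following sketch grades the same four responses with a strict and a more lenient pattern. The patterns, the responses and the multiple-choice answer format are all invented for this example and are not taken from any particular benchmark.

```python
import re

# Strict grader: only accepts exactly "Answer: X" with a capital letter.
STRICT = re.compile(r"^Answer: ([A-D])$")
# Lenient grader: tolerates case, "is", and parentheses around the letter.
LENIENT = re.compile(r"answer\s*(?:is|:)?\s*\(?([A-D])\b", re.IGNORECASE)

# (model response, gold answer) pairs; the first three are all correct.
RESPONSES = [
    ("Answer: B", "B"),
    ("The answer is (B).", "B"),
    ("answer: b", "B"),
    ("Answer: C", "B"),
]


def grade(pattern, response, gold):
    """Return True if the pattern extracts a letter that equals the gold answer."""
    match = pattern.search(response.strip())
    return match is not None and match.group(1).upper() == gold


def accuracy(pattern):
    """Fraction of responses the given grader counts as correct."""
    return sum(grade(pattern, r, g) for r, g in RESPONSES) / len(RESPONSES)


strict_score = accuracy(STRICT)
lenient_score = accuracy(LENIENT)
```

The model output is identical in both runs; only the grader changes, and the strict pattern drops two of the three correct answers.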
That also meant that we weren't able to use our poor undergrad students anymore. But now we got access to some poor grad students and PhDs, so we just took them to annotate our questions. Examples of this kind of evaluation are GPQA for chemistry and biology, or Humanity's Last Exam, or FrontierMath for math. An example question from Humanity's Last Exam asks what the rarest noble gas on Earth, as a percentage of all terrestrial matter, was in a certain year. And the answer to this question is oganesson.
But some people looked into this question and found out that the answer is not a gas, it's not even noble, and it's not terrestrial. So HLE, which is a benchmark done by experts, or by supposed experts, has a lot of errors. A startup for chemistry looked into all the chemistry and biology questions and found out that one third of the questions are wrong. For FrontierMath, which is like the hardest math benchmark out there, the organization behind it, who also used professors and PhDs, used AI to check all the solutions and found out that one third of their problems likely have some invalid parts or are likely wrong as well. So we are now in a state where these reasoning benchmarks have a high error rate, because the questions might be beyond the limit of what some humans can reliably judge to be the correct solution.
And to go back to our pipeline: because we scaled up the models, they were harder to use. So we have to use an inference engine like vLLM or SGLang, or we just call one of the APIs, and that will be a recurring theme. Of course, the choice of the engine or of your API has an impact on the results. On the left side we see that Kimi K2, which is an open model, was evaluated across providers for precision on tool calls, and some providers scored less than 50%. So basically half the calls were just badly formatted because the providers made mistakes in their implementation of the model. So even if you used an API provider, you just got a worse model because you chose the wrong provider for that API.
And on the right side we see experiments that I've run on gpt-oss, which is one of the most popular models. I ran these experiments months after release through all the providers on OpenRouter, and the thing that we found out, and that shocked me if I'm being honest, is that some providers did not implement the reasoning effort correctly. In this example, if you used gpt-oss through OpenRouter on AWS and you specified that you wanted a high reasoning effort, you got the model on medium reasoning effort. So even if you relied on AWS for this, you just got worse results because the provider [ __ ] up.
All these bugs are fixed nowadays, but it goes to show that relying on an API might not be the be-all and end-all. So we say, all right, I just use my own GPUs, I use vLLM or I use SGLang. But the inference engines also have some mistakes, especially at model release.
So for the Kimi K2 model, they found out that only 20% of all tool calls were executed correctly at the initial release.
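A check of this kind can be sketched as follows. This is only an illustration of measuring how many tool calls are well formed: the tool names, the required arguments and the `{"name": ..., "arguments": ...}` call format are assumptions for this example, not the format any specific provider or benchmark uses.

```python
import json

# Hypothetical tools and the argument names each one requires.
TOOLS = {"web_search": {"query"}, "read_file": {"path"}}


def is_valid_tool_call(raw):
    """A call is valid if it parses as JSON, names a known tool,
    and supplies every required argument for that tool."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in TOOLS or not isinstance(args, dict):
        return False
    return TOOLS[name] <= set(args)


def valid_rate(raw_calls):
    """Fraction of raw tool-call strings that pass validation."""
    return sum(is_valid_tool_call(c) for c in raw_calls) / len(raw_calls)


CALLS = [
    '{"name": "web_search", "arguments": {"query": "kimi k2"}}',  # valid
    '{"name": "web_search", "arguments": {"q": "kimi k2"}}',      # wrong argument name
    '{"name": "read_file", "arguments": {"path": "a.txt"}',       # truncated JSON
    '{"name": "delete_all", "arguments": {}}',                    # unknown tool
]
rate = valid_rate(CALLS)
```

Running the same set of prompts against several providers or engines and comparing this rate is one way to notice that a deployment, rather than the model itself, is producing malformed calls.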
Another area of capabilities we tracked more and more came from using LLMs inside our products, for example in RAG pipelines. We wanted to embed them in our products and have them call our tools, for example web search or our internal search, and for this they obviously had to be able to call tools correctly.
So we needed evaluations to reflect these capabilities.
Examples for tool calling are tau-bench, where we basically have the customer support setup that we care about. We have the LLM that acts as a customer agent, and we have a second LLM which acts as a user, and the customer agent has to call tools to look up the correct things in the database or to resolve issues from the user. Or we have things like IFBench for instruction following, or GraphWalks for these long-context problems.
The good thing about these evaluations is that, because they are so narrow and only care about one very specific capability, we can create them programmatically or synthetically. So basically we can write a shorter or larger Python script to create those evaluations, then give it a human pass to see which examples are valid or interesting, and then use that. And if we can create the evaluations synthetically, we can obviously also create the training environments synthetically.
That's what we did at Prime Intellect with general agent, as an example of these synthetic environments. What we basically did is that we created a tool calling environment with over 4,000 tasks, where the model needs to call tools at an increasing difficulty to solve the problems of a user. And we found out that training on these synthetic environments then improves the scores on tool calling benchmarks, for example tau-bench or BFCL and so on. So we can easily create the training data to solve the evaluations we also easily created.
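The idea of generating tool calling tasks with a script can be sketched like this. It is a toy illustration only and not the environment described in the talk: the shop items, the `get_price` tool and the way difficulty is defined (number of lookups needed) are all hypothetical.

```python
import random

ITEMS = ["laptop", "phone", "monitor", "keyboard", "mouse"]


def make_task(rng, difficulty):
    """Create one task that needs `difficulty` lookups with a hypothetical
    `get_price` tool before the final answer can be computed."""
    chosen = rng.sample(ITEMS, k=difficulty)
    prices = {item: rng.randint(10, 500) for item in chosen}
    return {
        "prompt": "What is the total price of: " + ", ".join(chosen) + "?",
        "database": prices,  # what the tool would return for each item
        "expected_tool_calls": [
            {"name": "get_price", "arguments": {"item": item}} for item in chosen
        ],
        "answer": sum(prices.values()),  # verifiable ground truth
        "difficulty": difficulty,
    }


def make_dataset(n, seed=0):
    """Generate n tasks, cycling through difficulty levels 1 to 5."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    return [make_task(rng, difficulty=1 + i % len(ITEMS)) for i in range(n)]


dataset = make_dataset(10)
```

Because the ground truth is computed by the generator, the same script can produce both held-out evaluation tasks and training tasks, which is what makes these narrow capabilities comparatively cheap to measure and to train on.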
These days the thing we mostly, or maybe solely, care about is automating work or helping us at work, which obviously means white collar work. The poster child for this is coding, as everyone is using Claude Code and Codex extensively. People write code less and less and use these models more and more. But also things like accounting, law, and consulting are interesting, to see how well models are able to do these tasks.
So we also need to create evaluations for these tasks to see how well the models are able to do them. That means we need to pay the workers in these professions a good amount of money, at their hourly rates, to create and annotate the data. These days, just the data annotation part for these more interesting evaluations is in the mid five or even mid six figures, and the trend is going up, not down. That is also really bad for academia, because it means academia can't create white collar work related evaluations. They are priced out of paying this amount of money just to create one evaluation for one paper.
I think the most popular example of these evaluations is SWE-bench Verified, where the model is placed in a GitHub repository. It gets a real GitHub issue, it is tasked to solve that issue, and then we see whether the tests pass after it has solved it. But we also have things like Terminal-Bench, which is also coding related but has broader tasks, from fixing bugs up to creating new things; machine learning tasks are also in there. So it's more reflective of broader coding tasks. Or we have things like APEX-Agents and GDPval, which are broader for white collar work. There the models have to create presentations to mimic consulting, they have to create defenses for legal work, and they have to create spreadsheets and multi-sheet spreadsheets for accounting. All these things more and more closely resemble the real work of us humans and of us white collar workers.
And the new thing here is that these models are used in a product, or more specifically a harness. The most popular harness is Claude Code.
But that also means that our evaluations need to reflect this reality. So we also need to include our harness in the stack to run these evaluations.
The harness is basically a set of tools. The tools are described by a prompt, and then there are different settings, things like compaction, the overall loop, and how the model is called. All these things define a harness.
For example, one of the more popular tools is reading a file, and all the harnesses have a bespoke tool that limits the output of this read-file tool. That way it does not clog up the agent when it reads the wrong file and you suddenly have tens of thousands of lines in your context and just bust the model. That's basically why we create harnesses in the first place, versus us just dropping the model into bash.
Examples which everyone will probably know are things like Codex, OpenCode, Claude Code, Claude Cowork and so on. We have a plethora of these right now and they are used extensively.
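A read-file tool with an output limit can be sketched in a few lines. This is a generic illustration, not the implementation of any particular harness; the function name `read_file_tool`, the default limits and the wording of the truncation notice are assumptions.

```python
def read_file_tool(path, max_lines=200, max_chars=20_000):
    """Return the start of a file, capped by line and character budgets,
    so that one bad read cannot flood the model's context window."""
    lines = []
    total_chars = 0
    truncated = False
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if len(lines) >= max_lines or total_chars + len(line) > max_chars:
                truncated = True
                break
            lines.append(line)
            total_chars += len(line)
    text = "".join(lines)
    if truncated:
        if text and not text.endswith("\n"):
            text += "\n"
        # Tell the model that more content exists instead of cutting silently.
        text += f"[output truncated after {len(lines)} lines]\n"
    return text
```

The explicit truncation notice matters as much as the cap itself: without it the model cannot tell a short file from a cut-off one. A model that was trained with one harness's version of this tool may behave differently when another harness uses other limits or a different notice.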
It also means, as we have seen in the previous talk, that these LLMs are increasingly trained in their own respective harness.
So the latest GPT model is trained inside of Codex. It knows the Codex tools really well, and it knows them better than any of the alternatives. So if you use GPT inside of another harness, you might have different results or a different experience, because the model is placed into an environment where it has tools that it has never seen before.
And the selection of this harness has the biggest impact in our whole pipeline.
We looked into the SWE-bench Verified scores with the same model. So the evaluation is the same, the model is the same, and the only thing that is changed is the harness. We looked into different reported numbers, and for Kimi K2 we had three different harnesses. We had one from the US center for AI safety, and the model performed quite badly for that time. Then we looked into the scores from mini-SWE-agent, which is from the benchmark creators themselves and basically just places the model inside of bash. And then we also looked into the score reported by Kimi themselves. The difference was 15% from the lowest score to the developer-reported score, and 15% on this benchmark basically means six to nine months of model progress. So you can jump ahead like two generations of models by using the proper harness and using the model in the correct setup, or you can use today's models and feel like it's summer 2025.
Our harness obviously has to run somewhere, which means we place it in a sandbox, and the sandbox has to run on some physical hardware. Obviously that means that the hardware we use for these sandboxes has an impact on the final results as well. There are some benchmarks which require really specific hardware. For example, there's KernelBench, where models are asked to create LLM kernels for Nvidia GPUs, and for this you need the specific Nvidia GPU because every Nvidia GPU needs a different kernel. Or we have benchmarks like PostTrainBench, where one model is asked to fine-tune a different model on one H100, and if you change the H100 to a B200 you can't compare the scores at all. So there you need these specific things. But for many of the software engineering or white collar benchmarks we basically get something like: you need two CPU cores, 50 gigs of SSD and 2 gigs of RAM, and that's it.
So everyone kind of uses these settings. But these standardized settings also have an impact on the overall result.
Sometimes a model runs a command which is resource intensive, which would then spike the resources of the sandbox and, in the worst case, kill the sandbox. For example, if I run something which is really memory intensive and I only have two gigs of memory, the sandbox would get killed. But if we allow these short bursts outside of our hardware requirements, we get better results, or fewer infrastructure errors.
And there are other benchmarks like Terminal-Bench 2 where each of the tasks is restricted by wall clock time. So each task has for example 10 minutes, some have 30 minutes, some have four hours, and these timeouts are dependent on your latency to the model provider. So if you are unlucky and call the Anthropic API at a bad time when everyone in the US is awake, you might have worse results because the model just takes longer to answer.
Or if the hardware where the sandbox runs is worse, your commands take longer to execute, so the model has less time to observe the results and to improve its score. Someone ran GPT-5.2 and initially used the timeouts that the evaluation creators set, and the score of the higher reasoning effort was lower than that of the lower reasoning effort. So they increased the timeouts by five times, which basically gave the model enough time to solve the task, and afterwards they had scores which reflected that the settings are at least on a similar capability level. So here we got a 15% increase in absolute terms just by giving the model more time, or better hardware, to execute, which again influences our final number.
So we have all these different components, and each of these components has some impact on the final number. Some of them have a way higher impact than others. But it also kind of means that you can't compare any score to any score run by someone else, because they have changed something in some way or capacity, and then you are out of luck. It also means that either you need to run these things yourself, or you need to rely on one provider or one creator who runs the same benchmarks under the same settings every time. Then you can at least compare these scores.
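The effect of latency and slow sandboxes on wall-clock-limited tasks can be approximated with simple arithmetic. The model below is a deliberate simplification (every agent turn costs one model call plus one command execution), and all numbers are hypothetical.

```python
def turns_within_budget(timeout_s, model_latency_s, command_time_s):
    """How many agent turns fit into the wall clock budget if each turn
    costs one model response plus one command execution."""
    per_turn = model_latency_s + command_time_s
    if per_turn <= 0:
        raise ValueError("per-turn time must be positive")
    return int(timeout_s // per_turn)


# Same task, same 10 minute timeout, different infrastructure.
fast_setup = turns_within_budget(600, model_latency_s=10, command_time_s=5)
slow_setup = turns_within_budget(600, model_latency_s=30, command_time_s=10)

# Raising the timeout fivefold gives the slow setup room to iterate again.
slow_extended = turns_within_budget(5 * 600, model_latency_s=30, command_time_s=10)
```

Under these assumed numbers the slow setup gets 15 turns instead of 40 for the identical task, which is one way a higher reasoning effort (longer responses per turn) can end up scoring below a lower one when the timeout is fixed.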
Our current benchmark landscape has changed drastically over time, as we have seen. We started with a single LLM which we called over an API, and now we are using LLMs with harnesses. So they became big systems. And if we look forward just a bit, we can see that things like OpenClaw, and also the other harnesses, have more and more things incorporated in them, with persistent files and users talking with their AI system over days, weeks or months, and we need to reflect this reality as well. So we are moving more and more towards more complicated infrastructure, which needs to reflect the reality of people actually using these models.
And our runtime for a single evaluation has also increased a lot. These single general knowledge samples were done in a few seconds. Now for these coding benchmarks we are running them over hours, and we are going into the era of running them for days just to get one sample done. That has created a lot of current and future problems for these systems. The first is infrastructure. I think it should be clear by now that there are a lot of moving parts. The second thing is that the models are really clever at finding ways around your setup, which is called cheating or reward hacking. So you need to find some ways to restrict the model.
The biggest problem, which I will go into in more detail, is that a lot of benchmarks are under-eliciting the capabilities, which means the benchmark creators don't work hard enough to really see what the models are capable of these days. And last but not least is cost. The cost for creating benchmarks, as I've said, is now in the mid five to mid six figure range for some of these benchmarks, and the cost of running them is also exploding, especially as more infrastructure is involved.
So, talking about cheating. It's the thing I alluded to earlier: you have some setup, and the agent loves to find some way around your setup to get to the solution faster, which is also known as reward hacking during training.
Common techniques which we have observed, for coding related benchmarks for example, are that some models found out that the Docker images the agents were placed in still had the git history. So they just jumped into the future, found the commit that fixed the bug, then applied this code and solved the problem. Or some benchmarks use the open web to make things more realistic and then have some block list so the solutions aren't easily findable. HLE is one of the benchmarks where people move more and more towards tool usage, which means that they enable web access and then say: all right, I don't want to give the agents access to Hugging Face, because the dataset is hosted on Hugging Face, and that should be enough so that the agent isn't able to find the solutions. Unfortunately, there are a lot of third-party websites which basically rehost Hugging Face, with the solutions.
Another thing I have seen more and more is that the agents found out how to embed the test results or the binaries of applications inside their code to pass the tests relatively quickly. Or they found out how the test is run and on which hardware it is run, and tried to abuse the properties of the underlying hardware.
A common defense is to remove everything that the agent does not necessarily need for the evaluation. So no access to git, no access to the web, no access to weird third-party bash scripts and so on, and the agent is restricted to the smallest set of tools that is needed.
The second defense, which we are seeing more and more, is that we use a second sandbox just for scoring the result, and we reset this sandbox and remove a lot of the things the agent has access to.
And we can also use a second LLM to observe the first LLM and then say: all right, you are now cheating the benchmark, I stop the execution here. But obviously, because these things are now running for hours, that means a lot of increased cost.
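The weakness of a block list compared with a default-deny allow list can be shown with a small sketch. The host names below are placeholders chosen for the example (the mirror is a made-up `.example` domain), and the check is deliberately simplistic: it compares exact host names only.

```python
from urllib.parse import urlparse

# Block list approach: deny the one host known to contain the answers.
BLOCKED_HOSTS = {"huggingface.co"}
# Allow list approach: permit only hosts the task actually needs (hypothetical).
ALLOWED_HOSTS = {"docs.python.org"}


def host_of(url):
    """Extract the lower-cased host name from a URL."""
    return (urlparse(url).hostname or "").lower()


def blocklist_permits(url):
    """Everything is reachable unless it is explicitly blocked."""
    return host_of(url) not in BLOCKED_HOSTS


def allowlist_permits(url):
    """Nothing is reachable unless it is explicitly allowed."""
    return host_of(url) in ALLOWED_HOSTS


original = "https://huggingface.co/datasets/some-benchmark"
mirror = "https://dataset-mirror.example/some-benchmark/answers.json"

leak_through_blocklist = blocklist_permits(mirror)
leak_through_allowlist = allowlist_permits(mirror)
```

The block list stops the original host but lets the rehosted copy through, while the allow list rejects both. The same default-deny idea applies to tools: start from the minimal set the task requires instead of removing known-bad ones.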
Now to under-elicitation, which I've said is the biggest problem for benchmarks right now. I want to give you two concrete examples. The first is ARC-AGI-3, which is this benchmark with different games, and it has multiple levels. So the LLM is placed into a level, has to figure out the mechanic, and has to apply these learnings to a second level where maybe a new mechanic gets introduced, and then a third level and so on.
It's a really hard benchmark, and the benchmark creators of ARC-AGI-3 said that the official scoring does not allow for any harnesses. So just the raw model has to be used, because they found out that people will create some really weird and niche harnesses just to solve ARC-AGI-3, and that defeats the purpose of their benchmark.
So what someone did on Twitter is that they used Codex and just gave it the problems with "slash go", which in Codex just tells the model to continue until it finds the correct solution. It ran for over 12 hours, used hundreds of dollars in API costs, and solved 61% of the public problems, whereas the official score for this model on ARC-AGI-3 is 0%. So now we have 0% for the raw model versus roughly 60% when we use Codex.
Another benchmark, which is really new and up and coming, and whose idea is something we have seen previously, is called ProgramBench, where a model has to explore and reimplement an application.
For this, the program is given to the model as a binary. No decompilation is allowed, and the model also has no access to the source code. It is then asked to replicate the binary, so basically to clone the application.
On timeouts, they restricted the model execution to six hours, so after six hours the model is cut off. They use only the mini-SWE-agent harness, no other harnesses, at least right now; they plan to use others as well. And they have a really minimal prompt which basically says: all right, this is your task, and I want you to replicate it.
The model is then graded on an automatically generated but hidden test suite. So the model has to submit the application and does not get any insight into how far along it is in its reimplementation. And the results are that basically no model is able to do anything.
The resolved rate, by the way, means that to resolve one of the tasks you need to replicate the whole application and pass 100% of all tests. So if you reach 99%, it's not counted as solving it.
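The difference between an all-or-nothing resolved rate and an average test pass rate can be illustrated with a few lines. The four tasks and their test outcomes below are invented for the example.

```python
def resolved(test_results):
    """A task counts as resolved only if every single test passes."""
    return len(test_results) > 0 and all(test_results)


def resolved_rate(tasks):
    """Fraction of tasks where 100% of the tests pass."""
    return sum(resolved(t) for t in tasks) / len(tasks)


def mean_pass_rate(tasks):
    """Average fraction of passing tests per task (partial credit)."""
    return sum(sum(t) / len(t) for t in tasks) / len(tasks)


# Hypothetical outcomes: each inner list is one task's test results.
TASKS = [
    [True] * 100,                # fully replicated
    [True] * 99 + [False],       # 99% passing, still not resolved
    [False] * 100,               # nothing works
    [True] * 50 + [False] * 50,  # half of the behavior replicated
]

strict = resolved_rate(TASKS)
partial = mean_pass_rate(TASKS)
```

With these made-up results the resolved rate is 0.25 while the mean pass rate is about 0.62, so a headline of "almost nothing resolved" can coexist with models replicating a large share of the tested behavior.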
We, which means people at Epoch and METR, and I did a really small part in this, did a similar benchmark which we started a bit earlier, called MirrorCode. In MirrorCode the model has to explore and reimplement a program. The program is given as a binary, there's no decompilation, and there's no source code access. But the big difference is that some of the test cases are given to the model, not all of them, so the model is not able to just replicate the test suite and create a lookup table. The model is able to run the test suite to check its results, and we don't impose any limits on how long the model is able to run.
Our results are that these kinds of tasks are basically solved these days. This is an application which I think has 10,000 lines of code. What we basically saw is that the more tokens we threw at the model, and we are now over 100 million tokens just for this one task, the more the score kept climbing and climbing until the task was fully resolved. So basically our result was that these kinds of things are possible right now, and it's only a question of how much money you throw at them. And speaking of money, if we were to do this for the whole task suite and with more models, just running the evaluation on its own would cost us over $100,000.
So again, I think the biggest problem we have in the benchmarking space is that we are only able to make informed decisions based on what we measure, and that measurement has to use the correct elicitation. We need to use the models correctly and push them as hard as we can to answer questions like how good they are at offensive cyber security, or how big the gap is between open and closed models. Only if we are able to answer these questions can we make informed decisions, which for cyber security means different things than for the gap between open and closed models. So if I want you to take one thing from this whole talk, it is that you should use your models in evaluations to the maximum potential possible, without spoiling the whole solution.
With that, I thank you for listening.
>> Yeah, thank you so much for this great talk. I think that had many insights in it, and I'm sure we will have some questions from the audience.
>> Thanks for the amazing talk. Could you please show the plot again with the number of tokens on the x-axis and the performance? So with Opus 4.6 it's really interesting. There's a plateau for a very long time, actually, especially given this is a logarithmic plot, and then it jumps again. Why is that?
>> It's a good question. I think, especially for reimplementation, you need to lay a lot of the groundwork first. Then, if you implement one of the broader things in an application, you can easily clear a lot of tests at once, and that's one of the things we could see here. But yeah, that jump also surprised us.
>> Thank you.
>> Thanks for the amazing talk. I wanted to know if you have any thoughts on how one might go about evaluating a system like OpenClaw, if one is interested in evaluating the memory or continual learning aspects of it.
>> Yeah, we are now seeing the first OpenClaw related benchmarks, which are starting to test these things. It basically means that you need to install OpenClaw, so you need to have the harness in place in terms of infrastructure. And in terms of design, the hard question, I think, is that you need to find out what you want to evaluate concretely. So if we are talking about memory, do you care about the stored memory in markdown files, or do you want to probe your agent after 100 or 200 turns to see whether it remembered some facts about you? All these things mean a completely different setup in terms of evaluation. So yeah, it depends on what you want to evaluate, and that can then inform your infrastructure.
I was wondering uh what your experience or what your current understanding is of tasks that are not not easily v verifiable like in the beginning you mentioned for example law or accounting uh as like for German law it's quite difficult to to have a verifiable answer so um there is for example this um white collar work benchmark called apex agents And they have law as one of the uh one of the areas. And for this they place the agent in a file system with a lot of different um different files. And these files are law related to the client and um it then the model that is evaluated has to read all these files to come up with an answer and this answer is then uh judged against a human written answer or a human written rubric with an with an LM. So they took a lawyer and said all right for this um for this problem BLM has to answer this thing this thing and this law uh or maybe a concrete number and then uh they handcrafted a rubric by the lawyer >> but it does come down to >> yeah sure but it does come down to an LLM as a judge uh approach rather than having I don't know like run a test suit that that actually proves a one onetoone correspondence.
>> Yeah, exactly. The lawyer has to write these things, and then you use the LLM as a judge because, like you said, there is no verifiable answer for law.
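The rubric-plus-judge setup described here can be sketched roughly like this. It is an illustration under assumptions, not the actual APEX-Agents implementation: `Criterion`, `rubric_score`, and `keyword_judge` are hypothetical names, and the judge is a pluggable callable that would wrap an LLM call in a real setup. Here a trivial keyword check stands in for it so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    description: str      # one expert-written rubric item
    weight: float = 1.0   # how much this item counts toward the score


def rubric_score(
    answer: str,
    rubric: List[Criterion],
    judge: Callable[[str, str], bool],
) -> float:
    """Weighted fraction of rubric items that the judge marks as satisfied."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric needs positive total weight")
    earned = sum(c.weight for c in rubric if judge(answer, c.description))
    return earned / total


def keyword_judge(answer: str, criterion: str) -> bool:
    """Stand-in for an LLM judge: passes if the criterion text appears verbatim."""
    return criterion.lower() in answer.lower()


# Hypothetical rubric items; a real rubric would be written by a domain expert.
rubric = [
    Criterion("limitation period", weight=2.0),
    Criterion("1200 EUR", weight=1.0),
]
score = rubric_score("The limitation period is three years.", rubric, keyword_judge)
```

Keeping the judge as a parameter makes it possible to swap in different judge models and check how much the final score depends on that choice.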
>> Regarding stuff like SWE-bench and so on: there was a blog post, I think by OpenAI, on SWE-bench Verified, which kind of said it's a bad benchmark, at least at this point, because lots of tests are too specific. Is this something you have to evaluate even for early benchmarks, or is that something you only consider when it plateaus? I know lots of new stuff is automatically generated, and there I feel like that danger would be even higher.
For example, the program re-implementation thing is literally that, right? It's "please do it as specifically as possible."
>> Yeah, I think you need to check it at every stage of your evaluation creation. For SWE-bench Verified, I also looked into the samples and wrote a blog post about them, like six or eight months before OpenAI. One of the problems is that some of the prompts were too ambiguous. The other problem, in terms of the general setup, is that some of the tests were too strict, and that's why the benchmark later got killed, or won't be evaluated anymore.
And to solve the ambiguity problem, the thing that we are seeing more and more is what SWE-bench Pro does, which I didn't like at the beginning. SWE-bench Pro basically has the same setup as SWE-bench Verified, with a human-written initial GitHub issue, and then they took another human who wrote a much longer issue description, which is more in-depth. But then you can evaluate that the model actually solved these tasks and didn't have to guess what the tests might look like. So I think a lot of evaluations right now need to be really specific in the initial prompts about what they want the model to solve, to then evaluate it fairly against the setup.
>> Thank you for the interesting talk. Let's say I want to evaluate our LLM setup, which uses a RAG-based approach that has a lot of knowledge connected to it. We don't have a benchmark in place, also because spending so much money on a benchmark is probably not something I can convince my boss of. But what I do have a lot of is the interaction of the users with the system. Can I somehow generate valuable insights from the users interacting with the system and say: okay, here we used this model and this part of the RAG-based system, and from all of that I can gather rough insights? Is this worth it, or would you say: no, skip this, you really need a benchmark?
>> No, I think it is almost better than static benchmarks because you have live data. The big question is what you measure and how you are able to get these things. If you just have the raw interactions without any outcome, it is of course hard to judge what actually has an impact. But if you have, for example, the user giving a thumbs up or down on those interactions, that would make it very easy, and from that you can also then bootstrap the evaluation.
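One way to turn thumbs-up/down logs into rough per-configuration insights is sketched below. The event format (a configuration name plus a boolean) and the names `wilson_interval` and `summarize_feedback` are assumptions for illustration. The Wilson score interval is a standard way to put an uncertainty range on a proportion, which helps when some configurations have only a handful of ratings.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def wilson_interval(ups: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a thumbs-up rate (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = ups / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


def summarize_feedback(events: Iterable[Tuple[str, bool]]) -> Dict[str, dict]:
    """Aggregate (config_name, thumbs_up) events into a rate and interval per config."""
    counts = defaultdict(lambda: [0, 0])  # config -> [thumbs up, total ratings]
    for config, thumbs_up in events:
        counts[config][0] += int(thumbs_up)
        counts[config][1] += 1
    return {
        config: {"n": n, "up_rate": ups / n, "ci": wilson_interval(ups, n)}
        for config, (ups, n) in counts.items()
    }


# Hypothetical log: which model/retrieval configuration served each rated interaction.
events = (
    [("model-a", True)] * 8 + [("model-a", False)] * 2
    + [("model-b", True)] + [("model-b", False)]
)
summary = summarize_feedback(events)
```

Overlapping intervals between two configurations suggest the logs do not yet separate them. The rated interactions themselves can later be reused as a seed set for an offline evaluation.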
>> Okay, I think we have time for one more final question. I think he was first.
>> Do you have a recommendation for a benchmark for a relatively simple use case, basically just customer support on a knowledge base, so more or less RAG?
>> Yeah, that's basically τ-bench. τ-bench is exactly this kind of thing, and now, in their third iteration, they started adding that the model has to discover tools as it goes through the interaction with the user. And it is really hard.
But yeah, τ-bench is just what you've described, with a user, or rather one LLM playing the user, and another LLM having to search through a knowledge base.
>> Okay. So I guess Florian will be around for a bit. If you have more questions, feel free to approach him. And let's thank him again for this great talk.