
Opus 4.8 vs 4.7 Benchmarks: Higher Recall and Reduced Noise Explained
Added: 2026-05-29

1,882 views | 42 | 12:34 | CodeRabbitAI | Original Release: 2026-05-28

Opus 4.8 successfully balances higher recall with reduced noise, transforming AI code reviews from a source of friction into a high-signal utility. This refinement proves that better reasoning is the key to moving beyond pedantic nitpicking toward genuine developer collaboration.

[00:00:00]Hello and welcome everyone. It's another Thursday and another big model release.

[00:00:04]Today we're going to talk about Opus 4.8, the model that just dropped from Anthropic. I'm here with Gautam from our Applied AI team at CodeRabbit. Gautam, thanks for joining.

[00:00:13]>> Thanks for inviting me.

[00:00:14]>> Last time we did the pod, we talked about 4.7, not too long ago.

[00:00:19]And now 4.8 is out and the story feels a little bit different.

[00:00:23]Can you give us a little bit of a brief?

[00:00:24]What is so exciting about this model and what changed?

[00:00:27]>> Sure. When we evaluated Opus 4.7, the reception internally was mixed, and the sentiment across online forums was divided as well.

[00:00:37]But when we evaluated Opus 4.8, we saw that the model has improved a lot.

[00:00:43]Where Opus 4.7 was too defensive and suggested too many optional checks, Opus 4.8 does a better job in the code review space.

[00:00:53]>> Mhm, okay. And what does that particularly mean? Like for me as a developer trying out the model today, what would I see to be different when I'm using it for coding or when I'm using it in code review?

[00:01:04]>> In the code review space, if you used Opus 4.7, you would have seen a ton of review comments saying "add this null check" or "add this defensive guard." There were a lot of those. With Opus 4.8, as with higher-reasoning models, that tends to be toned down. So the number of review comments that were unnecessary or too defensive has gone down. That's a major improvement, I would say.

[00:01:29]In terms of our benchmark, the score is almost the same as our baseline. Our baseline is composed of different models from OpenAI and Anthropic. So the results got better with Opus 4.8 compared to Opus 4.7.

[00:01:42]>> Mhm, I see. And you've been running benchmarks on a lot of these models, right? You basically started even with Opus 4.5. And then you told me there was a big shift to 4.6 as well.

[00:01:53]Can you explain?

[00:01:54]>> Sure. I started doing the evals with 4.6, then 4.7 and 4.8. 4.6 is when they added adaptive thinking. When adaptive thinking first came in, it used a lot of tokens. Anthropic describes it as a feature: they let the model decide how much it has to think. So for a complex task it thinks a lot, and it makes a lot of review comments, including a lot of defensive-guard comments.

[00:02:18]In the code review space, this is a bit sensitive, because there is a fine line between what a developer calls noisy and what they call useful.

[00:02:27]With 4.6 and 4.7, most of the feedback from developers and online forums was that the model was being too defensive or guarded.

[00:02:36]That behavior may be better for the coding space, but in the review space it's a tricky area to solve.

[00:02:42]So with 4.8, they got the problem down, or at least narrowed the problem space, as I see it.

[00:02:48]>> Mhm. So effectively, with 4.7 we saw an increase in comments that some people would call noisy or overly defensive, as you say. And with 4.8, out of the box, without much fine-tuning of the prompts, we already saw it rise to a level similar to our baseline, which is the fine-tuned version of CodeRabbit, so to say.

[00:03:15]>> Exactly. You covered it all. Opus 4.7 was making a lot of defensive checks, and with Opus 4.8, without much prompt tuning, we got what we have in our baseline. So in this particular code review space, I feel it's a good model to try out.

[00:03:30]>> Mhm. Okay. Can you give some examples of particular bugs that you found Opus 4.8 catching particularly well?

[00:03:41]>> Yeah. There is one particular bug I can recall from memory. It was very subtle to catch, but Opus 4.8 had valid reasoning. It came up with its own example showing why the code could go wrong.

[00:03:56]>> Mhm.

[00:03:56]>> So it came up with this example. It used a ton of thinking tokens, and it had a justification for it as well. So Opus 4.8 did a better job of reasoning through it and finding the kind of issue that happens very rarely in production. And it was not overly defensive. The issue would really happen, but the model needs to think a lot to find it.

[00:04:17]>> I see. And you also told me that the verbosity, as well as the entire feel of the model, changed, right? The structure of the comments that were posted.

[00:04:28]>> Exactly. When we ran the evaluation comparing 4.7 and 4.8, the edginess had gone down. By edginess I mean being more authoritative. 4.7 was more authoritative, while 4.8 had more of a gentler vibe, you could say. And the comments and responses we got from the model were clearer and easier for anyone to read through.

[00:04:49]You don't have to do a second read or third read to understand what the review comments are.

[00:04:53]>> I see. Let's talk a little more about hard numbers. Many people benchmark code review tools in terms of precision and recall. What were the differences there compared with the previous models?

[00:05:06]>> That's a good question. Compared to our baseline, which is composed of different OpenAI and Anthropic models, Opus 4.8 had a 4% increase in recall. The precision stayed relatively flat, which is good.

[00:05:21]In comparison, Opus 4.7 had lower precision, which Opus 4.8 was able to recover. Opus 4.7 made a lot of comments, so precision went down.

[00:05:32]That was one of the main reasons we were not able to use Opus 4.7 in our code review, apart from long-running or highly complex tasks. So this is a good model in terms of the scores as well.

[00:05:42]>> I see. And there are also different thinking levels that you can set, separate from the adaptive thinking mode. You can also preset max and, you know, xhigh and all these different modes. It seems like that also makes a huge difference, right?
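Editor's sketch: the precision and recall trade-off described above can be made concrete with a few lines of Python. The comment IDs and counts below are invented purely for illustration and are not CodeRabbit benchmark data; `precision_recall` is a hypothetical helper, not part of any CodeRabbit or Anthropic tooling.

```python
def precision_recall(flagged: set[str], known_bugs: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one review run.

    precision = share of flagged comments that point at a real bug
    recall    = share of real bugs that were flagged
    """
    true_positives = len(flagged & known_bugs)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(known_bugs) if known_bugs else 0.0
    return precision, recall


# Invented labels, for illustration only (not benchmark data).
known_bugs = {"bug-1", "bug-2", "bug-3", "bug-4"}

# A "noisy" reviewer finds three bugs but also posts three defensive-guard comments.
noisy_run = {"bug-1", "bug-2", "bug-3", "guard-1", "guard-2", "guard-3"}

# A "focused" reviewer finds the same three bugs with one extra comment.
focused_run = {"bug-1", "bug-2", "bug-3", "guard-1"}

noisy = precision_recall(noisy_run, known_bugs)
focused = precision_recall(focused_run, known_bugs)

print(f"noisy:   precision={noisy[0]:.2f} recall={noisy[1]:.2f}")
print(f"focused: precision={focused[0]:.2f} recall={focused[1]:.2f}")
```

In this toy case both runs catch the same bugs, so recall is identical, but the extra defensive comments pull the noisy run's precision down. That is the pattern the speaker attributes to a model that comments too much.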

[00:05:55]>> Exactly.

[00:05:57]When we ran our benchmarks to gauge how the model performs, we tried two different configurations based on our internal framework. One was medium and one was high.

[00:06:07]With the lower thinking level, we saw that the model was making more defensive checks. When we make the model think more with the higher reasoning level, it smartly finds reasons to decide whether a review comment is valid or not. So recall improves with smarter thinking while precision stays the same. With the lower thinking level, our recall went down marginally, by 2% or 3%.

[00:06:34]>> Okay, so you're saying that the model basically validates its finding better when it's on higher thinking.

[00:06:38]>> Exactly. It comes up with its own examples and tries to justify it in its thinking trace, you could see.

[00:06:43]>> I see. Anthropic's prompting guide also highlights that this model is particularly good for long-horizon agentic tasks. Do you see a correlation there?

[00:06:53]>> For long-horizon tasks, yeah, I can see a correlation, because there are certain bugs in our dataset that need the model to think a lot. It has to reason on its own to find a case that could happen in production. So I believe the model is going to be good, and based on the data and the thinking traces we got from our evals, I believe this will be reflected in our reviews as well.

[00:07:14]>> I see. Let's get a little more practical. Let's say we shipped 4.8 into our system today.

[00:07:20]How would the review feel different for a reviewer who experiences CodeRabbit tomorrow?

[00:07:25]>> Interesting. The first thing they would see is that the wording is clearer. That's for sure. Based on the evaluations we did, the sentences will be clear and detailed as well.

[00:07:37]The second thing is that you would find a lot of nitpick comments if you have assertive mode turned on. With assertive mode on, you'd see all the defensive checks and nitpicks.

[00:07:48]If we had used Opus 4.7, those issues would have been flagged as major or minor.

[00:07:54]>> Mhm.

[00:07:54]>> So, I believe that's the difference between Opus 4.7 and 4.8, and that's what 4.8 could bring in here.

[00:07:59]>> Yeah, just for people who might not have used CodeRabbit before, what is the assertive mode?

[00:08:03]>> Assertive mode mostly surfaces comments that are nitpicks, the kind a developer may or may not want to address. For example, defensive checks or naming checks, that kind of stuff.

[00:08:15]>> Okay, meaning that if the developer turns on assertive mode and we enable 4.8 as the model, more comments will be moved into assertive mode that would previously have been highlighted in the normal mode.

[00:08:30]>> With Opus 4.7, all these nitpick comments would have been in major and minor.

[00:08:35]>> Mhm.

[00:08:35]>> Opus 4.8 was able to properly gauge them and say that this is a minor or a nitpick comment.

[00:08:40]>> So, bottom line, less noise.

[00:08:42]>> Less noise. More signal. Exactly.

[00:08:44]>> Yeah, so I bring in the audio.

[00:08:45]>> Yeah, no, very good.
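Editor's sketch: one way to picture the severity triage discussed above is a filter that hides nitpick-level comments unless assertive mode is on. The severity names (major, minor, nitpick) come from the conversation; the `ReviewComment` type, the `visible_comments` function, and the sample comments are hypothetical and do not describe how CodeRabbit actually implements this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewComment:
    severity: str  # "major", "minor", or "nitpick" (labels taken from the talk)
    text: str


def visible_comments(comments: list[ReviewComment], assertive: bool) -> list[ReviewComment]:
    """Hypothetical filter: nitpicks are shown only when assertive mode is on."""
    if assertive:
        return list(comments)
    return [c for c in comments if c.severity != "nitpick"]


# Invented sample comments, for illustration only.
comments = [
    ReviewComment("major", "Off-by-one in the pagination loop"),
    ReviewComment("nitpick", "Consider a null check here"),
    ReviewComment("nitpick", "Rename tmp to something descriptive"),
]

default_view = visible_comments(comments, assertive=False)
assertive_view = visible_comments(comments, assertive=True)

print(len(default_view), "comment(s) in the default view")
print(len(assertive_view), "comment(s) in assertive mode")
```

Under this assumption, a model that labels a defensive check as "major" leaks it into the default view, while a model that labels it "nitpick" keeps the default view quiet. That matches the "less noise" outcome described in the conversation.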

[00:08:47]So, for anyone listening who uses AI review of any kind, what do you think they should start doing differently tomorrow? Maybe change something in their agentic workflow or in their review loop.

[00:09:02]>> For the review loop, the first thing they could do is figure out which review effort level to use.

[00:09:09]I wouldn't recommend that everyone use medium, high, or xhigh right away.

[00:09:13]Try to gauge which thinking parameter benefits your use case.

[00:09:18]>> Mhm.

[00:09:18]>> Because with each higher thinking level, the model is going to think more, and it's going to produce a lot of output tokens, which we have to pay for.

[00:09:25]>> Yeah.

[00:09:25]>> So try to get an idea of which effort parameter works for you. Once you have that, there are certain things you can tune through prompting, based on Anthropic's prompt caching guide.

[00:09:37]>> Mhm.

[00:09:37]>> So the first thing would be to find which thinking effort parameter is beneficial, then prompt and tweak according to your needs.

[00:09:45]>> Yeah. Let's take CodeRabbit as an example. In the CodeRabbit pipeline, where would you say lower thinking modes are reasonable, and where would you use higher or even extra high?

[00:09:56]>> In the CodeRabbit space, some PRs or files are very minimal: a README change or a test file change. For those kinds of files I don't think we need higher reasoning, so we can go with lower reasoning.
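Editor's sketch: the advice above (measure each effort level before committing to one) could be turned into a small selection rule: among the levels whose recall is close enough to the best, pick the one that uses the fewest output tokens. Everything here is an assumption for illustration: the `EffortResult` type, the `cheapest_acceptable` rule, the tolerance, and especially the numbers, which are placeholders and not measurements from any benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffortResult:
    effort: str         # e.g. "medium", "high", "xhigh"
    recall: float       # measured on your own eval set
    precision: float
    output_tokens: int  # average output tokens per review


def cheapest_acceptable(results: list[EffortResult], max_recall_drop: float = 0.01) -> EffortResult:
    """Pick the lowest-token effort level whose recall is within
    max_recall_drop of the best recall observed."""
    best_recall = max(r.recall for r in results)
    acceptable = [r for r in results if best_recall - r.recall <= max_recall_drop]
    return min(acceptable, key=lambda r: r.output_tokens)


# Placeholder numbers, NOT real measurements.
results = [
    EffortResult("medium", recall=0.60, precision=0.70, output_tokens=1000),
    EffortResult("high", recall=0.63, precision=0.70, output_tokens=2500),
    EffortResult("xhigh", recall=0.635, precision=0.70, output_tokens=6000),
]

chosen = cheapest_acceptable(results)
print("chosen effort:", chosen.effort)
```

With these placeholder values the rule settles on "high": "xhigh" buys almost no extra recall for more than twice the tokens, while "medium" gives up too much recall. Loosening `max_recall_drop` shifts the choice toward cheaper levels.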

[00:10:11]If there is a change in a file which could be complex, then we can steer the model towards higher reasoning.

[00:10:16]>> Okay, is that something that we do with the adaptive reasoning or you set those in the workflow?

[00:10:21]>> We have those in the workflow, but those would be configured based on our internal system which determines or gauges the complexity of a file.

[00:10:28]>> Okay, I see. Okay. Yeah, very interesting.
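Editor's sketch: the routing idea described above (low effort for README or test changes, higher effort for complex files) might look like the function below. CodeRabbit's internal complexity system is not described in the conversation, so the heuristics, the 200-line threshold, and the name `choose_effort` are all hypothetical stand-ins.

```python
from pathlib import PurePosixPath


def choose_effort(path: str, changed_lines: int) -> str:
    """Hypothetical router: map a changed file to a reasoning effort level."""
    p = PurePosixPath(path)
    name = p.name.lower()

    # Documentation and test changes are treated as low-risk (assumption).
    is_docs = name.startswith("readme") or p.suffix.lower() in {".md", ".rst", ".txt"}
    is_test = name.startswith("test_") or "tests" in p.parts
    if is_docs or is_test:
        return "low"

    # Large diffs stand in for "complex" here; the threshold is arbitrary.
    if changed_lines > 200:
        return "high"
    return "medium"


for path, lines in [
    ("README.md", 3),
    ("tests/test_api.py", 40),
    ("src/billing/ledger.py", 450),
    ("src/utils.py", 10),
]:
    print(f"{path}: {choose_effort(path, lines)}")
```

A production system would likely use richer signals than file name and diff size, but the shape is the same: decide the effort level per file before calling the model, rather than using one setting for every review.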

[00:10:31]Lastly, let's talk about some more practical applications of the model itself. Where do you see problems that could arise when developers use Opus 4.8 for coding or other tasks?

[00:10:46]What would be a word of caution or a tip you could give someone, within reason, based on what you've tested so far?

[00:10:53]>> That's a good question.

[00:10:54]I can't recall anything that would be a caution for a developer using 4.8. It's very early to comment on that, but I would recommend that a developer pick a thinking level only after testing their use case. I would not recommend jumping straight to high or xhigh.

[00:11:13]>> Yeah.

[00:11:13]>> It's better to get a sense of what the model does and adjust accordingly.

[00:11:17]>> Okay, okay. I guess we're also still experimenting with the right prompting modes and fine-tuning the system to it, yeah.

[00:11:23]That makes a lot of sense. Maybe a last question to wrap it up: based on your testing, what is one thing about the model that could be improved, the last thing you would need before flipping it on at CodeRabbit immediately?

[00:11:40]>> The last thing I would do before turning on the model: in our evals we found that the model sometimes thinks a lot and spits out a lot of output tokens. So, >> Which is costly.

[00:11:51]>> Which is costly. We have to pay for it.

[00:11:53]So I have to try the various prompts they have shared with us in the prompt guide.

[00:12:00]Experimenting with those would be one thing we have to do before releasing the model to production.

[00:12:04]>> Mhm.

[00:12:04]>> But I'm excited to see how the model performs in the wild, because the benchmarks are internal. I want to see how it works in the wild with customers, how they feel about the model, and what other use cases we can use these models for.

[00:12:17]>> Yeah, okay, great. Well, that wraps up our little series here. Thank you so much for coming on, Gautam.

[00:12:22]>> Thanks for inviting me. It was a pleasure to be in this podcast. Thank you.

[00:12:26]>> [laughter]
