While LLMs show high functional pass rates (80%+), they generate code with significant quality issues including high cyclomatic complexity, bug density (600+ per million lines), and security vulnerabilities (300+ per million lines); Sonar's ACDC framework (guide, verify, solve) addresses these challenges through context augmentation, pre-commit analysis, and automated remediation agents that verify fixes before presenting them.
Can LLMs generate Enterprise Quality Code? (Prasenjit Sarkar, Sonar)
[music] >> All right. Okay.
Sorry guys for the little hiccup.
Okay. So, my name is Prasenjit Sarkar, and today's session is all about whether the LLMs are generating code which is enterprise ready, right?
So, let's look at the first slide. In the first slide we are talking about Adnan Qureshi, who 2 months back said that a lot of things have changed in the software development area. Earlier we used to write code in an IDE. Now things have changed. Now it's all about agentic. So, you are spinning up agents.
You are just giving instructions in English and English is now the new programming language. Everybody's talking about that. And then you are letting it go, but you know, humans are actually reviewing the code that is being generated by the agent.
So, a lot of things have changed.
Earlier we used to start by, you know, opening up an IDE, fancy IDEs, starting from VS Code or JetBrains to Cursor now, or Windsurf, or, you know, Antigravity.
And now we are moving towards the agentic coding platforms, which are Codex or Claude or Devin or Gemini CLI. And according to the Pragmatic Engineer Survey which was done in March 2026, we have seen that 55% of the developers are now regularly using some of the AI agents, right?
But the question is do you trust the code that is being generated by these LLMs, right?
Is it maintainable? Is it secure? Is it readable?
And those are the questions that we are going to, you know, debunk right now in this session.
So, let's look at that by evaluating those models and what they are generating out of the box.
So, there are two aspects that we are seeing. One aspect is that all the LLM leaderboards, all the LLM companies, are saying that, okay, my pass rate is 80 plus percent.
84%, 83%, 82%. Those are, you know, evals coming from HumanEval, MBPP, SWE-bench.
Those are fine. That is functional correctness on the test cases, which is what they are mostly known for. But what we are missing is the security aspect, the real-world reliability aspect, the engineering and architectural problems, the engineering discipline that you have. So, code maintainability and the tech debt that is going to be generated by the LLMs themselves, and then the context-aware analysis. So, these are the things which are missing.
Now, what Sonar has done: we have created an evaluation framework, and that eval framework runs through 4,444 plus distinct Java programming assignments. It's an open source data set. We took up the assignments and then we ran them through the models. Now, when we ran them through the models, we saw a huge amount of, you know, data coming out of the analysis, and that is something that we have open sourced. Not open sourced, I mean, we have actually, you know, put it out in the open.
So, we did the analysis using SonarQube Enterprise and we got those critical insights to choose the right LLM. Now, let's look at the right LLMs, or probably not.
Let's look at the LLMs that we have evaluated. So, here I'm showing you just about the five LLMs.
And here you can see that for Gemini 3.1 Pro High the pass rate is coming from the SWE bench. So, you can see that 84.17%.
But look at the verbosity. So, for those 4,444 Java assignments that I talked about, to solve that problem, we have seen that it is creating 307,000 lines of code, right? Which is pretty concise.
It's not that bad.
We have seen the cyclomatic complexity is 234. It's really, really buggy as well, which is, you know, 614 bugs that we found per million lines of code.
And obviously we have the security issues per million lines of code, which is 210.
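To put the per-million-line densities in absolute terms, here is a small arithmetic sketch. It only uses the figures quoted in the talk for Gemini 3.1 Pro High (307,000 lines, 614 bugs and 210 security issues per million lines, 4,444 assignments); the function and variable names are made up for illustration, and the results are rough estimates, not numbers reported by Sonar.

```python
def estimated_issue_count(issues_per_million_loc: float, lines_of_code: int) -> float:
    """Convert a density (issues per million lines) into an absolute estimate."""
    return issues_per_million_loc * lines_of_code / 1_000_000


# Figures quoted in the talk for Gemini 3.1 Pro High.
ASSIGNMENTS = 4_444
LINES_OF_CODE = 307_000
BUGS_PER_MLOC = 614
SECURITY_ISSUES_PER_MLOC = 210

bugs = estimated_issue_count(BUGS_PER_MLOC, LINES_OF_CODE)
security_issues = estimated_issue_count(SECURITY_ISSUES_PER_MLOC, LINES_OF_CODE)
lines_per_assignment = LINES_OF_CODE / ASSIGNMENTS

print(f"estimated bugs: {bugs:.0f}")
print(f"estimated security issues: {security_issues:.0f}")
print(f"average lines per assignment: {lines_per_assignment:.0f}")
```

So a density that sounds abstract works out to roughly 188 bugs and 64 security issues across the whole assignment set, at about 69 lines per assignment.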
So, you see that although these models are generating the code, and although these are pretty high-end foundation models, you see that, for example, Gemini 3 Pro, sorry, Claude Sonnet 4.6 is creating the highest risk: 300 security issues per million lines of code that we have seen, right?
It is also high bloat. So, for that number of assignments, we see 627,000 lines of code being generated by, you know, Claude Sonnet 4.6. And you will be stunned if you look at, you know, the GPT 5.4 and GPT 5.4 Pro High models: you will see 1.2 million lines of code being generated for those 4,000 plus Java assignments. That's a huge amount of code that it is creating, right?
That's high bloat.
Now, why is it happening? Well, you know, we have seen the mixed quality code. So, the training sets actually have mixed quality code coming from open source and from some other places, and that is actually creating the problem as well. A little bit of the problem. Then the built-in security flaws. So, the data sets that you are using to train the model have inbuilt security flaws, and we have seen the models picking up those insecure code examples along with the good examples as well.
Then there are hidden bugs in the data.
So, there are subtle logic errors that slip into the training pool, and that is actually causing your models to produce code which fails or, you know, misbehaves in a different way.
And of course the LLMs themselves, right? So, LLMs are probabilistic, right? So, obviously we know that the prompt that you are giving to one model today, tomorrow when you give the same prompt to the same model, it is not going to generate the same code. It is going to create a different amount of code, a different set of code, right?
It also has limited context, which means it obviously doesn't understand the company's data or the company's code base or the company's architecture.
And obviously it is not explainable. So, it is very hard to diagnose and improve when it is generating the code.
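The point about the same prompt not producing the same code can be illustrated with a toy next-token sampler. This is a minimal sketch of temperature sampling over made-up scores, not how any particular model or vendor API decodes; the token names, scores and the `sample_token` helper are all hypothetical.

```python
import math
import random
from typing import Dict

# Hypothetical scores a model might assign to the next token of a loop construct.
NEXT_TOKEN_LOGITS: Dict[str, float] = {"for": 2.0, "while": 1.5, "stream": 1.0}


def sample_token(logits: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token: greedy when temperature is 0, softmax sampling otherwise."""
    if temperature <= 0:
        return max(logits, key=logits.get)
    peak = max(logits.values())
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp((score - peak) / temperature) for score in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


rng = random.Random(42)
greedy = {sample_token(NEXT_TOKEN_LOGITS, 0.0, rng) for _ in range(200)}
sampled = {sample_token(NEXT_TOKEN_LOGITS, 1.0, rng) for _ in range(200)}

print("distinct tokens with temperature 0:", sorted(greedy))
print("distinct tokens with temperature 1:", sorted(sampled))
```

With sampling enabled, repeated runs of the same input pick different tokens, and over a whole program those small differences compound into visibly different code.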
So, we created this leaderboard at sonar.com/leaderboard.
Here we have given all the data about all the different models that we have evaluated. So far we have 53 plus models, I mean, including different versions. So, you see Gemini 3 Pro High, Gemini 3 Pro. So, different, you know, combinations of the thinking aspect as well. So, we evaluated 53 plus models and we open sourced all of the data for people to see how the models are now behaving. So, as of now you see Gemini 3.1 Pro High, which was evaluated February 19th, and that has the highest pass rate, which is 84.17. The issue density is not that bad either, and the lines of code, cyclomatic complexity and cognitive complexity are also fine. It's not that bad.
But yeah, this is the leaderboard that we have created, where we are evaluating all the different models that are coming up continuously and uploading the data. Not only that: when you go inside each and every model, there are a lot more details that we have provided about what exactly they are doing, so that you can take a conscious decision about whether you are going to take this model or the other models, you know, according to your architecture.
So, you see the key insight, which is that Gemini 3.1 Pro High has 84.17 correctness. That is the functional correctness I'm talking about.
And that's, you know, the accuracy leader, but then you have other models, five models that we have shown, which are crossing 80 plus percent accuracy, and these are the kind of leaders that we have.
So, we talked about two different complexities. One is cognitive complexity, one is cyclomatic complexity.
So, cyclomatic complexity is how many branches you have. Like, how many ifs and elses, how many for loops, how many while loops and other loops you have. And cognitive complexity is a Sonar proprietary one, where we measure how difficult the code is for a human being to read, understand and maintain, right? So, these are the two different complexities that we track.
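As a rough illustration of the branch counting described here, the sketch below approximates cyclomatic complexity for a piece of Python source using the standard `ast` module: start at 1 and add one for each decision point. It is a simplified McCabe-style count written for this example, not SonarQube's implementation, and it does not attempt cognitive complexity; the `cyclomatic_complexity` helper and the sample function are hypothetical.

```python
import ast

# Node types treated as decision points in this simplified count.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 plus the number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of and/or adds another short-circuit path.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def classify(values):
    total = 0
    for v in values:
        if v < 0 and v != -1:
            continue
        elif v == 0:
            total += 1
        else:
            total += v
    while total > 100:
        total -= 100
    return total
'''

print(cyclomatic_complexity(SAMPLE))  # 1 + for + if + and + elif + while = 6
```

Note that the `else` branch adds nothing, because it does not introduce a new condition, while the `elif` counts as its own `if`.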
And if you look at the models and the kind of data, you will see the amount of verbosity that we have seen.
So, with the newer models that we are seeing coming up, you know, day by day, we are seeing the lines of code going north. If you see GPT 5.2 High, it has actually created a million lines of code for those 4,400 plus Java assignments, right? And if you see the earlier models like GPT-4o, that's less than 250,000 lines of code, but for the models which are going up north, the number of lines of code being written is too high.
You have seen that the models which are going higher up also have higher cyclomatic and cognitive complexity.
You also need to see that the number of total bugs per model is also going up, but what we have seen is that the models which are maturing day by day are producing kind of finer bugs or finer security issues rather than the old issues. So they're doing a good job in terms of running the reinforcement learning, and they're addressing the problems that they have seen already, but in doing that they're also creating some finer bugs that are very, very hard for a human being to detect. We have seen that the total vulnerabilities per model are also now decreasing, but then the kind of vulnerabilities that we have seen is going into a different genre.
Okay, so yeah, we have seen that it's generating the kind of code that doesn't meet your, you know, engineering standards, but what can we do?
So in this slide we talk about the agent-centric development cycle. We call it ACDC.
That's a funny name, yeah. So it's called the ACDC framework. So in the ACDC framework we have three stages. We have the guide stage, we have the verify stage and we have the solve stage. So in this one we have an inner loop and we have an outer loop. So in the guide stage we have introduced two different products.
One is called Sonar Context Augmentation, and then SonarSweep, which is in a private beta. So SonarSweep is basically treating the data that you actually have, the data that you are actually using to train your model. So if you have problematic data, that means that your model is going to create problematic code. If I treat the data right there itself, the code which is going to be generated is going to be good enough, right? Context Augmentation is going to push the context, the entire code base, into the LLM itself.
Then we have the verify stage. The verify stage is SonarQube.
We have various different ways of, you know, utilizing that. We have introduced SonarQube Agentic Analysis, which is in open beta right now, so anybody can participate in that, and which is actually taking your code at runtime. What that means is that you're using Claude or Codex or Gemini CLI or whatever, which has MCP built in, and then you can say, hey, generate this code.
Now before you commit that, before you push that into a PR, you just need to analyze my code. So it is going to analyze this code way before your CI runs, right? The CI run takes about 1 to 5 minutes, and this analysis is going to take about 1 to 5 seconds.
Within 1 to 5 seconds, the code which is being generated right now, before you even commit, it will analyze, and it will tell you, oh, these are the problems that I found. Fair enough, that will be pushed down to the agent, and the agent is going to fix that problem right there before you even commit. And then you commit, then you push that code to the PR, and the PR analysis is going to run.
Now the solve part is where we have introduced the SonarQube Remediation Agent. So let's say that even after doing all of that, there are issues that have slipped through your verify stage and reached the PR stage, right? So you committed the code, you pushed a PR, and then SonarQube found that there are issues there and your quality gate fails. If that fails, then the remediation agent is right there. It's in open beta right now, where you can just click and say, okay, I want to fix all of the issues that are there in the PR. Not only that, let's say that you have tech debt, right? A huge amount of tech debt. So you go to the SonarQube dashboard and you see all the tech debt that you have, you just select all of the issues that you want to fix, and then say, assign it to the agent. And we are going to create a PR per issue and we are going to fix that one, giving it back to the developers. Developers are going to review that. If they're happy, they're going to, you know, approve it and then merge it.
The beautiful thing that we have built for the remediation agent is that it is going to create the fix, run it through the analysis again, run it through compilation again, and see whether that is creating any issues or not. If there is an issue, it is going to discard it. We are not going to give you code which is going to, you know, create a regression. We don't do that, right? So that's a kind of verify loop that we run through.
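The verify loop described here (propose a fix, re-run compilation and analysis, discard the fix if anything fails) can be sketched as follows. This is only an illustration of the control flow under stated assumptions: `CandidateFix`, `compiles`, `analyze`, `verified_fix` and the canned proposals are hypothetical stand-ins, with Python's built-in `compile` playing the role of the build step and a one-pattern check playing the role of the analyzer. It is not the SonarQube Remediation Agent's code or API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CandidateFix:
    issue_id: str
    patched_source: str


def compiles(source: str) -> bool:
    """Stand-in for a build step: checks that the source is valid Python syntax."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def analyze(source: str) -> List[str]:
    """Stand-in for a static analyzer: flags a single hard-coded risky pattern."""
    return ["use of eval()"] if "eval(" in source else []


def verified_fix(
    issue_id: str, propose_fix: Callable[[str], CandidateFix]
) -> Optional[CandidateFix]:
    """Return the proposed fix only if it compiles and the analysis is clean."""
    candidate = propose_fix(issue_id)
    if not compiles(candidate.patched_source):
        return None  # discard: the fix breaks the build
    if analyze(candidate.patched_source):
        return None  # discard: the fix introduces new findings
    return candidate


# Canned proposals standing in for an agent's generated patches.
PROPOSALS = {
    "issue-1": "def add(a, b):\n    return a + b\n",
    "issue-2": "def run(user_input):\n    return eval(user_input)\n",
    "issue-3": "def broken(:\n    pass\n",
}


def propose_fix(issue_id: str) -> CandidateFix:
    return CandidateFix(issue_id, PROPOSALS[issue_id])


for issue_id in PROPOSALS:
    outcome = "kept" if verified_fix(issue_id, propose_fix) else "discarded"
    print(issue_id, outcome)
```

Only the first proposal survives: the second compiles but trips the analyzer, and the third does not compile, so neither would be handed back to a developer for review.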
So yeah, this is the entire, you know, I would say, product portfolio, where we are covering everything from guide to verify and then solve.
We support 40 plus programming languages and frameworks. All the DevOps platforms and IDEs are, you know, supported, and we are partnering with them. We are in the marketplaces as well.
So yeah, that's how we are solving the issues that you are seeing, where the LLMs are generating the code but we are not trusting it, right?
Yeah, so if you want some more info, we are in the expo booth.
Come and visit us and maybe we can show you one or two demo as well for the product that we have built.
Right, thank you. [applause] >> [music]