This benchmark provides an empirical baseline for front-end automation: across five launches of a seven-test React/TypeScript task, the frontier models (Opus, Sonnet, GPT and Gemini) made no mistakes, while the cheaper models discussed each failed at least one launch. It replaces anecdotal claims with measured results that developers can use when choosing a model.
Deep Dive
Benchmark of 12 LLMs on React/TypeScript: 7 Tests with Playwright
Hello guys. In this video I will show you a benchmark of a front-end task with React and TypeScript on 12 different LLMs. This is based on a comment I received from Dmytro, who pointed out, rightfully so, that most of my benchmarks and tests were for Laravel projects on back-end tasks. Fair enough. I went out of my comfort zone and tested the models on a JavaScript task. Dmytro was suggesting a Figma design, but that's not really testable automatically. How do I test if it's pixel perfect with JavaScript? Maybe it is possible, but too complicated. I wanted something a bit simpler. Let me show you. So this is the new benchmark table, blurred out for now, but I will reveal the results in this video. And this video will have two versions, free and paid. The free version will show you the results: how many mistakes were made by each LLM and which one, or which ones, performed the best.
And then in the paid video I will talk about price, where I also compared the API pricing of Opus, GPT and Cursor Composer 2.5. This is for premium members of AI Coding Daily. Here's the link, and it will be a 12-minute video.
And also in that video I will show you the updated LLM leaderboard, which looked like this on May 15th. This is my tweet, with 15 points max. Now I have added five more, so we will see what the leaderboard looks like after 20 tasks. So let's dive in. So this was the test: a React and TypeScript application with the main app.tsx and with those components not implemented. Seven different scenarios, different components, with routes and everything.
So the code for the project is prepared, and then the prompt was to implement those components. The prompt is pretty detailed, with each scenario described in a fairly technical way, but missing some implementation details. Some tests, I admit, were too detailed and too easy, so the models wouldn't need to think, but some were pretty vague, with just the formatting rules, for example, without really showing the model what the code should be. So I launched this prompt on 12 LLMs. And then there were the Playwright tests, seven specific tests.
This is one case. This is another case, and so on. So, basically, checking if the behavior is correct. And the models didn't see the tests: they were hidden until the prompt was executed. Then I copied the tests into the folder, launched them, and checked whether all seven passed.
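The rule described above, where a launch counts as successful only if all seven hidden Playwright tests pass, can be sketched in TypeScript. This is a minimal illustration, not the author's actual script: the `TestOutcome` shape and the function names are assumptions, and the outcomes would come from whatever the test runner reports.

```typescript
// Hypothetical shape for one test outcome; the field names are assumptions,
// not Playwright's reporter format.
interface TestOutcome {
  name: string;
  passed: boolean;
}

// The benchmark used seven Playwright tests per launch.
const EXPECTED_TEST_COUNT = 7;

// A launch passes only if all seven tests ran and every one of them passed.
function launchPassed(outcomes: TestOutcome[]): boolean {
  return (
    outcomes.length === EXPECTED_TEST_COUNT &&
    outcomes.every((outcome) => outcome.passed)
  );
}

// Names of the failing tests, useful when reading the terminal output.
function failedTestNames(outcomes: TestOutcome[]): string[] {
  return outcomes
    .filter((outcome) => !outcome.passed)
    .map((outcome) => outcome.name);
}
```

Checking the test count as well as the pass flags guards against a launch where some tests never ran, which would otherwise look like a clean result.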
And as a result, I was getting errors, something like this in Playwright, or similar to this in the terminal. Later in the testing, I made a script for myself to speed it up. So, yeah, although the tasks were about pretty basic React and TypeScript, and really well defined in the prompt, not all the models received a perfect score. And a perfect score meant, by the way, five launches: I launched each prompt on each model five times.

So, these are the results, revealed, unblurred, and you see red colors where an LLM made at least one mistake in at least one of those seven cases. And there are a few patterns we can discuss here. First, the Western models, so to speak, Opus, Sonnet, GPT and Gemini, made no mistakes in any of the five attempts. So, a clear pattern and clear winners. Cursor Composer made one mistake out of five. In other words, this test also showed that the market leaders, the frontier models, are a bit better than Chinese models or other cheaper models in most cases, and this was one of the cases where it was clearly visible. No mistakes at all.

And then if we compare the cheaper models or Chinese models, there are also levels. Kimi and MiMo, time and time again, prove themselves better than the others. Both of them made only one mistake out of five attempts, which is much better than the three mistakes by DeepSeek version 4 Pro. Pro, not Flash. GLM also made three mistakes, and then come the worst ones, as expected. This is also what I see in multiple benchmarks: how bad the MiniMax models are, and also Qwen.

Some of you in the comments point out that the results depend on the harnesses, and I agree with that. I test all of those in OpenCode to avoid subscriptions or putting my credit card into those providers directly. So OpenCode is my way to access them and pay only one provider, which is OpenCode. In theory, the same models in the harnesses and agents of those providers could be better.
But currently, I'm not planning to test those scenarios, and I also know a lot of people who would do the same thing: use OpenCode to access Chinese models. From what I see online, this is actually one of the most typical use cases of OpenCode in general. So yeah, this is one conclusion: the trend between these models and those models.

So yeah, this is the end of the free version of this video here on YouTube. Pretty clear winners, I would say, or a family of winners, you could say. The full 12-minute video, where I compare the price, the speed, and also the updated benchmark from four projects all together with 20 points max, is the premium video for AI Coding Daily members. By purchasing the premium membership, you're actually supporting me to continue with those benchmarks, tests and comparison videos, both on premium and here on YouTube. That's it for this time, and see you guys in other videos.
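The scoring described in the video, five launches per model with a perfect score meaning no failed launch, can be sketched as a small tally. The data below is made up purely for illustration (`model-a` to `model-c` are placeholders, not results from the benchmark), and the function names are assumptions.

```typescript
// Each model was launched five times on the same prompt.
const LAUNCHES_PER_MODEL = 5;

// One boolean per launch: true when all seven tests passed in that launch.
type LaunchResults = boolean[];

// A "mistake" here is a launch in which at least one test failed.
function countMistakes(launches: LaunchResults): number {
  return launches.filter((passed) => !passed).length;
}

// A perfect score means all five launches were recorded and none failed.
function isPerfectScore(launches: LaunchResults): boolean {
  return launches.length === LAUNCHES_PER_MODEL && countMistakes(launches) === 0;
}

// Order models from fewest to most mistakes; ties are broken by name.
function rankByMistakes(
  results: Record<string, LaunchResults>,
): Array<{ model: string; mistakes: number }> {
  return Object.entries(results)
    .map(([model, launches]) => ({ model, mistakes: countMistakes(launches) }))
    .sort((a, b) => a.mistakes - b.mistakes || a.model.localeCompare(b.model));
}

// Placeholder data for illustration only, not the benchmark's results.
const placeholderResults: Record<string, LaunchResults> = {
  "model-c": [false, true, false, true, false],
  "model-a": [true, true, true, true, true],
  "model-b": [true, false, true, true, true],
};

console.log(rankByMistakes(placeholderResults));
```

Counting failed launches rather than failed tests matches how the results are reported in the video ("one mistake out of five attempts"), at the cost of hiding how many of the seven tests failed within a bad launch.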