Effective chess AI systems separate concerns: a classical chess engine such as Stockfish handles position evaluation, dedicated detectors flag tactics such as forks and pins along with positional themes, and the LLM serves only as a translator that explains these structured signals in natural language. Keeping the LLM from reasoning independently limits hallucinations, and using a fast model keeps latency around the 3-second target for a consumer application. User feedback triggers an autonomous agent that iteratively refines the pipeline.
Deep Dive
Building a Chess Coach — Anant Dole and Asbjorn Steinskog, Take Take Take
Afternoon, everyone. So, our next talk will be something a little bit different. We're going to dive into the world of chess. Quick show of hands, who has heard of Magnus Carlsen?
Okay, fantastic. No introduction needed, but widely considered the best chess player in the world. He also founded a company called Play Magnus.
This is where my colleague Asbjørn and I, Anant, currently work. Today we're going to talk to you about how we built our AI chess coach, which is now in production and which you can use.
So, first up, a quick agenda. We'll briefly discuss a bit more about Play Magnus: what it is we actually built and what it is we actually launched. Asbjørn will then go into a quick history of chess and AI. There are a lot of links there. We'll briefly touch on why LLMs are actually bad at chess and how we managed to solve this problem.
We're then going to deep dive into actually understanding our game review and sort of closing the loop with our autonomous agent. And you'll get a demo.
And then finally, some latency versus quality trade-offs, as this is a consumer-focused AI application. And then lastly, some learnings.
So, first up, what is Play Magnus? In its simplest form today, it's currently an iOS and Android application. You can go on and play your friends, and you can post about your games.
What's relevant for our particular talk is that after you play a game, you get presented with our game review. And this is powered by our AI pipeline.
So, for example, just to show you how it works: this particular position is leading to a checkmate. With the last move, White has moved the knight from this yellow square over here on f3 and captured the pawn on e5.
It is a brilliant move, so it automatically gets the brilliant annotation. The commentary below is actually generated by our system, using an LLM and the pipeline we'll get into in a second. What's quite interesting is that we're able to give you the nuance of why it is a tactic, which tactical and positional detectors have fired, what threats you're creating, and the why behind the move. So, that's the system we're going to be talking about today.
Finally, in the last step of our application, we've started revealing insights about your play. These could be things like how accurately you played in a particular game phase, your current rating, or your current depth in a particular opening.
And these insights form the next layer of analysis that we present to the coach, who then gives them back to you as opportunities for learning and improving. We hope by using this, you'll be able to improve and become better at the game.
All right. So, first, a brief history of chess and AI, since they've been intertwined for so long, just to give you a little bit of backstory. In 1949, Claude Shannon, the OG Claude, wrote the paper "Programming a Computer for Playing Chess." In it he proposed that there are two types of chess engines, type A and type B.
Type A were the brute-force engines that search through all possible moves to figure out the best one, while type B were the kind we know from 2017 onward, which selectively pick out the best moves to look at.
Back then, he assumed that we would need type B computers to play chess, because computers were so weak that you couldn't search through the whole tree of moves. But computers quickly became better, and people just started scaling these type A computers.
They got better and better until, in 1997, Deep Blue beat Kasparov, the first time a chess engine beat the best chess player of the time.
So, people didn't really bother with these type B computers, these intuitive engines, for a while, until DeepMind, shout out to DeepMind, first released AlphaGo. Go is a much more complex game than chess, so you can't solve it with type A computers. You need this intuitive, neural network approach that selectively figures out which lines to calculate.
After that, they released AlphaZero, which could play not only Go but also chess and shogi.
Some years later, LLMs came along, people started playing chess against them, and it quickly turned out that they can't really play chess.
Sometimes they make the right moves, and to an extent they can play a nice opening, but they quickly start to hallucinate.
So, let's see if we can show the video. "Grok went to... Yeah, we see the Poisoned Pawn line with Queen b6 early on, and lost pretty badly, not necessarily because of the opening, but because it doesn't really know how to play chess."
That was Magnus Carlsen commenting on an LLM chess tournament from our office in Oslo.
That was a tournament organized by Kaggle when they launched their Game Arena, a benchmark for LLMs playing different types of games.
One of them was chess, and now they've started to add more games. They recently added Werewolf, where you can watch LLMs try to deceive each other in a social deduction game, which I can recommend watching.
But yeah, we see that LLMs often hallucinate, because obviously they're trained on language; they can't calculate.
High-reasoning models can calculate to an extent, playing out moves through their reasoning steps, but they quickly fall apart. There's nothing inherently wrong with the transformer architecture for playing chess, though. DeepMind has trained a transformer that, instead of predicting the next token, predicts the evaluation of a chess position, trained on millions of pairs of chess positions and Stockfish evaluations. That has actually led the transformer to play at grandmaster-level strength. But these models aren't trained on language, so they can't explain chess. So, how do we bridge the gap between these old chess computers that can understand and play really good chess and the LLMs that can explain chess?
So, we're going to go through the pipeline of how our game review explains chess in our app. When you play a game, the first thing we do is run Stockfish through the whole game. Stockfish is the leading chess engine now, a classical chess engine that calculates the best move.
So, what Stockfish says is considered to be the solution in a chess position.
We then extract a lot of context from the position, because we want to explain not only the best move but also the threats, the plans, the tactics that could arise in the position, and what you should have played: a lot of the nuances that are useful if you want to learn how to become better at chess.
So, we have a lot of detectors that try to figure all of this out: the forks, pins, skewers, and positional and structural themes.
Doubled pawns, for example, are a disadvantage, so we need to be aware of all those kinds of things.
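As a rough illustration of what a detector can look like, here is a minimal sketch of two of the themes mentioned above, doubled pawns and a knight fork, working directly on a FEN string. The function names and output shapes are hypothetical, not the talk's actual implementation, and the checks are purely geometric: the fork detector does not ask whether the knight can simply be captured.

```python
from collections import Counter

FILES = "abcdefgh"
KNIGHT_JUMPS = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]


def parse_board(fen: str) -> dict:
    """Map (file, rank) pairs, both 0-based, to the piece letter found there."""
    board = {}
    for row_index, row in enumerate(fen.split()[0].split("/")):
        rank = 7 - row_index  # FEN lists rank 8 first
        file = 0
        for ch in row:
            if ch.isdigit():
                file += int(ch)  # a digit means that many empty squares
            else:
                board[(file, rank)] = ch
                file += 1
    return board


def doubled_pawn_files(fen: str, white: bool) -> list:
    """Return the files on which the given side has two or more pawns."""
    pawn = "P" if white else "p"
    counts = Counter(f for (f, _), piece in parse_board(fen).items() if piece == pawn)
    return sorted(FILES[f] for f, n in counts.items() if n >= 2)


def knight_forks(fen: str, white: bool) -> list:
    """Return knights of the given side attacking two or more enemy kings, queens, or rooks."""
    board = parse_board(fen)
    knight = "N" if white else "n"
    targets = {"k", "q", "r"} if white else {"K", "Q", "R"}
    forks = []
    for (f, r), piece in sorted(board.items()):
        if piece != knight:
            continue
        # Off-board squares are never keys in the board dict, so they are skipped.
        attacked = sorted(
            FILES[f + df] + str(r + dr + 1)
            for df, dr in KNIGHT_JUMPS
            if board.get((f + df, r + dr)) in targets
        )
        if len(attacked) >= 2:
            forks.append({"knight": FILES[f] + str(r + 1), "attacks": attacked})
    return forks
```

Calling `knight_forks("r3k3/2N5/8/8/8/8/8/4K3 b - - 0 1", white=True)` reports the knight on c7 attacking a8 and e8. Structured output like this, rather than free text, is the kind of signal that can later be handed to the language model.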
There's also a novel chess engine called Maia, which comes out of a research project at the University of Toronto. Instead of building a chess engine that is trained to play the best moves, they have trained a neural network that predicts the moves humans would play in a given position. So, given a rating, for example an online rating of 1,500, it outputs a probability distribution over all the moves in the position.
By doing this, we can say that a move is the best move, which we know because of Stockfish, and also that it is really hard to find, because the probability of playing it at certain levels is so low.
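The combination described above can be sketched as a small function that takes the engine's best move and a human-move probability distribution for one rating band and derives a "hard to find" signal. This is only a sketch under assumptions: the function name, the 0.10 cutoff, and the sample probabilities are all hypothetical and not taken from the talk or from real model output.

```python
def find_difficulty(best_move: str, human_move_probs: dict, hard_below: float = 0.10) -> dict:
    """Combine the engine's best move with a human-move distribution for one rating band."""
    # Probability that a human at this rating plays the engine's choice.
    p_best = human_move_probs.get(best_move, 0.0)
    # The move humans at this rating are most likely to play instead.
    most_likely = max(human_move_probs, key=human_move_probs.get)
    return {
        "best_move": best_move,
        "probability_at_rating": p_best,
        "most_likely_human_move": most_likely,
        "hard_to_find": p_best < hard_below,  # hypothetical cutoff
    }


# Illustrative numbers only, not real model output.
probs_at_1500 = {"Nxe5": 0.04, "d3": 0.55, "O-O": 0.41}
signal = find_difficulty("Nxe5", probs_at_1500)
```

With these sample numbers, the best move is flagged as hard to find because only a small share of players at that rating would be expected to play it.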
All of this information we feed to the LLM. For now, the LLM's job is only to translate this information into English, because we really don't want it to try to figure out too much on its own, which quickly leads to hallucination. It still hallucinates sometimes, but we want everything to be grounded in the information that we give it.
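One way to picture the "translator only" role is a prompt builder that serializes the precomputed signals and tells the model to stay within them. The sketch below is an assumption about how such a prompt might look; the field names, detector labels, and instruction wording are placeholders, not the production prompt.

```python
import json


def build_commentary_prompt(signals: dict) -> str:
    """Render precomputed engine and detector output into a grounded prompt."""
    payload = json.dumps(signals, indent=2, sort_keys=True)
    return (
        "You are a chess coach. Explain the move below in plain English.\n"
        "Use ONLY the facts in the JSON. Do not calculate variations and do not "
        "mention tactics, threats, or evaluations that are not listed.\n\n"
        + payload
    )


# Hypothetical signal bundle; the field names and detector labels are placeholders.
signals = {
    "move_played": "Nxe5",
    "classification": "brilliant",
    "detectors_fired": ["knight_captures_pawn", "mating_attack"],
    "hard_to_find": True,
}
prompt = build_commentary_prompt(signals)
```

Because every claim the model can make is present in the payload, a wrong comment can be traced back either to a wrong signal or to the model straying from its instructions.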
And that could result in a comment like this. If you play chess or know about chess: this is a game that I played. My opponent played f5 here, which is a bad move. So, by using Stockfish, you get a bad move indicator.
But it's not so useful to just know that it's a bad move. So, we run our detectors to figure out that, "Okay, f5 is threatening to trap my queen." You can also see it draws a line with Bishop g5. But it can also say, "While it threatens to trap the queen, I can just capture the pawn in the middle, because that defends this square so that my queen can get out of the situation."
So, that's how we get to that situation.
Now I'm going to explain a bit about how we improve our game review using agents. We have closed the loop from user feedback to the pull request, essentially with humans in the loop. So what happens when users downvote the commentary in the app? You can downvote it if you think it's bad.
It posts it to Slack, but it also sends it to a Claude Code channel. Channels is a new feature in research preview that is essentially an MCP server that can inject events into a running Claude Code session, so kind of like OpenClaw, if you've used that. You have this continuously running channel, and you can inject events into it. Then Claude Code starts working on the commentary. It gets all the information.
It runs a commentary triage skill that we created, which outlines the process for investigating what's wrong in the position. It has some scripts to actually run the generation, so it can, for example, modify the prompt, change some of the detectors, or create new detectors. Then it can generate the commentary again given this new information and verify its own work.
It will also ask questions back in Slack, so I could be on the bus and get a message from Claude, who is working on this problem, asking me whether this seems right, and I can guide it.
If it looks right, I'll just tell it to submit the PR, and I open GitHub on my mobile.
It works fine, and I merge it.
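The shape of this feedback loop can be sketched with in-memory stand-ins. Everything here is hypothetical: the event fields, the skill label, and the `regenerate` and `looks_fixed` callables are placeholders for illustration, and a list and a `Queue` stand in for Slack and the agent session rather than calling any real API.

```python
from dataclasses import asdict, dataclass
from queue import Queue


@dataclass(frozen=True)
class DownvoteReport:
    game_id: str
    ply: int
    fen: str
    commentary: str


def route_report(report: DownvoteReport, slack_outbox: list, agent_events: Queue) -> None:
    """Fan one downvote out to the human channel and to the agent's event queue."""
    slack_outbox.append(
        f"Commentary downvoted in game {report.game_id} at ply {report.ply}: {report.commentary}"
    )
    agent_events.put(
        {"type": "commentary_downvoted", "skill": "commentary-triage", "report": asdict(report)}
    )


def triage(event: dict, regenerate, looks_fixed) -> dict:
    """One triage pass: regenerate the comment, self-check it, then wait for a human."""
    candidate = regenerate(event["report"])
    return {
        "candidate": candidate,
        "self_check_passed": looks_fixed(candidate),
        "status": "awaiting_human_approval",  # a person still approves the pull request
    }
```

The point of the sketch is the ordering: the report reaches humans and the agent at the same time, the agent verifies its own regenerated output, and nothing merges without a person approving it.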
I'm going to show how this works. So we have a running Claude Code channel here. Here is the Slack channel where the commentary appears. This is just me having tested it a bunch of times. I'm going to open up the app on my phone.
Go to a comment and report it as bad.
Now we see it posts a comment. We can see the position. The commentary that was generated is there.
Now, I haven't really looked at the commentary, so it's probably good. But we can also see that it gets injected into the Claude Code channel, which invokes the commentary triage skill and starts working. This is running on high effort, so it could take a while. I'm thinking we should just go to the next slide and then get back to it to see if something is happening.
Fantastic. So we'll come back to that in a few seconds. As we built this for end users, we really had to consider the trade-off between latency and quality. Typically, when you finish a chess game, you want to get the analysis and the results pretty quickly. You want to cycle through the moves one by one. So we really couldn't show you a "coach is thinking" screen indefinitely while reasoning tokens run in the background, as an example.
So we had to get this done in a way that felt almost instant. In the AI world, that's a few seconds at best. So we're aiming for sub-3 seconds to generate our coach feedback.
How do we do this? We use Gemini 3 Flash. Time to first token is typically about a second. End-to-end latency is about 3 seconds on average, which roughly meets our criteria. We have experimented with reasoning models, and we'll get into that on the next slide. With those, the analysis is not incorrect, so the quality is definitely good, but the challenge is that it's unpredictable how long they will take to finish. So we have a new set of features planned for a more "chat with your coach" type of experience, where we can expect the user to be more patient and wait for a response, rather than in the phase where it needs to be more instantaneous.
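The two numbers quoted here, time to first token and end-to-end latency, can be measured around any streaming response. The sketch below assumes the model client exposes the response as an iterable of text chunks; `fake_stream` is a stand-in for that client, and only the 3-second budget comes from the talk.

```python
import time
from typing import Iterable, Iterator

LATENCY_BUDGET_S = 3.0  # end-to-end target mentioned in the talk


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a streamed response and record time to first token and total latency."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk has arrived
        parts.append(chunk)
    total = time.perf_counter() - start
    return {
        "text": "".join(parts),
        "ttft_s": ttft,  # stays None if the stream was empty
        "total_s": total,
        "within_budget": total <= LATENCY_BUDGET_S,
    }


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming model response."""
    for chunk in ("Knight takes e5 ", "is a brilliant move."):
        time.sleep(0.01)
        yield chunk


result = measure_stream(fake_stream())
```

Tracking both values separately matters for this kind of product: a fast first token lets the interface start rendering while the rest of the comment is still arriving.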
The last thing about quality is that Asbjørn and I are both good chess players, so we ultimately have the final say. When we look at a position, we use how we would calculate and how we would play, and compare that to the LLM's response. This allows us to evaluate whether it's doing the right thing or not.
So if we talk about evals in more detail: like I said, Gemini Flash is our benchmark, but we have multiple chess scenarios. Currently we have 16 different scenarios that we created. These are built around themes like tactical patterns, blunders, and limiting hallucination. As an example, there might be a knight fork in a particular chess position, and we're trying to assert that the LLM actually understands and mentions this when we run it through with our context engine. And how do we do this? We extract scenarios from real games. We use LLM-as-a-judge, a very powerful technique, to test.
We then run the model through OpenRouter. OpenRouter has come in handy because new models are being released so fast and so frequently, and we just want to be able to quickly swap in a new version of Gemini, or check out the latest GPT-5 model or one of the Claude models. We then compare and see the quality, ultimately relying on our own skill to detect whether it is good or bad. As a final point on this, we ran all three models, Gemini, Claude, and GPT-5, and Gemini Flash is typically at about 75%. It still doesn't pass all the scenarios we've set, so we're always checking whether a new model will get through some of the tricky cases we've set up.
Claude with more thinking gets us to just under 60%, but the latency is much longer. GPT-5 mini, a smaller model, gives us lower latency, or rather slower latency, but lower accuracy as well. So we continuously run through these to stay up to date.
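A scenario-based eval of this kind can be sketched as a list of scenarios, a pluggable judge, and a pass rate. This is a minimal sketch under assumptions: the `Scenario` fields are hypothetical, `keyword_judge` is a simple substring check standing in for an LLM judge, and `stub_model` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    context: dict      # structured signals fed to the commentary model
    must_mention: str  # concept the commentary has to surface


def keyword_judge(commentary: str, scenario: Scenario) -> bool:
    """Stand-in for an LLM judge: pass if the required concept is mentioned."""
    return scenario.must_mention.lower() in commentary.lower()


def pass_rate(generate: Callable[[dict], str], scenarios: list, judge: Callable) -> float:
    """Share of scenarios whose generated commentary the judge accepts."""
    passed = sum(judge(generate(s.context), s) for s in scenarios)
    return passed / len(scenarios)


def stub_model(context: dict) -> str:
    """Stand-in for a model call; it only handles the knight fork theme."""
    if context.get("theme") == "knight fork":
        return "The knight fork wins material."
    return "A solid developing move."


scenarios = [
    Scenario("fork", {"theme": "knight fork"}, must_mention="fork"),
    Scenario("blunder", {"theme": "blunder"}, must_mention="blunder"),
]
rate = pass_rate(stub_model, scenarios, keyword_judge)
```

Keeping `generate` as a plain callable means a different model can be swapped in without touching the scenarios or the judge, and the resulting pass rates stay comparable across models.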
Last, our learnings, and how they can apply to your world.
Number one, it's really important to separate the data pipeline from the language generation. LLMs can do a lot of different things, but if you need low latency, it's a good technique. Really try to close the loop with autonomous agents. The kind of flow that Asbjørn showed is now very common and very powerful, and really allows you to iterate quickly. Always try to build a very clear context extraction model. Unfortunately, in the beginning this is a very slow, painful process. Ultimately it's a large JSON file: you start big, you prune step by step, and you see how quality improves over time. Automated evals really do help, and I'm hoping in your domains you also have a set of SMEs that you can rely on to evaluate the output. Sometimes that's not the person actually building it; it could be someone else who is a domain expert. So remember to partner if needed.
One last thing, on a fun note, before we go back to the output of the coding agent: we have some chess sets on the third floor at the entrance, which you may have seen. We're going to host a chess simul this afternoon around 3:45 p.m.
A chess simul, for those who are unfamiliar, is when one person, in this case me or Asbjørn, plays multiple people at the same time. We have four chess sets, so we will play four people at once. We'll get slightly more time, because we have to walk around to play multiple boards. If you happen to play and you happen to win, you will get one of the wooden chess boards at the end of the event. They're very nice, high-quality chess sets. If no one wins, we will still determine who the two best players are and give them a set. And if everyone wins, we need to buy more boards. Hopefully not everyone wins. There's a QR code if you want to sign up, or you can just stop by at 3:45. You're welcome to do that. And then, yeah, let's close it off. Let's go back to see what has been happening here. Oh, we see it's still thinking. It has actually added a comment: "Looking into this now, investigating the position."
"Quick question: what specifically feels wrong about the commentary?" Yeah, it got me there. There's nothing wrong.
You're absolutely right. Nothing wrong.
So yeah, now it's going to close this off, because it worked well.
Fantastic. Well, thank you so much and happy to take any questions.