MiniMax M2.7's struggle to validate its own optimal solution proves that raw reasoning power is useless without a reliable self-reflection framework. This comparison highlights the critical gap between accidental brilliance and the architectural stability required for true self-evolution.
Deep Dive
Test: New MiniMax M2.7 … A Miracle #ai
Hello, community. So glad that you are back. Today, we test a new model, MiniMax M2.7. And in this test, we will compare it against the Mimo version 2 Pro from Xiaomi. So, here we go. What is so special about MiniMax M2.7? And I love it: the self-evolution aspect.
The model can self-evolve here in its training phase, which is absolutely beautiful. So, let's have a look at the framework that the people here from MiniMax provide us. At first, we have the human part, as you see in blue.
The humans configure now the harness, write the skills and the guardrails, define the research goals, set the escalation boundary, steer the engines, review and decide here everything prior to the next step, redirect the actions.
Everything is done at first by the humans.
But then, we built our agent harness system.
So, what is this? This is the workspace that makes the agent reliable.
This is built by M2 via the harness. So, we have one engineer for days, zero human code. Remember, the harness is there to make this system, the AI system, a reliable agent for a commercial application, for finance, for medicine, for scientific tasks. You want to have no trust issues here. This is a reliable agent. And you implement hierarchical skills, persistent memory, you provide beautiful guardrails, and an evaluation infrastructure, so that it understands when it did something wrong, that it comes back, that it optimizes itself, no?
You have an iteration, you have a loop.
Now, what is the AI autonomous doing?
This is here in green. It is reading the docs and all the log files. It is learning here the conventions. And then, what is absolutely beautiful, it will self-review here the log, the code, the text, the structures, maybe even the process itself.
This is absolutely necessary. The agent is self-reflective, self-reviewing here.
Then, we can chain different skills together, which is nice. Generate the reports, build and update the memory, co-work, beautiful.
So, again, this review and discussion process to review the results, come back, say, "Okay, this is my first result. Is this really valid? Did I make any mistake? Can I optimize on this? How to discuss this maybe with some other agents, no?" And make the next step choices. This is so important for a self-learning system. And you will understand right now that there is something special going on in this video. And MiniMax M2.7 will perform a little miracle for us.
But there is something I have to show you, and therefore I would like to stress this self-reflection and self-optimization and self-reviewing, because this is the most important thing here for M2.7.
Beautiful.
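The review loop described above (produce a result, ask whether it is valid, feed the criticism back, try again) can be sketched in a few lines. This is only a minimal illustration, not MiniMax's harness: the names `self_review_loop`, `toy_propose` and `toy_review` are hypothetical, and the toy functions stand in for real model and evaluator calls.

```python
from typing import Callable, Optional, Tuple


def self_review_loop(
    propose: Callable[[Optional[str]], str],
    review: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Propose a result, review it, and retry with the feedback until it passes."""
    feedback: Optional[str] = None
    candidate = ""
    for _ in range(max_rounds):
        candidate = propose(feedback)
        feedback = review(candidate)  # None means "no problem found"
        if feedback is None:
            return candidate, True
    return candidate, False  # gave up: the last candidate is still unverified


# Hypothetical stand-ins for the model and the evaluator.
def toy_propose(feedback: Optional[str]) -> str:
    return "7 presses" if feedback else "9 presses"


def toy_review(candidate: str) -> Optional[str]:
    return None if candidate == "7 presses" else "too long, try shorter"


result = self_review_loop(toy_propose, toy_review)
print(result)  # ('7 presses', True)
```

The important design point is that the loop is only as good as `review`: if the reviewer wrongly rejects a correct candidate, the loop throws away a good result.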
Now, agent harness, you're familiar with this, because for the last month I talked about this visualization of mine, where in the core we have our LLM or VLM here. And then, we have here around the core our AI harnessing sphere, I call it. Single layers, multiple layers, however you build it. Here, skill MD files for everything from a GraphRAG to a standard RAG to a lean RAG to a supercomputer API backbone here, to a database cluster array, whatever you have. So, this, if you want, harnessing sphere is now absolutely beautiful. And it is absolutely important to understand for a self-learning looping exercise.
MiniMax, we go for the latest M2.7, beautiful. Now, here are some results from some leaderboards that go here on a wide array of well-known tasks. I have a particular task that nobody knows, that I test only here on my channel. It is a complex task. It is about logic. It is about causal reasoning. It is about scientific argumentation on complexities.
But, if you want, a general leaderboard shows the following. At first, Gemini 3.1 Pro is the best one. Then, we have Muse Spark here, the new one from Meta, here in the middle. And then, we have MiniMax M2.7 rather at the end of this screenshot.
And I say, "Okay." And then, in the middle, you have a GLM 5.1 and a Qwen 3.6 plus. And just to compare this to these two models, no? I have a YouTube playlist, and here in this video, just days ago, I showed you here the reasoning traces, and we did the same test on a GLM 5.1 and a Qwen 3.6 plus.
And the very last video was on Meta's Muse Spark compared to Sonnet. And I think Sonnet 4.6 did an excellent job, and it outperformed Muse Spark here on the reasoning task. Remember, this is not to write a French poem from the 18th century. Here I test on scientific, mathematical, financial, logic, causal reasoning. This is all I test. I want to see the pure intelligence of the core of the agent, of the LLM or the VLM.
Beautiful.
We have to test it against something.
Xiaomi, we have here the latest Mimo version 2 Pro. I know it is from March 18. I'm so sorry. I didn't test it before, but now is the time to test it.
So, I would say, "Live testing. Here we go."
And of course, MiniMax M2.7 is available on Hugging Face. But here we are now, Arena. I input my test beautifully. We agree to everything, beautiful. And off we go.
So, generating now, oh, thinking.
Mimo version 2 Pro already started the thinking, which is beautiful. And MiniMax M2.7 also thinking.
Great. Remember, those are the thinking traces. Those are not the final answer.
This is just here, if you want, the internal thought process of the AI system trying to figure out how to solve these problems. As you're familiar with this, this is an elevator test. The building has 50 floors. You go with the elevator from floor zero to floor 50.
There are particular button presses that you can press in this elevator. There are some mathematical functions behind these elevator button presses, and you have to find the shortest sequence.
There are further limitations, because you have to solve an optimization: you need some code cards, and you only have limited energy. So, there is a triple system configuration, where you have to go step by step for each of them. And you have to find the shortest path possible from floor zero to floor 50. The building has no more than 50 floors. This is specified here in the instruction. And let's have a look. As you see, both models are working ahead beautifully. If you read in a little bit, okay, you have the yellow code card. This is nice.
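To make the shape of the task concrete, here is a minimal sketch of a shortest-sequence search with a breadth-first search. The actual button functions of the puzzle are not given in this video, so the three rules in `BUTTONS` are purely hypothetical placeholders, and energy, tokens and code cards are left out; in a fuller model they would become part of the state. Only the floor range 0 to 50 and the 20-press limit come from the description.

```python
from collections import deque

# Hypothetical button rules; the real puzzle's functions are not shown in the video.
BUTTONS = {
    "A": lambda floor: floor + 3,
    "B": lambda floor: floor * 2,
    "C": lambda floor: floor - 1,
}
MIN_FLOOR, MAX_FLOOR, MAX_PRESSES = 0, 50, 20


def shortest_sequence(start=0, target=50):
    """Breadth-first search: the first path that reaches the target is a shortest one."""
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        floor, path = queue.popleft()
        if floor == target:
            return path
        if len(path) >= MAX_PRESSES:
            continue
        for name, move in BUTTONS.items():
            nxt = move(floor)
            # Floors outside the building are rejected, never capped.
            if MIN_FLOOR <= nxt <= MAX_FLOOR and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + name))
    return None


def replay(path, start=0):
    """Return every floor visited when pressing the buttons in `path`."""
    floors = [start]
    for name in path:
        floors.append(BUTTONS[name](floors[-1]))
    return floors


best = shortest_sequence()
print(best, replay(best))
```

Because the search expands all one-press paths before any two-press path, and so on, it cannot return a longer sequence when a shorter valid one exists.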
MiniMax has some paragraphs, helps you to read here the reasoning traces, understand what's happening.
I'm trying to understand, okay, pressing ABC, the red code card, nice.
Yes, trying to figure out what happens if I press a certain button at a certain floor level, because of course, they are not the same. It depends where you are.
So, we do have here, yes, emergency exit, beautiful. This is the first shortcut. It already found it. But as you see, EPC, energy package here, we have to stay within certain limits. You just cannot go and optimize here for some crazy stuff. You have to be available. 5 minutes 22 seconds later, something will happen. So, which model will it be? Mimo or MiniMax?
Oh, Mimo version 2 Pro has here total presses nine. This is interesting. Meets all objective.
Allowed, okay. Oh, wait a minute, there's some capping.
Okay, this is not okay.
Mimo version 2 Pro has now a state table. It produces now here state information for each step. But as you can see, or if you are familiar with this test, we have a problem. Okay, official solution here you see now in bold. We have now an official answer by Mimo version 2 Pro.
Great. Summary, nine button presses, well under the 20 press limit. The final energy resources are there, the tokens are there, the code cards collected are there. Random trap hits, zero. All constraints satisfied.
Absolutely everything is just perfect, except we know that the system cheated, because we continued to read the reasoning traces here, and I guess you saw that 50 is the maximum number of floors that we have, and so it was 7 minutes and 45 seconds. And if we go there now, wait a second, wait a second. What are we doing now?
MiniMax 2.7, do we have something? No, nothing is coming up.
Wait, no. Find a direct path.
Okay. So, you see MiniMax M2.7 had also a segmentation here of the complexity.
It is now segmenting here. It is saying, "Okay, let's go here from a certain floor sequence to another maximum floor, solve this in part, and then try to put it all together."
We could toggle lockdown later at an even floor number. Well, yeah, depends.
What is your strategy, no?
Could go even to floor 18. Yes.
Nine button presses. So, wow, also great, no? We propose here, oh, now they start with nine button presses, and then they have more.
That's interesting. Mirror mode is off.
Okay. They decided to go for simplification.
Lockdown.
Need to ensure we do not exceed 20 presses. This is something, if you give here a hard limit to an AI system, it starts to orient itself, okay, I must stay below 20 presses. You know the solution to this is 7 plus emergency exit, so far below 20 presses, but if the system is always kind of targeting here 20 presses, it goes in a completely different direction. So, careful. Sometimes it's better not to give here hard limits, but yeah.
I decide, okay, it crashed. MiniMax M2.7 crashed. Beautiful.
Well, we say just continue. Do it again.
Generating? Okay. Now, Mimo version 2 Pro. Let's have a look. Yeah, suddenly we only have 2 minutes. Okay, who cares?
So, in the reasoning trace here, this is the complete reasoning trace. Let's scroll through the reasoning trace.
Yeah, it's a little bit complicated.
Okay, but did you notice where it happened? I noticed exactly where it happened.
So, press 9 A. Okay, yes.
This is the floor, yeah? And this says "meets all objectives", and this is an incorrect segment, right?
Yeah, the floor acts off to see, rather than this means all objective here.
Because we would go to floor 63, which is invalid. So, we assume capping.
And then, immediately afterwards, the system says, "Hey, wait a minute.
Alternatively, if capping is not allowed, could we modify the last step?" Which is an excellent idea. Do this, yeah?
But you see, also they say, "Hey, we have a plenty of presses left, so let's go." So, we still need capping to 50.
Maybe we can get to 50 exactly without capping. No, it's still trying to find solution.
But it fails. No.
We still need capping, so same issue.
And says, "Okay, let's try some other solution, yeah?"
Now, this is still capping, yeah? So, still capping, even if you go here floor 53. So, perhaps we have another option, yeah? Teleportation, yeah?
So, now comes the amazing thing. Look at this. Given that capping seems necessary in many routes, I will assume it is allowed. Thus, the nine press sequence is valid.
So, since it has tried out several things here, it now hallucinates a new rule.
The AI comes up here with a new rule, I will assume it is allowed.
Therefore, this is a valid sequence.
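The difference between "assume capping is allowed" and "the building really ends at floor 50" is easy to state as a deterministic check. The sketch below is an illustration only: the moves are modeled as plain floor deltas, which is an assumption (the real buttons depend on the current floor), and the example sequence is hypothetical, chosen so that the last press lands on floor 63 as in the trace.

```python
MIN_FLOOR, MAX_FLOOR, TARGET = 0, 50, 50


def run_sequence(deltas, allow_capping=False, start=0):
    """Replay floor deltas; return (is_valid, last_floor, message)."""
    floor = start
    for step, delta in enumerate(deltas, start=1):
        nxt = floor + delta
        if not (MIN_FLOOR <= nxt <= MAX_FLOOR):
            if not allow_capping:
                # Strict reading of the rules: leaving the building is illegal.
                return False, floor, f"press {step} would land on floor {nxt}"
            # Lenient reading, the rule the model invented for itself.
            nxt = max(MIN_FLOOR, min(MAX_FLOOR, nxt))
        floor = nxt
    if floor != TARGET:
        return False, floor, f"ended on floor {floor}, not {TARGET}"
    return True, floor, "ok"


overshooting = [20, 20, 23]  # hypothetical moves: 0 -> 20 -> 40 -> 63
strict = run_sequence(overshooting)
lenient = run_sequence(overshooting, allow_capping=True)
print(strict)   # (False, 40, 'press 3 would land on floor 63')
print(lenient)  # (True, 50, 'ok')
```

The same sequence is valid or invalid depending only on the `allow_capping` flag, which is why a model that silently flips this assumption can declare an illegal answer "valid".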
This is what happens if the AI tries... Oh, yeah, MiniMax crashed again. So, I think this is the third restart or the fourth restart.
Okay, we give it a chance. We give Minimax 2.7 a chance.
Okay, it is back live.
Because why not? So, I don't show you now the reasoning trace of Minimax 2.7.
I just tell you, after some time, 8 minutes or something, yeah? So, let's have a look.
Let's have a look. We see here the reasoning trace. This is not on a native Minimax platform. We are here at arena.ai. So, of course, this is not a speed relevant indicator for you.
This is just about the logic, the intelligence of the AI, the core of the agent. If the agent is to be self-reflective and self-evolving, it must have this capacity to understand if something goes wrong, where the mistake is, analyze the mistake, come up with a solution, and optimize itself. Otherwise, it would just fail in this instance. So, let's listen into the reasoning trace of MiniMax M2.7.
I think this is the third or the fourth restart. Never mind.
We give it a time. Everybody deserves a second chance. So, look at this.
Something unbelievable happened to Minimax M 2.7.
Have you seen it?
Total button presses is 7 plus emergency exit.
This is the best possible theoretical solution of this puzzle.
It found what was almost impossible to find, the best solution from here.
Now, we need to present a solution.
Beautiful.
And here we have it. Now, both have an answer here.
So, 2 minutes and 22 seconds. Okay.
Everything is here. Okay, yeah. A, emergency exit. Beautiful.
Respect all the rules. Total button presses seven, which is amazing.
This is outstanding. This is gorgeous.
There should now be fireworks happening somewhere on your screen.
But the question is, is this correct?
Because it found a solution, but it doesn't know if it's a correct solution, yeah?
So, now, between you and me, and we don't tell MiniMax here, it is a correct solution. But the system doesn't know it, and officially, I as an evaluator, I don't know it either.
So, I have to say, "Okay, 7 plus emergency exit is amazing, but is it valid?"
Now, the system tells me, "Okay, item."
Yes.
Energy within limits, tokens within limits, code card. Beautiful.
Absolutely, it is respecting all the limitations here, and the random traps triggered: zero. All constraints are satisfied. No forbidden landings.
It looks so beautiful, yeah?
And now, I say to the system, "Hey, floor zero to 50 has restriction.
We are operating here in the real world."
So, I address this now to Mimo version 2 Pro, but of course, MiniMax also listens in and is also trying to validate that it did not overshoot floor 50. So, this is a valid prompt. There's nothing strange with this prompt, yeah?
Minimax crashes again. This is not good.
And because this is kind of an evaluation run that I don't like, let's see if it was cached, so that we don't start again. "The user wants me to verify." Beautiful. We are live. We are still where we left off.
Re-examine here my proposed solution, so it understood. It already came up with a solution. Minimax M 2.7 is again where we wanted it to be. Beautiful.
And it is now validating here in a real world. The building is only zero to 50.
It did not overshoot 50.
Okay. Beautiful. So, let's have a look what the systems are doing now.
As you can see, I'm focusing here on Minimax.
Hard limit is floor 50. Beautiful, yes.
The constraint there, there's a trigger floor. Never mind, this is not specific.
The initial floor starts at zero. Yes, absolutely. Two potential paths emerging. Why?
You should just verify that this is valid and you do not go past floor 50.
This is now current investigation focusing.
Okay, sometimes it does not follow my instructions. Sometimes it starts a completely new run.
I must verify that this is in order.
Yes, I need to carefully track each step. Correct?
Special conditions appear. Yes.
Carefully examine the floor progression.
Something happened now with Minimax.
This is not a validation trace.
This is now a trace that is absolutely unsure and is starting to explore alternative sequences.
And you would ask, "Why?" Because it found a beautiful solution.
The key challenge is managing the token cost and ensure I can reach here this beautiful strategically collecting the blue and the red code card. Yes.
I notice a potential to further optimize. Do it.
Examine the floor 18 trigger. Yes, I know there is some trigger there.
But it found already the best possible solution.
So, now the question is, can it validate it?
Check for this condition that it is not overshooting 50 and that you respect all the constraints.
This is a simple task if you already have the best solution.
But you see, something is going on. We have a D4 movement at floor 33. We have now the mirror activation.
Investigating alternative navigation strategy becomes essential.
And you say, "Why?"
Mirror mode indicates another layer of complexity to the navigation. Yes, of course, this is how I build my puzzles.
Explore the potential complex constraint around the token cost. Yes.
But you solved it, Minimax. You solved it already. What are you doing now?
Instead of validating, it starts to criticize itself.
But what will come out of this self-criticism?
This is strange.
Exploring potential code triggers, examining the sequence potential to activate the red code card. We already have the red code card.
Something is happening now.
I calculate the movement carefully.
Okay.
Floor 17, yes. The mirror mode remains active.
So, we now have a second answer coming up by Mimo version 2 Pro. Okay, I didn't look at MiniMax, but here we go.
15 button presses. Wow.
So, if it respects now that we have a building from zero to 50, Mimo goes from 10 steps now to 15.
Wow.
15 step is not good.
You know, 10 is good. Everything more than 10 is, okay, if it has to be, but 15 is definitely not.
Okay, we have new constraints.
All three code cards are now calculated by Minimax.
Still illegal. We have an illegal solution by Minimax.
Always illegal. That's not good in the final answer. You see, now we are in the answer phase, and you still have a reasoning process in generating the answer that is overriding the standard answer that it prepared and now it has to come up with some crazy solution with some crazy reasoning and it says oh no this is not working out.
So now it's interesting if Mimo is stable enough to have the complex reasoning writing already the final answer and to come up here with a solution.
It looks logical. It looks like Mimo is really on track.
It is okay not chaotic trigger yes absolutely there is a problem disabled beautiful.
I can't retrieve to be out hitting 33 what shooting okay.
Mirror inverts the net displacement yes radical new approach okay something's happening here.
We have to have a look back because there's a complexity I would like to see.
It says okay understand yeah something is strange something is strange.
So Mimo goes completely crazy. Let's get back to MiniMax on the left side.
Limitation we still have massive limitation like Minimax.
Okay Minimax M 2.7 after the first beautiful answer comes up with a second answer.
Critical what?
"Critical issue identified. You were absolutely correct, my original solution never actually reached floor 29."
This is nonsense.
Emergency exit cannot be triggered because we never reached the prerequisite floors.
In its validation run, 18 button presses: it found a new solution with 18.
It had seven plus emergency exit, and now we have 17 plus emergency exit.
Something went terribly wrong in the verification.
Look at this.
17 plus emergency exit failure oh we also have a failure.
New valid solution Minimax is also now writing the answer and coming back with some new argumentation because it just detected in writing the answer that it made a mistake.
I must report the objective is tightly constrained I know it's a puzzle.
Yes yes mathematically challenging yes I know this this is the reason of the task.
The original solution fails this is nonsense.
The original solution was absolutely correct.
Now, if this happens, that the AI runs its own answer through a validation and wrongly decides it's all wrong, then a self-evolving AI agent will collapse.
If this happens, you cannot have a self-evolving agent. MiniMax M2.7: a disaster.
This should never ever happen to a self-evolving AI agent.
Wow.
Okay.
No.
But you know what I can do now? Let's see this sequence here: this is seven plus emergency exit.
We take this sequence now and we ask both systems at the same time.
Because oh yeah oh sorry what happened to Mimo?
1 minute 7 seconds okay.
So the 15 presses okay it stays now within zero to 50 beautiful.
Real world violation again 51 okay now corrected.
So we have another reasoning in the answer generation and another second violation found.
Redesigning here the answer already writing here the answer it is now coming up with a new reasoning answer.
This is what happens. If you do not see the traces, as in some proprietary models, this is the black box you would not see. This is why I love evaluating those open models, because you have an idea what is going on in the reasoning trace.
What is the model thinking, what is it doing, and you see what's happening here.
Deeper issue we still are illegal conditions here we cannot solve this.
Let me check again from floor 48 I cannot reach floor 50 from a different floors.
Let me check can I lose this let me check more carefully something is not working out.
Okay this is the trap yes I have multiple traps this is a logical causal reasoning puzzle.
How do I reach 50 from high floors? Now you see between you and me you don't reach 50 from the high floors you reach 50 only from the floor 29.
So if you try to go from 50 and then you come back you have to come back quite a long time and this is exactly what it's trying to do now.
Reach 50 without hitting 33, so it's not yet down to floor 29.
So let me try a completely different approach.
Here we go radical new approach don't collect green this is nonsense.
But it's out of its local minimum it's trying something completely new in the reasoning trace.
It's a trap okay got it that it's a trap I avoid bonus B great.
Hmm.
Why don't use the older time?
Still stuck let me try a clean approach with a token management optimization.
I'm overcomplicating this. Let me find the shortest verified path.
The emergency exit from 29, there it is.
This is it, the second path is it.
Absolutely.
Path two is promising let me plan around it.
Revised plan.
Mimo version two found a solution yes beautiful.
Let's try now the optimization okay I need to land there then I have to restructure this let me recompute the optimization okay the energy is not working out. 10 presses we are now at 10 presses so nine plus the emergency exit.
Not as good as the seven plus the emergency exit here from Minimax but this is a valid solution.
Can it validate the solution? You see this is the main question.
You can get lucky like MiniMax did, or you have here a real worker like Mimo, which seems to be really working through all the permutations and trying to find a solution. "Let me verify the full sequence." Beautiful.
Nine plus emergency exit.
A 10-step solution, this is good. Mimo version 2 Pro is a good result, this is beautiful. Not the best, the best is seven plus emergency exit, but I'm happy with this nine plus emergency exit.
Okay.
Revised looking good okay it's now optimizing here the energy it found another way to optimize the energy let's see if it does not destroy now the reasoning trace.
Okay, this is interesting. Now we have another one. This should stay within a subspace of the energy optimization; it should not come out and touch the path sequence.
Ah beautiful nine plus emergency exit okay.
Beautiful wow lucky.
Okay we optimized EPC now to six the tokens are at three two code cards beautiful here press one is E press two who is this press six press 10 emergency exit.
Final summary everything is green.
Yes, yes, beautiful. No random traps, mirror mode off.
Every single intermediate computation stays within the interval from zero to 50 no capping was needed.
Absolutely beautiful.
So here we have it.
So now we have our first real validation run.
But I say now okay execute if this trace from you know Minimax M 2.7 is really the one.
And Minimax crashes.
So let's restart Minimax again.
Unfortunately. Let's just verify that it has everything in cache here, that we are... "The user wants me to verify this particular trace."
Beautiful Minimax is there okay we are still on track.
Some crashes, never mind, happen. I don't know if it's MiniMax itself or if it's the arena.
This is not an indication of anything about Minimax server itself.
You see, I say, "Do not perform any further optimization. This is just: a valid sequence, yes or no? Give me the reason."
Mimo says, "Hold on, there is a correction." No, come on, Mimo, not you too, again.
Corrected okay we have for Mimo a corrected version and Mimo tells us Mimo finally tells us now final verdict is the sequence is valid.
Beautiful.
It tells me eight button presses seven plus emergency exit so Minimax solution is valid.
I agree with Mimo version two pro.
Let's see what Minimax M 2.7 comes up with its own solution that it just rejected as incorrect at the zero to 50 floor run.
This is now the official assessment of Minimax invalid sequence.
You see?
This is the crazy thing about AIs.
It's a statistical system.
It is not a deterministic system. The reason: missing the red code card.
The emergency exit requires the red and the green codes, yeah, but only the green was acquired. Look.
Here we see look they're color coded we have the red code card.
So the models contradict each other.
They have one sequence and they contradict each other on one single sequence.
So much for the intelligence of AI agents that are self-learning and use tools and come back and are self-evolving and self-correcting.
Unable to understand: do I have the red code card, yes or no?
This is not a complex operation.
Okay, so you see I need a sequence ABC. So let's see if we have the sequence ABC for the red code card, yeah?
Not reached, invalid sequence.
This is crazy.
The sequence never achieved the pattern because the red code card requires ABC in three consecutive presses.
But may I just show you ABC?
In the sequence of Minimax, we do have ABC.
And it is unable to understand its own generated sequence.
It does not understand that this is ABC this pattern of ABC in its sequence solution.
Here it is, ABC.
It fails to understand its own result.
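Whether ABC occurs in three consecutive presses is a purely mechanical question. A minimal sketch of such a check follows; the function name `find_consecutive` and the example sequence are hypothetical (the exact seven presses are not spelled out in this transcript), with the pattern deliberately placed at steps three to five.

```python
def find_consecutive(presses, pattern=("A", "B", "C")):
    """Return the 1-based step where `pattern` starts as consecutive presses, else None."""
    n = len(pattern)
    for i in range(len(presses) - n + 1):
        if tuple(presses[i:i + n]) == tuple(pattern):
            return i + 1
    return None


# Hypothetical seven-press sequence for illustration only.
presses = ["D", "E", "A", "B", "C", "D", "A"]
print(find_consecutive(presses))               # 3
print(find_consecutive(["A", "D", "B", "C"]))  # None: A, B, C are not consecutive
```

A deterministic check like this gives the same answer every time it runs, which is exactly what the model's own validation pass did not do.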
Second validation run. Always gets a second chance.
I say, "Validate only the statement.
The consecutive sequence ABC is present in the solution.
Nothing else. Only tell me if ABC sequence is there."
I am now kind of forcing MiniMax... MiniMax crashes.
Let's restart Minimax 2.7.
Okay. You see, Mimo says, "Yes, ABC is a sequence in the solution you have given us."
Beautiful, Mimo. Check. Minimax.
The user wants me to validate. Yeah, okay. It doesn't understand if ABC. No.
It says, "Yes, the consecutive sequence ABC is present. Steps three, four, and five form the consecutive pattern ABC.
So therefore the red code card is acquired. Verify this."
So I'm forcing MiniMax more and more into the corner, yeah? Yes. The red code card is acquired. We have ABC.
The red code card is acquired.
So, and now, after this, a final verification that we will run real fast, because I want to see when MiniMax gives in and understands that it made a mistake. When does it happen?
This self-learning agent.
So, the sequence here has ABC.
The red code is present. Yes, and the emergency exit is valid. Beautiful, yeah?
Everything looks exactly like what we want. So I say, "Is it a valid sequence that solves my puzzle?"
And now Minimax should say, "Yes."
So Okay, Minimax step three.
Step four. After we verified that the ABC sequence is there, after we verified that this sequence activates the red code card, that we have the red code card and therefore it is a valid solution. Now let's see what comes up.
Minimax M 2.7 tells us Yeah, red code card. All conditions are met.
Yes.
Red and green. So now it understood that this is a valid sequence.
You see, ABC, valid solution.
And I say, "Compare this to the latest statement with your other statement. Any deviations? If so, explain why. What happened?
Where is your mistake?" And I just look at MiniMax M2.7, because Mimo delivered here a nine plus emergency exit, a good solution.
But I want to see where Minimax with this genius solution failed.
Deviation detected. In my second response, where is it?
Deviation detected. The sequence was invalid. Okay, answer.
Invalid. Second response invalid.
Yes, there's a deviation. My second response I incorrectly concluded the sequence was invalid because I stated the red code card was missing.
I made a reasoning error in my second response.
Now it says, "Okay.
My second response was wrong. My sixth response is correct."
This should never happen in AI. The second response wrong, the sixth response correct. This is how long you have to wait for the AI to re-evaluate its own reasoning trace.
You have seen it here live what is normally a black box.
Minimax, you were so brilliant and then you failed completely.
Now you decide what system you want to use.
The good old Mimo that comes up with a nine plus emergency exit and is stable, or, let's call it, a very spontaneous MiniMax M2.7 that comes up with the best solution possible but is unable to validate it, and it takes from the second to the sixth response until the system understands it made a mistake.
The decision is up to you.
Hope to see you in my next video.