SkillOpt cleverly replaces expensive weight updates with an automated feedback loop for text-based instructions, proving that refining the "logic layer" is more efficient than retraining the model. It marks a significant shift toward modular intelligence where AI improves its performance by perfecting its own procedural knowledge in plain text.
Approfondir
Prérequis
- Pas de données disponibles.
Prochaines étapes
- Pas de données disponibles.
Approfondir
Self Evolving AI Skills w/ GPT-5.5 (SkillOpt)Ajouté :
Hello community. So great that you are back today. Today we talk about skills in artificial intelligence. No, I mean come on, you don't want to write your own skills, no? You want your skills to be self-evolving.
So what we need? We need a frozen target model. Let's say we take a GPT 5.5 and we have somehow an even more intelligent model here on top and then we have a new methodology and this new methodology is by Microsoft.
They call this skill opt executive strategy for self-evolving agent skills and they do this in cooperation with Shanghai Jiao Tong University, the Tongji University and the Fudan University published May 22nd, 2026.
So let's have a look. Yes, of course, you have your GitHub repo. A lot of people always ask me, where is exactly the GitHub repo? Well, Microsoft What is the main idea? Now they say, okay, so we have an LLM and we know how to train an LLM. Now that we've frozen the LLM, can we now have a mapping similar ideas how we do the deep learning also now apply to the harness of the AI, no?
And we want to restrict ourselves since it's Microsoft to the text space and not to a higher mathematical complexity space. And you say, okay, so let's do this a simple mapping, no? So the parameters of our LLM become now a skill document.
You know this, no? Skill markdown file, no? The gradient direction is not very interesting in the text space. You do have a trajectory derived edit proposal and when I read this, I could understand nothing, so I will explain this in detail because it took me quite some time to understand what they are talking about. The learning rates they map into a budget restriction. The validation check, they go here for a gate structure and here the momentum average, they go here for Apple wise slow updates and there's a particular reason I'm going to explain in detail.
So let's start. As I showed you, we had a frozen target model and let's say this is our GPT 5.5. This is here if you want the intern the student model executing now some task integrating this here in the harness. And then we have the boss.
We have the optimizer model. Now, this is not a senior engineer is writing here the handbook. And guess what? This is our skill documents file. So, they somehow hysterically work together.
Let's have a look. And here you have the explanation by Microsoft. Now, look at it. Take a look. And if you understand it, great. I understood nothing. I had to do read everything.
So, this is now more or less here the flow diagram of the methodology that you just saw here.
So, data set they go with a classical machine learning training split validation split and test split.
And they have a fixed agent A. This is our frozen agent now. And now we have a rollout.
And we have a rollout batch into multiple mini batches and we have here the optimizer model the secondary uh model looking at this optimizing this and then we have a merge operation then we have a rank and clipping operation and then we have some candidates and then they go in a validation gate and then we have here the best of the best the best skill MD file. Great.
And we have something else to explain later.
So, if you want to have this in text, beautiful, but let's go on.
Now, as you noticed in this video here where we talked about a four-layer harness definitions is everything because people define a harness in different ways. Luckily, imagine we would have a coherent definition. No way. Here a skill is defined in the following way. A natural language policy inserted into the agent context before any execution consistent with the recent work treating skills as reusable procedural knowledge for agents.
Beautiful.
It's a policy. You see, now we are focused here on text-based structures.
So, beautiful. We use M to denote the frozen target model whose behavior is being adapted only through a skill optimization. We do not touch the tensor weights. We don't modify the weight of the LLM. The LLM is frozen. So, we just go here into the harness. We do a skill optimization and we wanted it is a self-learning.
So, luckily there's a little little tiny bit of mathematics also in the text world. So, we have the harness H, the task X, the skill S, and the execution of this produces now here a trajectory and a scalar score.
Beautiful. So, the score is within the interval from 0 to 1 and the trajectory is generated here with the scalar. So, we have now knowledge how good was this idea.
So, the target model actually executes the code. Let's call it in a sandbox, no? So, if it writes a bad Python script, the Python compiler throws now a real error. So, we have a feedback from, let's call it a environment, a sandbox, an error message. And the system now records everything into a tau trajectory. The model stored the exact tool calls, the API response, and all the compiler errors. And now we have everything that we need for the boss agent, for the optimizer, no?
So, as I told you, train split, validation split. Yes, split. Yes, beautiful. It is rather simple. I explain here everything again if you want to have it written down. But what is now the harness? Now we have the forward pass. The forward pass in RL would be or DL would be now in this text space to roll out evidence.
So, as I told you, the harness records everything. The task metadata, the messages, the tool calls, the observation, the command outputs, the final answer, the verify feedback, and the benchmark specific context such as to spread your previews to document references or some compact execution traces. This is the harness, of course.
And now it's interesting. What we have normally in deep learning as a backward pass, now, this is now mapped to a mini batch reflection, as Microsoft calls it, now.
I love the marketing terms of Microsoft.
So, what is now absolutely important to understand in this, and it took me quite some 5 10 minutes to understand this, the optimizer model, the boss model, now, only reads the text traces. This is interesting, that the optimizer has no contact to the external environment.
It could be a limitation, a weakness, but this is how Microsoft built this.
Okay. So, the optimizer model reads only the text traces, the trajectories, where we have also the error log. And guess what? Reads here the mini batch of trajectories as text log, and it sees, example, attempt one, the model wrote to the F local variable compiler said error, the data frame object has no attribute. Look up to the score zero.
Now, the optimizer looks at mini multiple of them, now.
Mini batches, let's say eight of those logs at once. Notices here a systematic pattern. All of those fail here. And deduces the student model keeps using outdated pandas syntax, now.
I need to write a skill rule in my best skill MD file, forcing it to use a more modern syntax. And then we just insert here the memory MD file into the system prompt or that the user prompt.
This is it. This is all that is happening with the optimizer model.
So, since the model since the optimizer model has all the error logs, it hopefully understands all the syntax.
But you see, suddenly, we are not at the LLM with the tensor weights, but we are out in the harness, and suddenly we have to really care about syntax and protocols.
Then the next step is now the really interesting step I would say. This is the bounded text update step.
So the learning rate analog in skill up this here as I showed you already the edit budget.
The maximum number of skill edits, textual skill edits applied at a particular time step T. So after the aggregation, the optimizer model ranks now the merged edit pool by expected utility and clips it to the top three, five, 10 edits.
And I ask my the whole time, "Hey, what is the difference here just from prompt rewriting and prompt optimization and context optimization, huh?"
This is it.
If you want my understanding, this is the key difference from ad hoc prompt rewriting, yeah? Why we need to this?
Because here we do not limit ourselves to one rewrite or one override.
But here we have the top 10 edits. So we have, if you want here uh a spectrum, a multitude of possible ways forward.
And now the selected edits produce now total a candidate skill that is now necessary for this particular domain, for this particular task.
So you see, this is the secret sauce.
Instead of letting the optimizer rewrite the whole prompt, skill up restricts here the update to a particular limited budget of atomic edits, of little tiny edits only. You can append, you can replace certain structures, or you can delete, but not everything at total, yeah?
We have here clear hierarchical structure. And by default, we use a cosine decay schedule here for this budget. So starting with larger structural edits, decaying time by time into finer-grained tweaks.
So this is interesting. You want that just not a simple one-time override written prompt, no?
You say, "Hey, this is whatever we found out. This is important. This is a valuable piece of information. And now, with a budget of, I don't know, five atomic edits, I can append, replace, or delete here for a particular task from a mini batch here further knowledge elements." And now we understand what is happening here, no?
We have here our optimizer model, no contact with the external environment.
Add, delete, replace. We have emerging ranking clipping. Then we do have a validation with a sandbox, let's say.
And then we have our best best skill MD file. Great.
Now, let's say, "What the hell is here the lower part of this diagram, no?"
The authors argue that fast updates learn from the current batch, no?
But, they also say, "You know what?
Those local steps that we do fast can miss if we have a systemic drift system." So, therefore, they have to have a counterweight, no? And they go now here for an epoch-wise slow update.
And they say this update learns from adjacent epochs. So, it's not here quick quick fast update, no. This is after epoch we have a look at the previous epoch skills and the current epoch skill. We look and AI looks at improvement, at persistent failures, at stable successes. It writes here textual and epoch-wise reflection. This becomes now an optimizer meta skills. And this updates now here, if you want, the domain-specific knowledge here that can detect the systemic drift immediately here into our known procedure.
And at the end of the epoch a skill upsamples your training under the previous epoch skills and the current skills grouped in together.
Improvements, regression, persistent failures, and stable success. We just had a look at this.
So, benchmark is not a real interesting part. How well does this thing work?
Now, the authors Microsoft decided here to select some particular benchmark. I don't know why, but yeah, of course they went here for the biggest effect of their methodology. So, absolute fair, everybody more or less does the same.
So, let's have a look.
We have now our frozen student model.
Oh, this is a GPT 5.5. Not really a stupid model, but okay.
And we start with no harness at all. We have just a chatbot, no? There's nothing Claude, code, or Codex, nothing, no?
So, when we have now a search Q&A, a spreadsheet, an office Q&A, and doc, whatever, life mathematic on Alf World, and now let's have a look.
And you see here the first line is, if we have no skill at all, what is the naked GPT 5.5 performance here? Let's say 77.7%.
Great.
The human skill would be 81.
The LLM skill would be 80.9.
Text grad would be 81.4. Gopher, 84. I have videos on both of them. And now the new skill up, look at this. It is even better than 84.8. It is now 87.3 for this particular question and answer benchmark, no?
So, you go there. Great. Spreadsheet, multi-run, code generation with up to 30 turns and real panda runtime here. Default mode is multi. If you go for the office Q&A, you have here 20 up to 24 tool calls. This is nice. And Alf World I had in my last video here, up to 50 steps per episode for embedded interaction where we have an embedded AI, like playing Alf World, the game.
So, this is here no harness. This means no external tools, no code execution, no persistent file system in the background. You ask a question, the skill document is prepended here to the system prompt, and the model spits out the answer. This is it.
Great. So, you see according or in comparison here to GaPa, yes, we do have a little bit of 2. 5 percentage point improvement.
What about QN 3.6? We go for 35 billion pre-trainable parameter mixture of expert model with an active 3 billion stack, and you see here, okay, great.
More or less, I would say, similar data, but of course, it is not as powerful as GPT-5.5. So, instead of skill up 87.3, we have now 80.3. Great, and you see for the rest of this.
Now, of course, you might say, "Hey, wait a minute, this is just a frozen student model. What is the optimizer model here, now? What is the boss model that corrects all of this, looks at all the mini batches, now? Corrects all the traces, provides correct solutions that are then selected, ranked, and implemented into the best skill MD file?
And guess what the optimizer model is?
A GPT-5.5.
So, this is an interesting situation that we have. For example, here a student, frozen, and a GPT-5.5 as a big boss teacher model, active, looking here at the trajectories without any external contact.
So, they went with one of of the best AI models on this planet, and you might say, "I wonder why."
Okay, this is no harness. Great. Now, you know what's coming up, now?
Now, we have the Codex harness by OpenAI, and we have the Claude code harness by Anthropic. And we do more or less the same for GPT-5.5.
Now, have a look at this, you see search Q&A 87.3, skill up 85.9.
So, if you say, "Okay, without a harness, I go here." The search Q&A 87.3.
And now, with the harness, I have 87.3 and 85.9.
You decide how beautiful this harness structure of Codex and Claude code or in this Microsoft implementation.
Where you might say, "Hmm, is it possible that this Microsoft implementation has some limitations?"
More about this in 5 minutes.
So, here we have it again now in publication here from the authors, and they have here the main results. You see here the GPT 5.5 the harness as a direct check. Here, the improvement here plus 9.6%.
So, yeah, there is an improvement. But, if you go for GPT 5.5 Codex, you see the improvement is only 5.5% and the GPT 5.5 on a Claude code harness, the improvement is only 4%. So, okay, these are the real improvement steps if you go with this new skill up methodology, and they focus here primarily the authors on GPT and the Q1 models, the Q1 3.5 and the Q1 3.6 mixture of expert. Great.
Now, for the methodology comparison, this is now interesting, yeah? They have here for the different benchmark here, they show you here in, my goodness, pink text grad, in orange keeper, and in green the new skill up, yeah? And of course, it is green outperforming everything else because, yeah, we have found the right benchmark test.
Congratulations. Now, honestly, interesting to see, but there are open questions, yeah?
Yeah, if you want to see Gaipa or the newer than Gaipa and was upgraded to Wister.
Or if you have here video 37 minutes explaining Gaipa in detail. It was here a genetic AI by MIT and UC Berkeley. You have the videos on my channel.
What I found now interesting and this is now my personal reflection on this study. I was interested those correction, those new best skill MDs.
What is in there? What really was modified here on a skill level? After performing a whole this complete new methodology, what is it?
And beautifully the authors absolute transparent give us this information, and I assumed here this massive gains acquire here, I don't know, 10,000, 20,000 token, yeah.
It's really detailed instruction, information, knowledge how to do this, and then you look at it and you see the edits here for the office Q&A is exactly one.
One edit, one instruction.
And the token length here, it was officially 145 token.
And now it is 883 token length, the best skill MD file.
So, you might say, "What? This is it?
This is the improvement?
Maybe this is only one or two sentences, no?
Is this all that we found? And here again you see here the GPT 5.5 and the GPT 5.5 student teacher run. So, this is really if you want one of the best available AI models currently on the planet. So, this is, let's call it, the best result you can have here with this methodology.
Okay, but you see also by other one, life mathematics, exactly one edit.
Or by Outworld, two edits. Or by documents, three edits, no? and you think, "Hmm, all of this, what is not inside?" So, let's have a look, and before we have this look, I wanted just to show you, the authors want to stress here, this is a skill MD file, so this is a file that you can transfer across the models, across harnesses, and it is a simple one-file deployment.
And if you go here from different harnesses here, let's say from a Codex trained to a Cloud Code, you can really bring over here the benefits, because it seems that these best skill MD files has such a unique new insight that Codex and Cloud Code both profit from this.
So, it is improving your result.
And at this time I ask myself, this is my reflection, so is it just a harness syntax optimization, now?
Since the human model is frozen to harness can only optimize its own harness elements, memory skill prompts, and the way data are presented now to the frozen LLM. This is now the filter, the lens, now, how the data are presented to your frozen core LLM. This is the harness, and is it just the syntax, or is it more?
Now, it turns out the answer is no, it is not just the syntax, it is more, because the optimizer agent, wisely chosen by the authors, a GPT-5 {point} 5 is not limited to syntax.
The recommendations by GPT-5 {point} 5 also include, apart from syntax, also procedural knowledge. And this is what you expect in a skill MD file, now.
So, in this situation where we have a student GPT-5 {point} 5 and a a teacher GPT-5 {point} 5, and one of them have real-world validation with the validation gate as Microsoft defined it, we have here the contact.
And I will give you here the insight.
This is now what was found. This is the result of this text gradient optimization.
So, let's go here for the search Q&A.
This is here the most important edit.
And it is inferred the expected answer type from the clue wording, then choose the shortest canonical entity supported by co-occurring distinct evidence.
This gives you the improvement or office.
Treat article past pages as primary evidence.
Lock table date unit contents and output exactly the requested rounded value without extra labels.
This is what provides the improvement, the massive improvement you have seen.
Mhm. So, yeah, okay.
I don't know how you feel about it. I expected from this Microsoft methodology some deeper learning, some deeper insight, no? Some more complex, I don't want to say waterfall model, but more complex structural insight, yeah?
But, they give us here figure four in the paper here, beautifully exactly this is it.
The learned rules extracted here from the final best kill MD file of this configuration GPT-5.5 and GPT-5.5.
Beautifully transparent, exactly how a paper should be. Congratulations to the authors.
Just to be clear, no? The optimizer, our GPT-5.5 teacher, never sees the final validation error. The optimizer model is only allowed to look at the training split data. As I told you, it reads here the batch of eight training trajectory, only on textual level, cannot validate. It has no context to the real world.
Then groups here the text the failures together since it has the error log from the Python compiler and computes now based on this information what Microsoft calls a textual gradient.
Now, between you and me, the textual gradient is just a proposed edit here to the string, nah?
But, okay, let's call it a textual gradient. So, whenever you encounter a textual gradient, you know exactly what it is. Throughout from a group, you have a training trajectory, analyze the failure, find common patterns, and compute here the proposed correction.
Yeah, edit to this.
If anybody on this world made it now to this deep in the video, I have a bonus for you because the very same data has a second study.
And this you should also read. May 22nd, 2026. This is now here more or less by the same authors, Fudan University, Microsoft, Shanghai Jiao Tong University. But, they go now a systematic study of model-generated genetic skills.
And they can show you here the development here in this study from a raw experience to the skill consumption. And this is highly interesting. It is a little bit more on the theoretical side to understand the generic schemes.
But, I highly recommend this study. And the authors here ask whether such skill actually work, when they work, and what makes them succeed, or what makes them fail. So, some This is some theoretical insight. So, highly recommend you read this study here as your second study after you finish data first study, nah?
And of course, you have here also on Microsoft Skill Lens. Here your GitHub repo just 4 hours ago the update.
Beautiful.
And [snorts] they go now and they evaluate the full trajectory to skill life cycle across three simple stages.
You have to experience generation here in stage one, then you have the skill extraction, and you have the skill consumption.
And they have some beautiful insights.
So, if you are a little bit on the theoretical side and you want to have a little bit of a deep dive into the channel topic, I highly recommend the second paper.
And okay, I know there's no human left, but theoretically, if an alien sees this, I have now a second bonus here in this video because just yesterday I published the unfortunately this for members only.
But, if you are interested in this topic, this is a video where I showed you the AI harness optimization by AI, not limited to skills.
And I showed you here the complete process. Also, we go with an optimizer agent that now completely optimizes here the complete harness. So, we have everything from a prompt, from a tool, from a library, from a workflow optimization integrated in a higher complexity.
So, if you want this video here at the beginning was just for a simple Microsoft skill dimension, but if you really want to go for the heavy stuff, you know, I highly recommend this video to you. I hope you had a little bit of fun, you had a little bit of joy, some new data, some new insights, you're interested in AI.
It would be great to see you in my next video.
Vidéos Similaires
OpenHuman VS Hermes AI: Who Wins?
JulianGoldieSEO
285 views•2026-05-29
Long-Running Agents — Build an Agent That Never Forgets with Google ADK
suryakunju
142 views•2026-05-30
This computer is made from real human brain cells. And you can buy it.
Talktmsmedia
3K views•2026-05-28
BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2
aimmediahouse
122 views•2026-06-03
I Made the Same Anime Fight Scene in Every AI Video Generator
NobleGooseAnime
295 views•2026-05-30
Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S
cnnnews18
3K views•2026-06-01
I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)
AICodingDaily
298 views•2026-05-29
3D Platformer Update - NO CAPES
SolarLune
294 views•2026-05-30











