SkillRAE (Skill-Based Context Compilation for Retrieval Augmented Execution) is a new approach that addresses a fundamental limitation of standard RAG systems. It explicitly rescues and grafts boundary conditions (subunits) into a logically bound, low-token payload, which bypasses the LLM's stateless amnesia and gives it a fully resolved blueprint for execution. The alternative, simply feeding uncompiled, isolated tools into the context window, forces the LLM to perform dependency resolution on the fly, a task it fails at with high probability.
RAG for SKILLS: Retrieval Augmented Execution (SkillRAE)
Hello community. So great that you are back. Today, let's talk about something crazy. We now need a RAG system for skills. Yes, you heard me right, because up until now the standard AI engineering assumption was: if you simply feed an AI the correct tool repo, it has enough reasoning power to figure out the execution. And now we have a brand new study that tells us, well, this is mathematically naive. So, let's start, and we just continue where we left off with my last video. We were talking about quantizing AI into executable skills as mathematical operators, and you already saw that we went through an action space transition, from a continuous distribution to a discrete Markov decision process with a finite base set of skills. And now we continue. Especially if you work with OpenClaw autonomous AI systems, those LLM-based agents increasingly rely on skill libraries, hopefully reusable skill libraries, to solve artifact-rich tasks such as document-centric workflows or data-intensive analysis, whatever you have in your particular domain. Now, there is a claim, and I could not verify it, that about 4,700 new skills are added per day to the global skill repos, which is crazy. We are here in mid-May 2026.
So, we now need a RAG system for skills. And guess what? We have the first publication, and it is called RAE.
So, you know SkillsBench, our classical benchmark for how well agent skills work across diverse tasks. This is the study, version three from March 2026.
We saw, with I think 86 tasks across 11 domains paired with curated skills, how well our systems perform. And you have everything from a Haiku 4.7 to an Opus 4.6 and Gemini 3, and in the first block you see the results without the skills.
This is not prominent. Then with the skills, you see the dominant performance.
And then with self-generated skills, we are almost back to the no-skills level, which is quite consistent behavior across all the different models. So, self-generated skills: you know exactly what to think about this.
Now, let's come back to standard RAG. It's like asking an LLM to answer your question from a PDF: you need to retrieve new, updated information that is not available in the model's parametric knowledge, no? So, you retrieve static information. And for a physicist, standard RAG is like looking up a fundamental constant in a table, whatever you have, yeah? The mass of an electron.
So, you retrieve text, you concatenate it to the prompt of the LLM, and the LLM reads it, but of course the information is passive. Now, in this new SkillRAE, you're retrieving operators. This means executable functions, code blocks, API calls. So, now we are somewhere completely different. We need a different mathematical space for this, but luckily in my last video, we already prepared for this.
So, you're not just retrieving a particular physical constant or a parameter, no? You are now actually retrieving differential operators, or if you would like to see it through the glasses of theoretical physics, you retrieve a kind of Hamiltonian of a system.
Now, SkillRAE is essentially a retrieval augmented generation system, but mathematically projected out of the space of passive, factual, textual knowledge into the space of active procedural operators.
And this is the reason why we call it not RAG but retrieval augmented execution: we have executable operators in front of us, and therefore it is called RAE.
Great.
So, we have a new study from the Chinese University of Hong Kong, Shenzhen: SkillRAE, agent skill-based context compilation for retrieval augmented execution, May 11th, 2026.
Now, as I told you with SkillsBench: what is a typical agent skill? How is it presented? We have some natural language description together with some optional linked files, you know: a SKILL.md file, or assets like a BibTeX template, or references like a search .md file, or scripts like a Google Scholar search Python file. These are the classical agent skills that we are talking about.
If you're not really familiar, I have quite a lot of videos on skills and multi-skill .md files: why a skill is not enough. And one of my last videos was on a superintelligent RAG system, a classical RAG system for meta.
And yeah, AI eigenvectors as the human-defined skills: we already built those mathematical spaces, but now we go a step further.
In the classical RAG system, the system operates like a bulk scoop. It calculates a global interaction potential, you know, our famous cosine similarity, between the task and the available skills. Let's picture those vectors as macromolecules, you know, three-dimensional objects. And then it dumps the top 10 heaviest, most relevant molecules, vectors of course, into the prompt.
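To make the "bulk scoop" concrete, here is a minimal sketch of classical top-k retrieval by cosine similarity. The skill names and the 2-d toy embeddings are made up for illustration; a real system would use embeddings from a model and k = 10 as in the talk.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_skills(task_vec, skill_vecs, k=10):
    """Return the names of the k skills most similar to the task."""
    ranked = sorted(
        skill_vecs,
        key=lambda name: cosine(task_vec, skill_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy embeddings, not taken from the paper.
skills = {
    "pdf_report": [1.0, 0.0],
    "data_cleaning": [0.0, 1.0],
    "chart_export": [0.7, 0.7],
}
selected = top_k_skills([1.0, 0.1], skills, k=2)
print(selected)
```

Everything outside the top k is dropped as a whole skill, which is exactly why a useful detail sitting inside an unselected skill never reaches the prompt.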
The problem is that the agent now needs to synthesize a specific reaction.
It might only require a tiny functional group on the surface of a complex macromolecule, maybe only a subunit located on a macromolecule that was not scooped up.
So, we have a problem. We are missing some functions. We are missing some operators. We are missing skills. We are missing the ability to perform the job.
So, if this was the picture in RAG, now guess what? We build a graph. Of course, what else can we do in AI, no?
So, in the offline stage we have a skill repo, I don't know, 10,000 skills or whatever you prefer, a new domain or whatever you have as a standard repo, and then we build a skill graph. This is offline.
So, we have, of course, skill communities.
So, we have different levels of correlation.
Then we have the pure skills that you know, and then we have all the subunits of the skills, and this is now the interesting part. We don't go with the prefabricated SKILL.md files. We analyze them, we see there's a lot of redundancy, there's a lot of nonsense, there are a lot of things we cannot verify in the code, and therefore, yeah, we have to clean up our skill repo.
Now, the authors treat the skill library as a multi-skill interacting system, and their system acts as, if you want, an intelligent compiler, if you see it in computer science terms, or an enzyme, if we go to biotech.
So, they map the state space at three levels. As I told you: the communities, which are the bulk clusters; the skills, which would be our macromolecules; and then the subunits. If we go with the biotech image, these are the functional groups on the surface of the macromolecules, and they are just sub-skills or subunits.
Now, for building the graph they select the best macro skills, but then they also cleave highly relevant orphan subunits from unselected skills and graft them onto the selected ones.
So, this is interesting. You say, "I do not go with a predefined skill structure or with some predefined description of what to do," because they found out that the quality of the skills that you find on the internet can vary.
So, we now build the multi-level skill graph offline from all the source SKILL.md files that you have access to or use, or that you build up yourself, no?
So, the graph builder now extracts the normalized procedural elements and constraint subunits from each skill, deduplicates them globally, since we don't need duplicates, and connects each retained subunit back to its source skills. So, we build a graph.
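The deduplicate-and-link step just described can be sketched in a few lines. This is only an illustration under assumptions: the subunits are assumed to be already extracted as strings, and `normalize` is a simple stand-in (lowercasing and whitespace collapsing), not the normalization the authors use. All names here are hypothetical.

```python
from collections import defaultdict

def normalize(text):
    """Stand-in normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def build_subunit_graph(skills):
    """skills: dict of skill name -> list of raw subunit strings.

    Returns (subunit_to_skills, skill_to_subunits) over globally
    deduplicated subunits, so each subunit links back to every source skill.
    """
    subunit_to_skills = defaultdict(set)
    skill_to_subunits = defaultdict(set)
    for skill, raw_units in skills.items():
        for raw in raw_units:
            unit = normalize(raw)
            if not unit:
                continue  # skip empty fragments
            subunit_to_skills[unit].add(skill)
            skill_to_subunits[skill].add(unit)
    return dict(subunit_to_skills), dict(skill_to_subunits)

# Hypothetical two-skill repo for illustration.
repo = {
    "pdf_report": ["Import os", "Render the table to PDF"],
    "csv_loader": ["import  os", "Validate column types before loading"],
}
u2s, s2u = build_subunit_graph(repo)
print(sorted(u2s["import os"]))
```

The two spellings of the import line collapse into one subunit node that points back to both skills, which is the property the retrieval step relies on later.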
And then we embed the subunits, and you're not going to believe how we do it: with the old method of a sentence transformer embedding, our SBERT system.
I guess I have more than 50 videos at the very beginning of my channel on SBERT. So, we are back to this methodology, and the skill descriptions are of course embedded with the same model at retrieval time.
And the skill communities are constructed by embedding the skill representations and then applying a K-means clustering algorithm. So, everything that you know, that you are familiar with.
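As a reminder of how the community step works, here is a plain-Python sketch of K-means (Lloyd's algorithm) over skill embeddings. The embeddings are toy 2-d vectors rather than sentence-transformer outputs, and the deterministic "first k points" initialization is my simplification; the number of clusters and the initialization the authors use are not stated in the talk.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; centers start at the first k points."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers

# Hypothetical skill embeddings for illustration.
skill_names = ["pdf_report", "csv_loader", "pdf_merge", "csv_stats"]
skill_vecs = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.0], [0.0, 0.9]]
labels, _ = kmeans(skill_vecs, k=2)
community_of = dict(zip(skill_names, labels))
print(community_of)
```

Each skill ends up with exactly one community label, which gives the deterministic skill-to-community mapping the graph needs.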
So, offline we have built our graph, and now online we have our task request from a human, that's me.
And now something is happening. We have to build, of course, the manifold, and then on this manifold we have a top-down retrieval and a bottom-up retrieval.
Now, just to make it very simple, and I will give you the mathematical definition in a moment: the top-down retrieval is simple, no?
You just go in a classical way and you say, "Okay, I select the skills that have, I don't know, the keyword quantum in them."
But we also have a bottom-up retrieval that goes over the graph and finds that there's a specific subunit, U3, on a skill that was not selected by the top-down approach. So, we also go down to the subunits, analyze them, and say, "Wait a minute, this subunit of an unselected skill would be highly helpful for the task." And we bring it up. So, this is the bottom-up retrieval, starting from this particular subunit of a skill, or sub-skill.
Then, yeah, we form a new graph. We have all the relevant skills and the subunits, beautiful.
And then, yeah, we just go and do our exercise.
So, let's have a look, because we have to code this. So, what is the mathematics? How can we materialize this?
So, offline we have the manifold construction. Our skill repo is structured as a multi-level bipartite graph.
The graph simply consists of some elements. So, let's have a look. The first one is the skill communities, the high-level, execution-invariant groupings.
Then, with S we represent the skill nodes themselves. Those are the executable macros, if you want. U represents the subunit nodes, the localized procedural instructions.
Then, of course, we have to have our edges in the graph. Those are the edges mapping the macroscopic skills to the microscopic subunits, very important. And then, finally, we have to have a deterministic mapping of the skills to their parent community.
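If you want to hold this structure in code, a small container is enough. The sketch below is my own layout, not the authors' data structure: communities, skills, and subunits as sets, the skill-to-subunit edges as pairs, and the skill-to-community mapping as a dict. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SkillGraph:
    communities: set = field(default_factory=set)     # community nodes
    skills: set = field(default_factory=set)          # S: skill nodes
    subunits: set = field(default_factory=set)        # U: subunit nodes
    edges: set = field(default_factory=set)           # (skill, subunit) pairs
    community_of: dict = field(default_factory=dict)  # skill -> parent community

    def add_skill(self, skill, community, units):
        """Register a skill, its parent community, and its subunits."""
        self.communities.add(community)
        self.skills.add(skill)
        self.community_of[skill] = community
        for u in units:
            self.subunits.add(u)
            self.edges.add((skill, u))

    def degree(self, unit):
        """Number of skills that contain this subunit."""
        return sum(1 for _, u in self.edges if u == unit)

    def subunits_of(self, skill):
        """All subunits attached to one skill."""
        return {u for s, u in self.edges if s == skill}

g = SkillGraph()
g.add_skill("pdf_report", "documents", ["import os", "render table to pdf"])
g.add_skill("csv_loader", "data", ["import os", "validate columns"])
print(g.degree("import os"))
```

The `degree` of a subunit is what the dampening factor in the bottom-up signal needs, so it is worth having on the graph object from the start.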
So, with this graph structure, we can already start.
And now, as I told you, we have a dual signal retrieval, a bottom-up and a top-down.
So, given a task query, the system calculates an effective Hamiltonian for the skill selection using, and you guess what, a superposition, of course, what else, of our top-down (macro) and bottom-up (micro) signals.
Now, for the bottom-up signal, we project relevance from the subunit space back to the skill space. This is a simple projection that you are familiar with. And if you want to know, sigma is our embedding similarity and 1/deg is a kind of dampening factor.
Why? Think of it as something like an inverse frequency penalty. It ensures that highly generic hub subunits, like the standard import os lines, do not become a dominant element: since they are shared by many skills, they should not arbitrarily inflate the energy score of a specific skill. This is it.
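The formula itself is only shown on screen, so what follows is one plausible reading of the spoken description, not the paper's exact expression. Assume $q$ is the task query, $E$ the set of skill-subunit edges, $\sigma(q,u)$ the embedding similarity, and $\deg(u)$ the number of skills that share subunit $u$. The symbol $E_{\mathrm{bu}}$ and the choice of a sum (rather than, say, a max) are my assumptions.

```latex
E_{\mathrm{bu}}(s \mid q) \;=\; \sum_{u \,:\, (s,u) \in E} \frac{\sigma(q,u)}{\deg(u)},
\qquad
\deg(u) \;=\; \bigl|\{\, s' : (s',u) \in E \,\}\bigr|
```

A quick number check shows the dampening at work: a hub subunit with $\sigma = 0.9$ that is shared by 100 skills contributes $0.9 / 100 = 0.009$ to each of them, while a specific subunit with $\sigma = 0.6$ that lives in a single skill contributes the full $0.6$.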
So, the final selection energy is given in this formula. And here we have, of course, our top-down community masking. If a skill does not belong to the highest-scoring bulk cluster, it receives a severe structural penalty. So, this is a very simple idea. You can make it much sharper, much more complex, but if you want to start simple, beautiful, here we have it: top-down retrieval and bottom-up retrieval, explained. And now we can build our graph with the subunits that are not in the selected skills at all, but where the system decided, "Hey, this is absolutely fascinating." And I'll show you the LLM models in a second.
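Since the formula is only on the slide, here is a hedged sketch of how such a selection score could be coded: a weighted mix of the top-down similarity and a degree-dampened bottom-up sum, minus a penalty for skills outside the best community. The mixing weight `alpha`, the `penalty` value, the linear form, and the convention that a higher score is better are all my assumptions, not values from the paper.

```python
def selection_scores(top_down, subunit_sim, edges, community_of,
                     best_community, alpha=0.5, penalty=1.0):
    """top_down: skill -> similarity(task, skill description)
    subunit_sim: subunit -> similarity(task, subunit)
    edges: set of (skill, subunit) pairs
    community_of: skill -> community label
    """
    # Degree of a subunit = number of skills sharing it.
    degree = {}
    for _, u in edges:
        degree[u] = degree.get(u, 0) + 1

    # Bottom-up signal: project subunit relevance back onto skills,
    # dampened by 1/deg so hub subunits do not dominate.
    bottom_up = {s: 0.0 for s in top_down}
    for s, u in edges:
        if s in bottom_up:
            bottom_up[s] += subunit_sim.get(u, 0.0) / degree[u]

    scores = {}
    for s in top_down:
        score = alpha * top_down[s] + (1 - alpha) * bottom_up[s]
        if community_of[s] != best_community:
            score -= penalty  # community masking as a structural penalty
        scores[s] = score
    return scores

# Hypothetical toy inputs for illustration.
top_down = {"pdf_report": 0.8, "csv_loader": 0.3, "pdf_merge": 0.6}
subunit_sim = {"import os": 0.9, "render table to pdf": 0.7, "validate columns": 0.2}
edges = {
    ("pdf_report", "import os"), ("csv_loader", "import os"),
    ("pdf_merge", "import os"), ("pdf_report", "render table to pdf"),
    ("csv_loader", "validate columns"),
}
community_of = {"pdf_report": 0, "pdf_merge": 0, "csv_loader": 1}
scores = selection_scores(top_down, subunit_sim, edges, community_of, best_community=0)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy run the shared import line adds the same small amount to every skill, so the ranking is decided by the specific subunits and by the community penalty.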
But then, as I told you at the beginning, we need more than the classical RAG.
It's not enough just to bring everything back into the prompt, no. We need a context compilation if we talk about operators, executable operators.
So, let's do this.
If you want, this is the core algorithmic novelty of this beautiful paper.
So, let K be the top selected skills.
Suppose, however, that an unselected skill contains an incredibly useful subunit U.
And now the compiler, if you want, rescues our subunit U and computes a deterministic local binding affinity to graft it onto the most compatible selected skill, S star.
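The rescue-and-graft step can be sketched as follows. The paper's actual binding affinity is defined in its appendix and is not reproduced here; as a stand-in I use Jaccard token overlap between the rescued subunit and the subunits a selected skill already has. The relevance `threshold` and all function names are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens of a subunit."""
    return set(text.lower().split())

def affinity(unit, skill_units):
    """Stand-in binding affinity: best Jaccard overlap with the skill's subunits."""
    best = 0.0
    for other in skill_units:
        a, b = tokens(unit), tokens(other)
        union = a | b
        if union:
            best = max(best, len(a & b) / len(union))
    return best

def graft(selected, unselected, subunit_sim, threshold=0.5):
    """selected / unselected: skill -> list of subunits.

    Rescues relevant subunits from unselected skills and attaches each one
    to the selected skill with the highest affinity. Returns a compiled copy.
    """
    compiled = {s: list(units) for s, units in selected.items()}
    already = {u for units in selected.values() for u in units}
    for skill in sorted(unselected):
        for unit in unselected[skill]:
            if unit in already or subunit_sim.get(unit, 0.0) < threshold:
                continue  # duplicate or not relevant enough to rescue
            # sorted() keeps tie-breaking deterministic.
            target = max(sorted(compiled), key=lambda s: affinity(unit, selected[s]))
            compiled[target].append(unit)
            already.add(unit)
    return compiled

# Hypothetical toy inputs for illustration.
selected = {"pdf_report": ["render table to pdf"], "csv_loader": ["load csv file"]}
unselected = {"csv_validator": ["validate csv column types", "print a banner"]}
subunit_sim = {"validate csv column types": 0.8, "print a banner": 0.1}
compiled = graft(selected, unselected, subunit_sim)
print(compiled["csv_loader"])
```

The point of the design is that the prompt receives a few complete skills with the rescued details already attached where they belong, instead of a loose pile of fragments the LLM has to wire together itself.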
So, you see we go with a binding affinity, which you maybe know from biotechnology.
And of course, you might say, "Yeah, that's a great idea, of course, but if I have no mathematical expressions, no formulas, I cannot code it. So, what is the mathematics behind it?"
This screenshot explains everything beautifully. This is how you attach subunits to skills.
They have this simple formula for their idea, and this is simply what they are optimizing in this mathematical representation. Have a look at it. It is not complex. You have all the details explained in appendix A2 of the paper.
It's not really the most important thing, so let's continue. Yeah, I got a lot of questions on my last video about the compute infrastructure that you would need. Here I've given you what the authors stated they work with: a Linux server running Ubuntu.
Yeah, then we have 56 physical cores, 112 logical threads, and 1 TB of RAM. They run in Docker, and they have eight Nvidia RTX A5000 GPUs.
Great. So, it is not overkill at all.
This is great.
But let's now have a look at the performance. What is the result?
And this is now the overall downstream performance. So, we have our SkillsBench and also our AgentSkillOS, so two benchmarks if you want. And let's see how it compares to all the different methods that we have: AgentSkillOS, vanilla retrieval, or SkillRouter, one of the best up until now. And now you see in the very last line that SkillRAE, the method that you have had a look at in this video, is of course outperforming all the other methodologies.
So, this is it. We build a skill graph from all the available skills and SKILL.md files. We really check in detail all the dependencies that we have, and we make sure that this is not just some stuff we put into the prompt and send to the LLM, where the LLM simply has to fail because it is not compatible and not operational.
They worked with two different combinations, and I want to show you this. They have the Codex CLI with GPT 5.2 and the Gemini CLI with Gemini 3 Flash. So, neither is the flagship model, 5.5 or 3.1, but you see that in general the full version of SkillRAE reaches 29.26 with GPT and 28.85 with Gemini. So, I would say, "Okay, it is comparable." And then for the ablation, you see the results without the bottom-up retrieval, without the top-down retrieval, and without the context compilation, and you see the importance of each part: how missing one of them would change the performance of the overall system.
And we are already at the summary. Great.
So, standard AI engineering, as we started this video, assumed that if you simply feed an LLM the correct tool repository, the LLM has enough reasoning power, enough intelligence, to figure out the execution.
SkillRAE proved that this is mathematically naive. The LLM is essentially just a raw CPU.
If you deliver some uncompiled, isolated tools into its context window, you just force the LLM to do dependency resolution on the fly, at which it will fail at a very high rate.
So, the new insight is: retrieval is not enough. The classical RAG is not enough if you go for operators.
Because now execution requires a compilation.
And now, I'll put this in parentheses, of course you can argue, "Hey, wait a minute. This execution, is it not happening in the harness of our agent?"
It is not in the core of the agent LLM, no? I mean, we could move it there, but we just found out this doesn't make sense. So, suddenly, we need compilation and execution in the harness of any AI agent. So, I would say the intelligence of this AI harness will increase significantly.
So, maybe you place multiple AIs in the harness as well. Then you have the coordination problem, and you know all this stuff.
Anyway, let's stick to our topic. By explicitly rescuing and grafting the boundary conditions, meaning the subunit details that are so important, into a logically bound low-token payload, we bypass the LLM's stateless amnesia and give it a fully resolved blueprint.
Isn't this beautiful?
Hope to see you in my next video.