This research exposes the statistical insignificance of joint prompt optimization, effectively debunking the hype surrounding complex multi-agent tuning. It rightly shifts the focus from trial-and-error instruction sets to the structural engineering of input harnesses that unlock latent model capabilities.
Deep Dive
Prompt Engineering is Dead: Proof Your Agents Aren't Talking
Hello community. So great that you are back. Today we're going crazy: we're going to analyze a three-dimensional score tensor, and we will find this very singular case where the system just provides a beautiful solution, and then we will build a pipeline flow exactly to utilize this solution. We will work with dual RAG manifolds, but we will achieve something that is really, really interesting, and you can immediately apply it. So, compound AI systems: you know, we have multiple LLM calls, and each agent handles specialized subtasks here.
This is how we work, now.
But have you ever thought about how we should now optimize the individual prompts in all these agents and their sub-agents?
What about prompt optimization here as a holistic view? When we have dependencies from prompt A to prompt B to prompt C, how are the agents interconnected? What about their cross-referencing?
How about this score matrix here, if we look at the variance?
Now, you know, in recent work like TextGrad or DSPy, we compile an LLM program with end-to-end optimization.
Everything is real complex, everything is real crazy, everything is real non-linear.
If you ever doubt my words, you just put TextGrad in the search and you get the older videos from one year ago, where we had a look at DSPy, better DSPy, TextGrad, beyond TextGrad and DSPy, or maybe you just enjoyed DSPy with compiling Lean 4 and the graph optimizer in PyG, whatever.
First study.
They tell us, wait, let's examine this in detail. Before you start an optimization process that will cost you $1,000 to $10,000 just in compute cost, let's analyze it.
And I have to tell you, this is a controversial study, because they tell us that prompt optimization in those compound AI systems, those multi-agent systems, is just a coin flip.
But there's one case where it really works. So, they diagnose when it helps in compound AI systems.
And this is such a beautiful study published April 16, 2026 here.
And they tell us, we studied a two-agent pipeline. Simplest case: agent A, agent B. Agent A processes the input and does something; agent B produces the final output using agent A's internal response.
Also does a little bit. Great.
They say, okay, we now generate 10 diverse candidate system prompts per agent, with varying strategy, varying tone, varying structure of the prompt, and exhaustively evaluate all 100 prompt combinations on n = 30 benchmark samples.
So, thousands and thousands of runs.
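A minimal sketch of what such an exhaustive grid looks like in code. The `score` function is a hypothetical stand-in for running agent A, then agent B, and asking the judge model for a score; the random scoring and all names are assumptions for illustration, not the authors' code.

```python
import random

N_PROMPTS_A = 10   # candidate system prompts for agent A
N_PROMPTS_B = 10   # candidate system prompts for agent B
N_SAMPLES = 30     # benchmark samples per prompt pair


def score(prompt_a: int, prompt_b: int, sample: int, rng: random.Random) -> float:
    """Hypothetical stand-in: run A then B on one sample and return the judge's score."""
    return rng.random()


def build_score_tensor(seed: int = 0) -> list[list[list[float]]]:
    """Evaluate every (prompt A, prompt B, sample) triple; result is tensor[a][b][s]."""
    rng = random.Random(seed)
    return [
        [[score(a, b, s, rng) for s in range(N_SAMPLES)] for b in range(N_PROMPTS_B)]
        for a in range(N_PROMPTS_A)
    ]


tensor = build_score_tensor()
n_evaluations = N_PROMPTS_A * N_PROMPTS_B * N_SAMPLES
print(n_evaluations)  # 3000 pipeline runs for one benchmark and one model
```

So one benchmark on one model already means 3,000 pipeline runs plus 3,000 judge calls, which is where the "thousands and thousands" comes from.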
They use Claude Haiku 4.5, which is kind of a mid-tier AI model; Amazon Nova Lite, which is the budget model here; and Sonnet 4.6 is our LLM-as-a-judge.
Beautiful.
Now, I give you the result.
They tell us the interaction between the agents is never significant.
If you look here at our three-dimensional score tensor, they tell us that the A × B interaction term is almost always non-significant.
And if you really want to have the absolute number, it is about 0.18% to 2.15% of the total variance.
So, more or less, forget about it.
Here you have it in a graph, and here you have the different benchmarks with Haiku and then with Nova.
And they tell us that what we are interested in is the A × B interaction, and you see it's 0.2%, 0.5%, 2.1%, 0.5%. You get it.
The optimal prompt for agent B is independent of the prompt for agent A.
The joint optimization, and this is the main message of this study, is mathematically unjustifiable in these architectures. Independent search reaches the same optima at a much lower compute budget.
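To make the "share of total variance" idea concrete, here is a minimal sketch of a balanced two-way variance decomposition over a score tensor indexed as `tensor[a][b][s]`. It treats the samples as plain replicates, which is a simplifying assumption; the function name and the toy data are hypothetical and not taken from the study.

```python
def variance_shares(tensor):
    """Split total variance of tensor[a][b][s] into A, B, A x B and residual shares."""
    n_a, n_b, n_s = len(tensor), len(tensor[0]), len(tensor[0][0])
    # Mean score of each (prompt A, prompt B) cell, then row, column and grand means.
    cell = [[sum(tensor[a][b]) / n_s for b in range(n_b)] for a in range(n_a)]
    grand = sum(sum(row) for row in cell) / (n_a * n_b)
    mean_a = [sum(cell[a]) / n_b for a in range(n_a)]
    mean_b = [sum(cell[a][b] for a in range(n_a)) / n_a for b in range(n_b)]

    ss_a = n_b * n_s * sum((m - grand) ** 2 for m in mean_a)
    ss_b = n_a * n_s * sum((m - grand) ** 2 for m in mean_b)
    # Interaction: what is left in a cell mean after removing both main effects.
    ss_ab = n_s * sum(
        (cell[a][b] - mean_a[a] - mean_b[b] + grand) ** 2
        for a in range(n_a) for b in range(n_b)
    )
    ss_resid = sum(
        (x - cell[a][b]) ** 2
        for a in range(n_a) for b in range(n_b) for x in tensor[a][b]
    )
    total = ss_a + ss_b + ss_ab + ss_resid
    return {"A": ss_a / total, "B": ss_b / total,
            "AxB": ss_ab / total, "residual": ss_resid / total}


# Toy tensor that is purely additive by construction: no A x B coupling at all.
toy = [[[0.1 * a + 0.05 * b + 0.01 * s for s in range(5)] for b in range(4)]
       for a in range(3)]
shares = variance_shares(toy)
print({name: round(value, 4) for name, value in shares.items()})
```

On real scores the A × B share will not be exactly zero; the point of the test is whether it stays in the low single-digit percent range, as the numbers quoted above suggest.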
So, we will talk about a lower compute budget. I'll also show you how we do this if we do not use DSPy or TextGrad for joint optimization. But the main message: joint optimization is mathematically unjustifiable.
And I do not like it, because I had some beautiful successes applying DSPy and also... So, I'm not sure what to do with this, but let's go on.
They have a second message. They do a benchmark study of six further optimization methodologies, compared against the zero-shot baseline under equal compute budgets of 100 candidate evaluations. And they tell us prompt optimization on Claude Haiku is more or less a coin flip: 49% or 50% of 72 optimization runs score below the zero-shot baseline. And if you take the slightly less powerful Amazon Nova Lite LLM, the failure rate was even higher.
So, those optimization runs do not really bring back the money that you had to pay for the compute infrastructure.
So, here you have it, both messages. The agent interaction, and they just look at the two agent system, is negligible according to their data.
And the optimization here of the prompts is just a coin flip because average gain over the zero-shot for all the different benchmarks is close to zero.
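A small sketch of how these two summary numbers are computed: the fraction of optimization runs that end below the zero-shot baseline, and the average gain over it. The scores below are made-up illustration values, not data from the study.

```python
def below_baseline_rate(optimized_scores, baseline):
    """Fraction of optimization runs that end up strictly below the zero-shot score."""
    below = sum(1 for s in optimized_scores if s < baseline)
    return below / len(optimized_scores)


def mean_gain(optimized_scores, baseline):
    """Average improvement over the zero-shot score (negative means a loss)."""
    return sum(s - baseline for s in optimized_scores) / len(optimized_scores)


# Made-up scores for six optimization runs against one zero-shot baseline.
runs = [0.71, 0.68, 0.74, 0.69, 0.70, 0.66]
zero_shot = 0.70

rate = below_baseline_rate(runs, zero_shot)
gain = mean_gain(runs, zero_shot)
print(rate)  # 0.5: half of the runs are worse than doing nothing
```

A failure rate near 0.5 together with a mean gain near zero is what "coin flip" means here.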
But you see, there's one thing that is really standing out. So, let's have a look. This is HelpSteer2.
A particular benchmark. Why is this happening?
And they tell us the optimization only helps us when an exploitable structure exists inherently in the LLM.
And we just have to activate what is not activated by default.
They call this the "can but doesn't" pattern. Okay, so HelpSteer2, the mechanism.
The task requires a structured JSON formatting and a specific rubric-based reasoning.
Now, those are capabilities that our AI model possesses latently but did not utilize by default in the zero-shot setting.
So, the AI model knows how to do it but does not utilize, does not activate, this particular methodology or formatting structure. In those cases, it really makes sense to optimize.
And on tasks involving free-form natural language, in all the other benchmarks, the zero-shot performance is already near the model's ceiling.
So, this means that if you have, let's go with this, free-form natural language, any optimization of the prompt structure in those contexts is effectively what the authors call noise chasing. Forget about it.
Now, they really jumped into the detail.
Look, Claude here, the agent A prompts for the HotpotQA question-answering example here.
And they looked at the mean score per agent A to agent B pair. And they said, where is the optimum? Where is the total maximum? Where is the linearity or non-linearity? And you know what, none of all of this; they have a simple result. The joint optimum is merely the sum of the independent optima. So, this means that where we started from, that we have a really complex non-linear system, this study tells us the joint optimization over agent A and agent B is an over-engineered solution for just a linear problem.
Now, this hurts. This really hurts, because this means that all this optimization complexity was more or less for nothing, because as it turns out, in their case, it's a linear problem.
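A toy illustration of why this matters for cost. If the mean score really is additive in the two prompts, searching each agent separately finds the same pair as the exhaustive joint search. The score matrix and effect sizes below are invented for the sketch.

```python
def joint_optimum(mean_scores):
    """Exhaustive search over all (a, b) pairs: n_a * n_b cells to evaluate."""
    pairs = [(a, b) for a in range(len(mean_scores))
             for b in range(len(mean_scores[0]))]
    return max(pairs, key=lambda p: mean_scores[p[0]][p[1]])


def independent_optimum(mean_scores, fixed_a=0, fixed_b=0):
    """Search A with B held fixed, then B with A held fixed: n_a + n_b cells."""
    best_a = max(range(len(mean_scores)), key=lambda a: mean_scores[a][fixed_b])
    best_b = max(range(len(mean_scores[0])), key=lambda b: mean_scores[fixed_a][b])
    return best_a, best_b


# Invented additive effects: score = base + effect of prompt A + effect of prompt B.
effect_a = [0.00, 0.03, 0.01]
effect_b = [0.02, 0.00, 0.05, 0.01]
additive = [[0.5 + ea + eb for eb in effect_b] for ea in effect_a]

print(joint_optimum(additive))        # (1, 2)
print(independent_optimum(additive))  # (1, 2)
```

With 10 prompts per agent that is 20 cells instead of 100. The shortcut is only safe when the interaction term is negligible, which is exactly what has to be tested first.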
Now, they have a very specific case, two agents. So, can we generalize this? Not really, but they give us a test. A test for us to find out if we have a cross relation between the agents.
And yeah, here you have it. This is what I just explained to you. We thought everything is non-linear, complex, coupled, with interwoven interdependencies.
But what if the interaction between the agents is statistically non-existent?
What if we have just a linear system we can just optimize real real cheap?
Now, cost. This is interesting, now.
They tell us that, for their particular diagnosis, the DSPy compilation for the dataset is about $1,000 to $5,000 US. If you go with TextGrad in the full complex end-to-end mode, it's about $5,000 to $10,000. That is if you really want to go with the standard procedures that we have.
And now they tell us, you know what? We have an approach where you have a coupling test that tells you, yes or no, whether they are coupled or not.
And if they are not coupled, you can just do a linear optimization of a single agent.
And this test just costs about $80.
Then you have a headroom test, which costs about $5 in runtime, and then, yeah, you can have a better re-ranking for 20 bucks.
It is not that I tell you that you should buy this, in no way. This is just to tell you: if up until now we used DSPy and TextGrad and it was really expensive, this new preprint tells you maybe these systems are not complex, maybe they are linear. Maybe it is much simpler. We just have to test whether there is a particular coupling in the dependencies of the agents. Let's say, if you go with their idea, you have a cost of about $80, which, compared to $1,000 to $5,000, is a really acceptable methodology.
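The arithmetic behind this comparison, as a quick sketch. The dollar figures are the ones quoted in this video; adding the three diagnostic steps into one total is my own assumption, and the dictionary keys are just labels.

```python
# Diagnostic steps and their quoted costs in US dollars.
diagnostic_costs = {"coupling_test": 80, "runtime_test": 5, "reranking": 20}

# Quoted cost ranges for full joint optimization, in US dollars.
joint_cost_ranges = {"DSPy": (1_000, 5_000), "TextGrad": (5_000, 10_000)}

diagnostic_total = sum(diagnostic_costs.values())
print(diagnostic_total)  # 105

for name, (low, high) in joint_cost_ranges.items():
    # How many times more expensive the joint run is than all diagnostics together.
    print(name, round(low / diagnostic_total, 1), "to", round(high / diagnostic_total, 1))
```

Even against the cheapest joint run, the full set of diagnostics is roughly an order of magnitude cheaper.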
But don't do it. I just want to tell you the idea behind it, and then you decide what to do.
We look at a three-dimensional score tensor where we analyze the variance of the complete system. It's a complete framework for analysis of variance of a coupled system. And this is employed as a rigorous statistical methodology to partition the total variance observed in the performance of a multi-agent compound AI system.
So, first you have the modification. You look: if you modify the prompt of agent A, what is the variance of the complete system? Then you modify the prompt of agent B and look at the variance, at what is happening to the performance indicator of the complete system.
So, this means you run a statistical observation, and they tell you, yeah, at the end you know: do you have a coupling or not? Can you go linear or not?
Now, they have one single paragraph of explanation, because I thought, why? Why are those interactions between those interwoven agents weak?
And they tell us that instruction tuning and reinforcement learning from human feedback train our AI models to produce consistent output across diverse input phrasings. What they do is effectively compress a wide range of input structures, styles, and configurations into a narrow output distribution.
So, this means all the instruction tuning, all the reinforcement learning, yeah?
They don't seem to be so sensitive to the explicit structure of a prompt, because those, let's call them more modern models, are so, let's call it "intelligent", that they understand all those variations, and they just do their job and produce a narrow output distribution, independently of how complex your query to the system is, yeah?
Mhm, okay.
Either we can go and accept this, but I just wanted to show you when I do here my test, my reasoning test. And this is a screenshot from my video where I tested MiniMax just to show you, yeah?
I have here my human prompt that is not optimized. I have used it for years now, so this is really just a chaotic human prompt, but you see that each system is analyzing it, yeah?
Say, we start, yeah? Okay, the floor zero to start, I have a certain energy package, I have certain tokens, I have certain code structures.
Now, the task is we need to track state after each press. The rules are complicated with many interactions, so let's try to understand what I have to do as an AI. To understand, here is an example: you go in an elevator from floor zero to floor 50, you have to press certain buttons, with special triggers.
And you see, then the AI system here, and this is an open reasoning trace, is trying to understand what is happening. So, it is, let's say, rewriting its own understanding in terms of whatever it learned in pre-training pattern recognition, and it tries to find similar patterns to understand what is going on and what solutions to apply.
So, it seems that every AI is kind of rewriting the human prompt anyway.
So, this means if the system really understood your intent, what you as a human want the AI system to do, the system will then analyze it itself and hopefully come up with the correct intent that you have.
But, let's focus here on this one special case.
So, this pre-print argues that the prompt optimization is just a coin flip, it's 50/50.
Unless, and this is now important, there is this "can but doesn't" pattern.
Now, let's focus only on this special case. So, this is a latent capability in the AI model that requires a specific structure to unlock.
And I thought, this is interesting, I want to explore this. And then, it was just a coincidence that on the same day another paper was published that exactly unlocks a specific structure.
So, this is the next paper I want to show you in this video, and we will see they have a query optimizer. And this query optimizer, in my opinion, is exactly this exploitable structure that the first paper tells you that you need, yeah? So, instead of hoping for a global prompt optimizer that, being a little bit expensive, finds the magical values for all the configuration instructions of your prompt.
They tell us, wait.
For a particular domain, for a specific use case, this new methodology now builds, hold on to your socks, a deterministic architectural stage to force our AI model into the "can" state of the "can but doesn't" pattern, so we now activate this "can" state in the model.
Interested? Great, let's have a look.
The second paper is here: Hong Kong Polytechnic University, an affiliated hospital of South University in Haikou, China, and a university cancer center in Guangzhou, China.
They have a beautiful idea, and at first I did not really understand the technical implementation and how beautifully it fits with the first paper.
They just go for a primary health care assistant in Hong Kong.
And this is, okay, RAG-enhanced, it's an LLM framework, beautiful, and I thought, okay, let's start to read the paper. And I was absolutely surprised when we look closer. Okay, Hong Kong, beautiful, everything automated.
They say, we use now a three-stage pipeline that is exactly following now the recommendations of the first pre-print that I just showed you, yeah?
They now optimize the agents that they will deploy for the primary health care system.
They will optimize those agents independently.
And yes, I already posted this here about an hour ago here to my community telling you, have a look at this. And more or less, this was intended here just to tell you, look, RAG-enhanced LLM frameworks, if you do it here on a city-wide or district-wide level, I don't know Hong Kong.
There's really some interesting implementation here, especially if you do it here on a governmental level.
But it turns out, if we have a closer look, it perfectly fits with the first paper.
This means the AI agent prompts don't need a joint optimization if you work in a multi-agent configuration.
Those agents just need to do their independent job real well.
This is the whole secret. So, do not invest in a joint prompt optimization structure, just optimize here for the congruency and the coherence of the system.
Because in the second paper they now start to separate the pipeline into what we know as a triage.
So, they go with an intent classifier, we have a detailed look at this. Then they have a retrieval system, that's a RAG system, we don't need to talk about this in detail. And then, interestingly, they have the synthesis level, the reconciler and we have a look at the reconciler.
And they are faced with a problem. They say, okay, we have a fragmentation of health care information in our country, in our region, in our district, in our city, whatever.
So, there are all these official guidelines from hospitals, the government, the institutions, whatever. So, they start to create a centralized knowledge repo that covers the major primary health care information in Hong Kong.
This includes all clinical standards and, this is interesting, all the community resources that you have, and all subsidized schemes where the government helps the citizen if you have a certain illness or whatever. This serves as the knowledge base for the RAG process. So, this is then extracted for the particular region, everything brought together. And I like this.
Then you see the framework processes the initial user input. Let's say this is me, and I have a very stupid question. I'm not a medical expert, I just tell you what I have, yeah?
And now the system needs a little bit of intelligence to understand the intent of the human user: this is the query optimizer.
So, you see, all those systems naturally have an architectural implementation of query optimization.
So, as a human, maybe you don't have to spend too much on the perfect system prompt, on the perfect user prompt,
because those systems already have a query optimizer built in.
So, simple: is it complex to understand the user, yes or no? If it is, then we go here, and I will show you this in detail, and we build subqueries, and then we can send the subqueries out to the internet or to our local knowledge retrieval. We have a semantic search, we have a keyword search, we have a re-rank, everything that you know from the classical RAG system, yeah?
Beautiful.
If it is really simple, then we just try to answer it ourselves.
Maybe we go to the local RAG only or maybe we have some other memory system for this. Beautiful.
But then, if it was complex and we have, I don't know, eight or 10 parts that come back from the internet or from the local knowledge retrieval, then they have this reconciler. They have a particular reasoning model that brings together all the pieces of information and constructs an official report, with governmental resources especially mentioned. So, if you want, the search process can also be limited to, or prioritized toward, the official governmental sources, or the hospitals, or the intensive care stations, or whatever the medical infrastructure in Hong Kong is. So they say, prioritize those elements, no?
In what city district on a Saturday afternoon can you have access to medical emergency?
So you understand, beautiful.
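The flow just described (classify, split into subqueries, retrieve, reconcile with official sources first) can be sketched like this. Every function here is a hypothetical stand-in: in the real system each stage would be an LLM call or a search backend, and the length rule and the split on " and " are crude placeholders I chose for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    official: bool  # True for government or hospital sources


def is_complex(query: str) -> bool:
    """Hypothetical stand-in for the LLM intent classifier (a crude length rule)."""
    return len(query.split()) > 8


def split_into_subqueries(query: str) -> list[str]:
    """Hypothetical stand-in for the query optimizer: one subquery per clause."""
    return [part.strip() for part in query.split(" and ") if part.strip()]


def retrieve(subquery: str) -> list[Passage]:
    """Hypothetical stand-in for semantic + keyword search followed by re-ranking."""
    return [
        Passage(f"web result for: {subquery}", official=False),
        Passage(f"official guideline for: {subquery}", official=True),
    ]


def reconcile(passages: list[Passage]) -> str:
    """Merge the retrieved pieces into one report, official sources first."""
    ordered = sorted(passages, key=lambda p: not p.official)
    return "\n".join(p.text for p in ordered)


def answer(query: str) -> str:
    if not is_complex(query):
        return reconcile(retrieve(query))  # simple path: one lookup, no decomposition
    passages = [p for sq in split_into_subqueries(query) for p in retrieve(sq)]
    return reconcile(passages)             # complex path: many subqueries, then merge


report = answer("where can I see a family doctor on a Saturday afternoon "
                "and what subsidy schemes cover the visit")
print(report)
```

Note that each stage is its own function with its own job, which is the point: they can be tuned one at a time.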
Summarization, yes, I explain to you after sync.
Now they say, we tested general LLMs, and they are not optimized for primary health care applications, no? Because when presented with a health-related query, those models often generated generic recommendations, such as directing the user to the emergency room or to any hospital, no?
But this is not what we want. And also, if you go for DSPy, DSPy has no information about your local, Hong Kong-specific infrastructure or whatever you need to know about, no?
So does it make sense to go to general models for the whole planet, where there's not enough specific information? Or maybe if you go to a city like New York, yes, maybe there is something in the general LLMs, but maybe this information is outdated or whatever.
So they say, "Okay, let's go and build a query optimizer, no?"
So instead of using a stochastic optimizer like DSPy to search for a better prompt configuration and context optimization, they decided to go for an agentic query optimizer.
And this optimizer is designed to interpret the user's initial input and transform it into a set of subqueries that the system understands, aiming to steer the search process toward relevant primary care options, such as family doctors or community health centers, rather than defaulting to secondary health care institutions or the emergency hospital, no?
So, in my words, this is a simple task: rewrite the fuzzy human user query and generate precise subqueries that the system is able to handle.
Then we have the intent classifier. Similar to the triage system in an emergency room, where the nurse first confirms the urgency of the patient's issues, this method uses an LLM-based classifier to process the user's initial input at the start of a conversation.
Two examples. A simple query, as I showed you, "Hey, what is the address of Queen Mary Hospital?" can proceed immediately to the next step, no? But a query that is a little bit more complex gets tagged by the system, no?
"Which clinic has a better dental service in Hong Kong?" This is now routed to a clarification module. What exactly does the user want to know? How do I have to interpret this? Do I know the user? Do I have any information about this user? Has this user already been in our system?
You understand it.
Then we go to the multi-round clarification module, no?
If the query is flagged as complex, the system proactively asks the human clarifying questions over multiple rounds. They really want to understand the user intent, no?
So when a user asks where to find a better dental clinic, the system helps them to clarify the actual meaning. What is the meaning of the word "better" for this particular patient, no?
Is the concern about price, or something else, like quality, accessibility, availability? This process continues until the user's intention, the human's background, and the priorities are really clear for the AI system, and the system says, "Go, next step."
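A minimal sketch of such a multi-round clarification loop. The `next_question` policy would be an LLM in practice; here it is a scripted, hypothetical stand-in that walks through a fixed list of aspects, and the user's replies are scripted too.

```python
from typing import Callable, Optional


def clarify(query: str,
            next_question: Callable[[str, dict], Optional[str]],
            ask_user: Callable[[str], str],
            max_rounds: int = 5) -> dict:
    """Ask clarifying questions until none remain or max_rounds is reached."""
    answers: dict = {}
    for _ in range(max_rounds):
        question = next_question(query, answers)
        if question is None:  # intent is considered clear: go to the next step
            break
        answers[question] = ask_user(question)
    return answers


# Hypothetical policy: ask about each aspect of "better" exactly once.
ASPECTS = ["price", "quality", "accessibility"]


def next_question(query: str, answers: dict) -> Optional[str]:
    for aspect in ASPECTS:
        question = f"How important is {aspect} to you?"
        if question not in answers:
            return question
    return None


scripted_replies = iter(["very", "somewhat", "not much"])
profile = clarify("Which clinic has a better dental service in Hong Kong?",
                  next_question, lambda q: next(scripted_replies))
print(profile)
```

The `max_rounds` cap is a design choice of this sketch, so that a vague user cannot keep the loop running forever.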
And then we have an intent based generalizer. So you see what happened with the single human prompt where I asked for dental service in Hong Kong?
My goodness, they now build a whole architecture, a harness around our core LLM, and this harness has such a beautifully structured task structure. Vanilla general LLMs improve coverage simply by paraphrasing words, no?
But they want an intent-based generalizer that is really specific to the subqueries, no?
So, an example: whether it's better to see a dentist in Shenzhen or in Hong Kong can be refined into searches like, "Hey, Hong Kong dental prices and Shenzhen dental prices, compare them." Clarification reveals that the user's intent is to find the more affordable health care service. What is the price?
You see the human prompt is now really preprocessed and further processed and now hopefully the AI system understands really what the user wants, no?
Moreover, if the user expresses concerns about some other pain in the body, this is now interesting, because they don't just refer the human to some orthopedic guidelines or whatever, but they also point to district health centers in Hong Kong and some government-subsidized care schemes, because they say, "Hey, it is better to prevent than to go to the hospital when you really have something, no?"
So they also produce a list of atomic search queries to make the retrieval process significantly more comprehensive for the AI system, with the intention of giving not only a single, laser-focused answer, but of taking care of the human as a whole, a complete body. Let's see what care schemes are government-subsidized and can be offered to the human.
If you want to see this in a screenshot: a simple, stupid user query, and then they compare. First, a zero-shot LLM, just using DeepSeek, no?
There, a lack of domain knowledge leads to hallucination by the AI system.
Then they have a web-only RAG that they implement with DeepSeek.
Less curated sources cause some context pollution, no?
Then they have a local-only RAG, no?
Maybe it has outdated information in the retrieved context. But if they go with their full D-RAG system, everything is there for the region of Hong Kong, or the city of Hong Kong, or whatever: optimized and up to date, you get the idea. What are the results?
They have a comparative performance of system configurations across five metrics that they analyze: accuracy, completeness of the information, trustworthiness, clarity, and relevance for the human.
And there you see it. The first one is the zero-shot LLM here in this little light pink.
And in orange we have the web-only RAG system, with less accuracy. Then we have the local-only RAG. Oh yeah, this is much better. And then the full D-RAG system, where we integrate everything from the web with a local database and everything that is important in your local district.
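How web and local results get integrated is not spelled out in this video, so the following is only one plausible sketch: merge both hit lists by relevance, let the curated local source win ties, and drop duplicates. The function name, the scores, and the example texts are all invented for illustration.

```python
def merge_hits(local_hits, web_hits, k=3):
    """Merge (text, relevance) hits from both sources into one top-k list."""
    # Tag each hit with a source rank: 0 = curated local database, 1 = open web.
    tagged = [(score, 0, text) for text, score in local_hits]
    tagged += [(score, 1, text) for text, score in web_hits]
    # Highest relevance first; on equal relevance the local source comes first.
    tagged.sort(key=lambda t: (-t[0], t[1]))

    seen, merged = set(), []
    for _score, _source_rank, text in tagged:
        if text not in seen:  # keep only the best-ranked copy of a duplicate
            seen.add(text)
            merged.append(text)
    return merged[:k]


local = [("District Health Centre dental scheme", 0.9),
         ("Family doctor directory", 0.7)]
web = [("Forum thread on dental prices", 0.9),
       ("Family doctor directory", 0.8)]

print(merge_hits(local, web))
```

The tie-break toward the local database reflects the concern raised above that less curated web sources pollute the context.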
Just beautiful, no?
Interesting. I really like this study.
Maybe I cannot convey the fascination of this idea, but imagine governmental services where they say, "Hey, prevention is better than really having to bring you to the hospital, no?"
And you take care of yourself, and these are the recommendations, these are the guidelines that you should follow, and you have an online information system with AI RAG. I think this can theoretically be a really beautiful, helpful system for the community, no?
From an AI computer science point of view, I think if you read the study, it confirms that in compound AI systems the most significant gains you can get out of the system are found in improving the interaction between the system's modular components themselves, rather than in just doing some individual prompt optimization or context optimization, no?
So take care of the system configuration and the system coherence, and do the modular component optimization.
What is my conclusion? I think the future of compound AI isn't to jointly optimize prompts. This is so old-fashioned, this black box pipeline, forget about it, no?
I think it is really, if we talk about this harness around the core LLM, structural decomposition. Identify the bottleneck agent and build specific optimizers, as I just showed you in the second preprint, no?
To bridge the gap, if you offer some community service to everybody in your district, no?
To bridge the gap between human non-expert input when I ask any AI system, "Hey, I feel pain in my whatever."
And somehow this input must be made compatible with a machine-required expert data structure.
And you see how many agents were involved, what they built, and how they finally came to the best performance they could achieve at the end of April 2026. I really like this study because it gives you an insight into the operational computer science timing and also provides a service for the community.
Last one: I think this really shifts the balance away from pure prompt optimization and context optimization, from searching for the best instruction set or instruction-following complexity for the prompt.
If we talk about harness systems, this shows us we have to engineer the most precise input for our core intelligence, for our core LLM.
And in this harness sphere around our core LLM, we have multiple AI helper functions.
And they have to work together as a whole to provide the most precise data input: sequential data chunks, the relevant chunks, the chunks that agree with and do not contradict the parametric knowledge of the LLM, and so on.
So interestingly, it seems that in the optimization of a complete AI system we are moving away from finding the best instruction set for a single agent or multi-agent compound, toward system engineering a ring of AI helpers around the core LLM that provide it with the most precise inputs, so that the core LLM can perform best on the task for a specific human prompt.
I think we can learn so much from all these pilot studies that cities or governments are currently executing or publishing about how they are building AI systems for the community, how they reference this, how they integrate RAG systems, multiple RAG systems, how they build their AI harness around the core LLM. It is absolutely fascinating to see all the different technology trajectories that are emerging here from China to Boston to wherever you are living. It is just amazing what people can build with AI systems.
And maybe I somehow could transport here this fascination for AI over to you watching here this video. And I hope you got some new ideas, some new data, maybe some new configuration that you want to test out. And maybe I see you in my next video.