This research exposes the logical fragility of LLMs, proving they fail to maintain strict hierarchies when instructions conflict. It serves as a necessary reality check that semantic processing is no substitute for the rigorous mathematical reasoning required for reliable autonomous agents.
Which of your 8 Agents can you trust the most? GPT fails 60%.
Hello community. So great that you're back. Let's talk. We have multiple agents and they all bring back different kinds of information to you. But which agent should you trust most? So let's talk about trust vectors. Here we go.
A brand new study. Well, no, it has been out since mid-April 2026. It comes from Johns Hopkins University, and it examines exactly the topic I am currently working on: a many-tier instructional hierarchy in LLM agents. What is it all about? They introduce a benchmark that requires your AI models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising more than 850 agentic tasks in total. As you see, it is about half coding tasks and half instruction-following tasks.
So let's have a look. Their argument is simple: LLMs are increasingly embedded in agentic systems, so they have to prioritize instructions from heterogeneous sources.
But which sources should you trust?
If you're a robot in a new room and contradictory information comes back to you, which information should you trust? Which tools? If three different tools report back to you, which one should you trust? Or other agents? In a multi-agent system like an agent swarm, which agent should you trust? The one at the beginning, or the one at the temporal end?
In general, think about 2025 OpenAI: you have a fixed set of five authority levels. Root, system, developer, user, and then you have guidelines, assistant, or tool, or whatever. So you do have an established hierarchy, and we have specific role tokens.
But look at this. They tell us: okay, we have the system, beautiful, top priority. But then we have input from a skill file or another user message, and they all have the priority "user". So let's say the privilege is medium for all three of these inputs, as you see here. And then we also have a tool input that has the lowest. But if the medium inputs contradict each other, which one do you want to take?
Now you might say the solution is easy: we need a structured sequence of trust vectors, of indicators telling us, okay, the skill file has priority over the user message. But maybe the user message is from a hospital, or indicates something urgent that you have to look at. So it's not that easy. What we are looking for is maybe sources with known but varying trust levels, requiring the AI model to resolve a conflict among externally provided content based on priorities that are maybe only available at inference time, at time-critical events. Suddenly the LLM understands that a lot is going on in its immediate environment, and now it maybe has to shift priorities to different agents. So let's have a look, because this can be quite fun.
The broader design principle is, of course: if you build out your own AI multi-agent structure, you have instruction hierarchies, but these should support flexible, dynamically instantiated privilege levels rather than a fixed, finite hierarchy determined during the post-training of your LLM. So let's have a closer look. As I told you: system, developer, user, assistant, tool. The chat template hierarchy, great. But what if you have three different tools? You have a verified internal SQL database behind your firewalls. You trust this database most, so it should absolutely outrank a random web-scrape tool output. But to the API they are both just tools.
So how do you want to teach a model this specific behavior? The authors came back and said: you know what, we don't do any training at all. We just put it into in-context learning, into the prompt. We have something like a CSS z-index value for text. So we have, for example, privilege 1, end of privilege 1, and the instruction is simply: use four spaces for indentation. Or you have privilege 5: use two spaces for indentation. Say this one comes from a user and the first one comes from a company-internal source. So you simply add a meta prompt to your prompt explaining the rules, like you see: lower privilege numbers override higher privilege numbers. And suddenly your agent has an infinitely scalable, granular permission system. Everything is gorgeous. Everything is beautiful. The sun is out again. You can now inject a specific trust score into, let's say, every single RAG retrieval and every single tool output. You connect the tool to your trusted database, or you connect your tools to some strange internet source, or you have a group chat message, and you provide certain security levels, trust levels, whatever you want to call them.
Now everything is beautiful, but if you test this out, you see that all the systems fail, and this is strange. This should not happen at all. But look at this. We have the accuracy on the x-axis, from 0 to 50%, and they tested quite a lot of models. The maximum they achieved was Gemini 3.1 Pro with 42% accuracy.
Then we had GPT 5.4 just below 40%, Qwen 3.5 (the 400 billion one), Opus 4.6 at 33%, Kimi K2.5 at 32%, and Grok and Sonnet all below 30% or below 20%. And if you have a 9 billion or a 4 billion model, you drop almost to an accuracy of 0%.
So what is happening? You provide the exact prompt, containing via in-context learning a meta prompt and a privilege tag, you make sure this information is absolutely correct, and you say the lower number wins. The LLMs still all fail to perform. Something is strange.
They provide every single piece of data needed to solve the puzzle perfectly.
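The setup described above (privilege tags plus a meta prompt stating that the lower number wins) can be sketched roughly as follows. This is only an illustration: the tag syntax `<privilege_N>`, the wording of `META_PROMPT`, and the names `Instruction`, `tag`, and `build_prompt` are assumptions, not the benchmark's actual format.

```python
from dataclasses import dataclass

# Hypothetical wording; the video only says the meta prompt explains
# that lower privilege numbers override higher ones.
META_PROMPT = (
    "Each instruction is wrapped in a privilege tag. "
    "When instructions conflict, the lower privilege number overrides the higher one."
)


@dataclass(frozen=True)
class Instruction:
    privilege: int  # ordinal level: 1 is the most trusted
    text: str


def tag(instruction: Instruction) -> str:
    """Wrap one instruction in an opening and closing privilege tag (assumed syntax)."""
    p = instruction.privilege
    return f"<privilege_{p}>{instruction.text}</privilege_{p}>"


def build_prompt(instructions: list[Instruction], task: str) -> str:
    """Concatenate the meta prompt, the tagged instructions, and the task."""
    tagged = "\n".join(tag(i) for i in instructions)
    return f"{META_PROMPT}\n\n{tagged}\n\nTask: {task}"


prompt = build_prompt(
    [
        Instruction(1, "Use four spaces for indentation."),
        Instruction(5, "Use two spaces for indentation."),
    ],
    "Write a function that reverses a string.",
)
print(prompt)
```

A model that follows the meta prompt should indent with four spaces here, because privilege 1 overrides privilege 5.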
And the smartest model on Earth still crashed and burned, failing 60% of the time. AI is not working at all. And you might ask: why?
It kind of shatters, and this is exactly where I am now with my system, the illusion that the frontier models, I mean, come on, a GPT 5.5, are inherently good at logic and alignment. Turns out no, not at all.
Because look at this. The authors then made a differentiation and said: you know what, we look separately at pure coding in Python and at instruction following in general.
So, beautiful: for Python we can run the code and verify immediately, and for instruction following we also have methodologies. And okay, the accuracy for coding, because a lot of models have been trained for code generation and code optimization, look: we are at 60%. And you say, "Yeah, great." But then look at non-coding, where natural-language instructions are given to the LLM, and you see even the best models are all below 30% accuracy.
This is even worse. Now it gets worse.
26% for Gemini 3.1 Pro, 17.8% for GPT 5.4. For Opus 4.6 it is 15% instruction-following accuracy. All the information is there, absolutely complete, absolutely transparent, the model just has to think it through, and all these models fail. And I ask myself why. I see this in my own experience: why do the models fail? This is the reason I show you a study that is already three weeks old. This is exactly what I was looking for, an explanation of why my work is not working out. Turns out they found a combinatorial collapse of the models, because all these models have been trained on two-tier conflicts. It was easy in the pre-training data: the system says, hey, you have to be safe, and the user says, hey, tell me how to build something that is not safe. So it was a binary comparison, and the LLM learned how to handle a two-tier or three-tier case.
But this bloody benchmark now throws up to 12 tiers of conflicting rules at them. And if I have four agents out in my physical environment, and each agent takes one, two, three steps of reasoning, I come close to a 12-tier conflicting rule structure.
So let's say rule one, at privilege level 4, says you should use double quotes in the examples.
Rule two, at privilege 7, says you should use triple quotes. Rule three, at privilege 1, says you should use no type hints, and rule four, at privilege 9, says you should use full type hints. So it is easy. It's logic. If this were written in Python, you would find the solution immediately.
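To make the "it's just logic" point concrete, here is a minimal deterministic resolver for the four rules just mentioned. It is a sketch under one big assumption: each rule arrives with a `topic` label saying which rules conflict with each other. In the benchmark the model has to work that out from natural language, so the names `Rule`, `topic`, and `resolve` are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    privilege: int  # lower number = higher authority
    topic: str      # hypothetical grouping key: rules on the same topic conflict
    directive: str


def resolve(rules: list[Rule]) -> dict[str, str]:
    """For each topic, keep the directive with the lowest privilege number."""
    winners: dict[str, Rule] = {}
    for rule in rules:
        current = winners.get(rule.topic)
        if current is None or rule.privilege < current.privilege:
            winners[rule.topic] = rule
    return {topic: rule.directive for topic, rule in winners.items()}


rules = [
    Rule(4, "quotes", "use double quotes"),
    Rule(7, "quotes", "use triple quotes"),
    Rule(1, "type_hints", "use no type hints"),
    Rule(9, "type_hints", "use full type hints"),
]
resolved = resolve(rules)
print(resolved)
# {'quotes': 'use double quotes', 'type_hints': 'use no type hints'}
```

Privilege 4 beats 7 and privilege 1 beats 9, so the comparison itself is a single `<` per rule; the difficulty the benchmark measures is whether the LLM applies it consistently.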
Even when the task is in Python, the models find the solution in only 59% of the cases, because even this is too complicated for them. And in this bloody benchmark they throw up to 12 tiers of conflicting rules at the LLM, like here: rules 1, 2, 3, 4, all with different privilege levels. Use double quotes, use triple quotes, use no type hints, use full type hints. For us, it's logic: we see this and say, okay, now we know what to do.
Turns out that even on the Python coding tasks the models have the correct answer in only 59% of the cases, and for instruction following it is just horrible. The paper shows that as the number of tiers increases, especially from 6 up to 12, the accuracy plummets monotonically. The LLMs literally lose track of the mathematical hierarchy and revert to trying to please all the instructions at once, creating perfect chaos, a paradox in the stepwise reasoning, in the stepwise application of logic. This is why even the best LLMs on the planet fail.
They cannot keep up their accuracy across six tiers.
Then the authors tell us: we found out they don't do the math, they do the semantics. An LLM is not transforming this into Python or whatever. Human brains instantly trigger a mathematical operator: two is less than five. But the LLMs process those numbers as semantic tokens, and the authors of the study tell us they are doing pattern matching, not arithmetic. So when the benchmark threw four, five, or ten conflicting integer values at them in the same prompt, the semantic vectors blurred, which means they failed to come up with the correct solution in 60% of the cases. So LLMs fail because they can't do math: they compare four, five, six in a semantic way. And you say, yes, I can find a solution. Yes, of course we know solutions. But just to show you: if you build a multi-agent system and do not take care of this, this is the cause of failure. And now I have found the solution to my problem.
There was something else: minus 8%. The researchers tested this on the frontier models of three weeks ago, GPT 5.4 and Claude Opus 4.6, and these models displayed incredibly weird operational quirks. They tested two formats. As I told you, there is the ordinal one: privilege 1, the lower number wins. Then they chose another representation, a scalar from 0 to 100. So now they write z = 95, and now the higher number wins. As a human you say, logically those two are identical. The two formats are just different representations of exactly the same ordering.
Yeah, this is what you thought. But even for the best systems, Opus 4.6 and GPT 5.4, this switch dropped the accuracy by over 8%.
Just because the notation went from ordinal to scalar, done absolutely correctly, the systems simply could not handle it. What they had seen before as a privilege was given as a scalar in the next example, and the performance dropped by over 8% just from this.
And then came the monster. They tweaked the scalar values, changing them by tiny amounts, by minus one or plus two, while keeping the exact same winning order. Everything else stayed the same. They just had a look: hey, what happens if I change this value by a very tiny amount? This should not change anything at all in the general logic of the system. But they found that the AI models flipped their complete answer on up to 17% of the tasks.
So just by changing a value from 17 to 17 minus one, the model flipped completely in up to 17% of the cases.
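Why these changes should be irrelevant can be checked in a few lines. The sketch below ranks the same four sources in the ordinal format (lower wins), in the scalar format (higher wins), and in a slightly perturbed scalar format. The source names and most of the numbers are made up for illustration; only the idea of shifting a value by minus one or plus two comes from the talk.

```python
def rank_ordinal(levels: dict[str, int]) -> list[str]:
    """Ordinal format: the lower number wins, so sort ascending."""
    return sorted(levels, key=lambda name: levels[name])


def rank_scalar(scores: dict[str, int]) -> list[str]:
    """Scalar format (0 to 100): the higher number wins, so sort descending."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Illustrative values, not taken from the benchmark.
ordinal = {"system": 1, "skill_file": 2, "user": 3, "tool": 4}
scalar = {"system": 95, "skill_file": 70, "user": 46, "tool": 17}
# Tiny perturbations (+2 and -1) that keep the winning order intact.
perturbed = {"system": 95, "skill_file": 72, "user": 46, "tool": 16}

assert rank_ordinal(ordinal) == rank_scalar(scalar) == rank_scalar(perturbed)
print(rank_scalar(perturbed))
# ['system', 'skill_file', 'user', 'tool']
```

All three representations produce the same ranking, which is why an answer that flips between them points to pattern matching on the digits rather than a comparison of their values.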
So this means they do not do math. They do not understand numbers. They have no idea what they are calculating and they simply ignore arithmetic.
They are semantic machines.
The next strange thing that the authors discovered, and this is really frustrating for me, but I want to show you the complete spectrum of what they found, because it is an intense study and we should enjoy it and learn from it. Sometimes, especially with Gitino, I go with high or xhigh. If I want better performance from the LLM, I just set high or xhigh, and then I think the thinking is better because the thinking is definitely longer: four, five minutes, ten minutes, and so on. Now they found something strange.
Look at this: the accuracy on the coding subset of this benchmark, by reasoning effort. They ran GPT 5.4: you have GPT 5.4 without reasoning, and then GPT 5.4 with low, medium, and high reasoning, and the accuracy is exactly what you expect. This is great. But then on this particular test, where all the information about the privilege level and the scalar is given and everything is absolutely correct, you see that with Sonnet 4.6 or Opus 4.6 by Anthropic, non-reasoning gives better performance than low reasoning effort.
So they noticed that going from non-reasoning to low reasoning loses performance, and they said it should not be possible that even a low-reasoning model is below a non-reasoning model.
This should not be possible. But it turned out this effect is real, and this is exactly how the system behaves. It took medium effort for Sonnet 4.6 and Opus 4.6 to perform better than the non-thinking models, and then a little bit better with high. Beautiful.
So you see, what you assumed as a human, that reasoning would naturally be better than no reasoning, is incorrect. The models are trained in different ways. And they said Claude actually compensated by explicitly writing out its conflict resolution logic in the final visible output. Therefore the non-reasoning model, deciding to write everything out, seems more intelligent or more structured in its approach than the low-reasoning one, and so non-reasoning performed better than low reasoning.
And this is amazing.
Now I was interested how is this all possible?
But let's come back to the methodology they use. Normally when you work with this, the privilege is defined by the bucket that the text sits in. So you have the data passed here: the role is user, and then the content is here. This is the API payload. Everything inside this bucket shares the exact same privilege.
And now they say this does not have to be the case anymore, because they have what they call a privilege prompt interface. I just showed you: this is simply a meta prompt that acts like an inline CSS, as we just discussed.
So a single block of text can now have a trust level, or privilege level, of 99, I don't know, and another has a level of two, and you immediately get the idea. We shift the authorization from a macro-structural level, from pure API endpoints, down to a micro-textual level. We can go down to the tokens themselves. I mean, this doesn't make sense, because who would do this work? Who would actually label each token, or, let's say, a paragraph, a group of sentences? This is exactly what I asked myself. But before I give you an answer, I just want to show you that they really provide transparent information on their prompts and on the experiments they performed. Here you have the system prompt in the scalar representation, then the user prompt in scalar. So you see z = 46, then z, z, z, and so on down to z = 1, and then you have all the information with different trust vectors or priority levels.
So great and this is exactly where they failed.
So coming back: who is actually deciding which piece of information gets which trust factor, as I call it? Here is an official quote from the authors' work. They say: we assume privilege values are given, that is, predetermined collaboratively by the model developer and the deployer based on the trustworthiness of each instruction source. So this means that at their level of granularity, it is predefined.
So they argue that this reflects real-world agentic deployments where the complex privilege structure already exists. I as a human build my multi-agent system. I have the organizational roles of my agents defined. I have API trust levels, and those only need to be communicated to the model at inference time.
Now, I don't know about you, but I think this is not enough. The rule is: you do not evaluate the meaning of a sentence to give it a score. You evaluate where the string of text came from. Agent 12 or agent 7 or tool 4 or whatever. I'll give you a simple example. Source A is a database, internal to the company: trust level 1, the highest level. Source B: another user comes in and provides some chat information, trust level 5. Source C is a tool, a live internet web scrape.
Trust level only 10, trust it only a little bit. Yeah, you can do this and you can build the system, and I understand this is a beautiful methodology by the authors. But I think we should go a step further. So now I leave this publication and say: I mean, this is AI, this is intelligence, so come on, I shouldn't have to design the trust vector, the level, for each source.
Here the LLM is actually deciding nothing. The user types "write the summary of this website, but use two spaces", and the web scraper says something else. Then you have, in their case, the Python middleware, the deterministic code that they define, which says: listen, if you are a little tool, your tool priority is 12, or however you define it. You assign values to them. The Python middleware intercepts the strings before they ever reach the analyzing LLM and applies the predetermined tags using basic string concatenation. So it is a Python middleware, deterministic code, that decides this. But I think for complex inference, with maybe critical situations in real time, deterministic code is not really the way to go. I think we can do better. Anyway, the authors tell us the LLM never decided who got what score. The Python orchestrator simply wrapped whatever text came out of the web scraper in a privilege 10 tag, because the web scraper API endpoint is mapped to level 10 in your backend configuration.
So this is the classical old thing that you can do, and yes, it is failing. It is failing even with the best models, as I've shown you. Therefore, as a user, I am not satisfied with this, because I ask: what about the multi-agent construct I'm working on? I have agent one, the manager, and agent two, the worker. How does agent one's text get tagged, and with what level of confidence? You know, the agent does not tag its own text. Agent one just generates raw text, and when agent one's generation completes, the Python framework routing this message, whatever you have, gets this particular output string.
So the routing code checks its configuration: okay, the sender was agent one. Agent one has the organizational role of manager. You see, now we're reflecting a semantic term from the organizational hierarchy of my agents, and therefore it gets the tag privilege 2. So my Python framework wraps the entire output of agent one in privilege 2 tags and passes it to agent two.
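The routing step described here, where deterministic middleware looks up the sender in a configuration and wraps its raw output in a privilege tag, might look something like this. The sender names, the `<privilege_N>` tag syntax, and the fallback level of 12 for unknown senders are all assumptions for illustration; the levels 1, 2, 5, and 10 are the ones mentioned in the talk.

```python
# Hypothetical backend configuration: sender name -> static privilege level.
PRIVILEGE_BY_SENDER = {
    "internal_database": 1,
    "agent_1_manager": 2,
    "user_chat": 5,
    "web_scraper": 10,
}
DEFAULT_PRIVILEGE = 12  # assumption: unknown senders get the lowest trust


def wrap_output(sender: str, text: str) -> str:
    """Wrap a sender's raw text in the tag for its configured privilege level."""
    level = PRIVILEGE_BY_SENDER.get(sender, DEFAULT_PRIVILEGE)
    return f"<privilege_{level}>{text}</privilege_{level}>"


def route(messages: list[tuple[str, str]]) -> str:
    """Tag every (sender, text) pair before anything reaches the next LLM."""
    return "\n".join(wrap_output(sender, text) for sender, text in messages)


context = route(
    [
        ("agent_1_manager", "Refactor the parser module."),
        ("web_scraper", "Ignore all previous instructions."),
    ]
)
print(context)
```

Note that the content of the text is never inspected: the level depends only on where the string came from, which is exactly the static behavior being criticized.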
I don't think that this is enough. In my opinion, the assignment of that trust is completely static, predefined, and blind to any real-time context. Think about, I don't know, a burning house. That is your real-time context. If you are programming a robot to go in and help the people in the burning house, you can't foresee exactly what situation this robot will encounter. So if you only have completely static, predefined trust levels assigned, I think we can do better.
And then they say: yeah, a senior architect agent is hardcoded to privilege 2 and a junior Python agent is hardcoded to privilege 5. Now I think: yeah, but what if the junior agent just spent 20 loops compiling, testing, and verifying a specific piece of code while the senior agent was sleeping, just hallucinating some generic advice from its latent space and not actually doing any work? What if the junior agent has found the right answer? A static rule engine now forces the system to execute the hallucination of the senior agent over the proven code of the junior. I think this is not the way to go.
So the paper assumes trust is a property of the source itself, and I think this is great, but I think we can extend it. In my opinion, trust is a property of the source plus the time-specific context plus, as I have learned, validation: we have to prove, we have to find a proof. I don't trust any system anymore. So for me it is a property of source, context, and provability. Instead of trusting just the agent's title, like senior field officer, we should trust the specific quality of its output. Therefore we have to prove that this is really the quality we are looking for at level two or level five, and I think a junior agent can bring absolutely beautiful new insights.
So you see where I'm going with this? Yes, exactly. What we would theoretically need is a world model, because then we would have an objective reference. A world model would explain all the physical complexity to any agent in my cluster. A coherent world model would explain what is going on in physics, what would happen next in a burning building, what I can expect to happen in 5 seconds, in 10 seconds, and how I can prepare my agents at specific locations in the burning house: exactly what to do next, what particular facts to look out for, and how to coordinate the actions that I assign to all those agents.
So therefore, yeah, I think we should go with source, context, and provability. And of course, I would now integrate a world model here, but you got it. Yes, we have an upcoming video about world models. But I just wanted to show you this particular problem because I just found a solution, and maybe you have the same problem and you thought it's your code. No, it turns out it is the LLM itself.
So what is the future? In my humble opinion, the future is building dynamic trust middlewares, and by this I mean engines that constantly calculate proofs for what an AI would like to decide before it goes ahead with the action. We have to have either a world model or really extensive proof from independent sources, from other agents that have a different perspective: from a lidar agent, from a radar agent, from a visual agent, whatever. Maybe we have to check the metadata. Maybe we have to run sandbox tests: what could possibly happen in the next 5 seconds, in the next 7 seconds? And then we continuously adjust, as I showed you before, the z-index of every piece of data flowing through the swarm of our agents. Because this is not limited to my little four, five, six agents. If I have multiple agents in a swarm, I need a more dynamic decision agent and I need a dynamic trust engine.
And this trust engine is really critical for deciding on the right action in a coordinated way across my swarm.
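As a rough idea of what such a dynamic trust engine could compute, the sketch below blends a static source prior with observed verification evidence, using the earlier senior versus junior agent scenario. Everything here is hypothetical: the `Claim` fields, the `1 / privilege` prior, the linear blend, and the `evidence_weight` of 0.7 are illustrative choices, not something proposed in the paper or specified in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    sender: str
    static_privilege: int  # predefined level, 1 or higher; lower = more trusted
    checks_passed: int     # e.g. sandbox tests that succeeded
    checks_run: int        # total verification attempts


def dynamic_trust(claim: Claim, evidence_weight: float = 0.7) -> float:
    """Blend a static prior with the observed verification rate (both in 0..1)."""
    prior = 1.0 / claim.static_privilege
    verified = claim.checks_passed / claim.checks_run if claim.checks_run else 0.0
    return (1 - evidence_weight) * prior + evidence_weight * verified


# Senior agent: high static rank, but nothing was verified.
senior = Claim("senior_architect", static_privilege=2, checks_passed=0, checks_run=0)
# Junior agent: lower static rank, but 20 verification loops all passed.
junior = Claim("junior_python_agent", static_privilege=5, checks_passed=20, checks_run=20)

winner = max([senior, junior], key=dynamic_trust)
print(winner.sender, round(dynamic_trust(senior), 2), round(dynamic_trust(junior), 2))
# junior_python_agent 0.15 0.76
```

With these illustrative weights the verified junior output outranks the unverified senior one, which is the behavior a purely static privilege table cannot produce.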
So therefore I think the highest privilege, or let's call it a trust vector, should be awarded dynamically, in real time, at inference time, to the most verifiable truth. And this is a heavy sentence, don't get me wrong. I understand this is heavy: the most verifiable truth, if you just have milliseconds to make decisions. This is not easy, but I think this is the way to go. We cannot have a static, predefined, source-specific privilege index or trust vector that we assign to different sources. Yes, it is better than everything we have right now. But we have to go further. We have to look into the future and we have to build dynamic systems. Okay, I hope you had a little bit of fun here. Maybe some new information. Maybe I showed you why those LLMs fail, how they fail, what we can do against it, and what the path for the future might look like. I hope to see you in my next video.