
RAG is Wasting 80% of Your LLM Compute Budget (How We Fixed It)
Added: 2026-05-16

484 views • 15:23 • CorbenicAI • Originally published: 2026-05-09

In Retrieval-Augmented Generation (RAG) systems, hybrid retrievers that search a database by both exact keyword and semantic meaning often retrieve the same text chunk through both paths. According to the video, about 24% of the prompt text in typical enterprise setups is an exact byte-for-byte duplicate, and the share rises to 80% in long multi-turn chat sessions. This redundancy wastes compute and raises inference cost without improving model output. The video presents a deterministic, byte-exact deduplication engine at the infrastructure layer as the fix, citing an evaluation across four production language models in which the average change in output quality after deduplication was 0.0 percentage points.

[00:00:00]Tech giants are currently burning billions of dollars on computing power, feeding trillions of words into large language models every single day.

[00:00:09]We assume the pipelines delivering this text are highly streamlined, feeding these systems perfectly organized data.

[00:00:17]That assumption falls apart when we look at retrieval augmented generation, or RAG.

[00:00:22]This is the process of pulling external documents, manuals, or internet pages to build the prompt for the AI so it actually knows what it's talking about.

[00:00:32]To find the right documents, modern pipelines use hybrid retrievers. They search your database by exact keyword, and they also search by general meaning.

[00:00:41]Because these two search systems run in parallel, they routinely grab the exact same paragraph of text through two different paths and stuff both copies into the final prompt. The language model is completely blind to this. It computes every single word it receives from scratch, burning expensive processing power to read identical paragraphs over and over and over again.
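
To make the overlap concrete, here is a minimal Python sketch of how a naive hybrid merge can put the same chunk into a prompt twice, and how to measure the duplicated share of the payload. The function names and the sample chunks are hypothetical illustrations, not taken from the video.

```python
def naive_hybrid_merge(keyword_hits, semantic_hits):
    """Concatenate both result lists without checking for overlap (hypothetical helper)."""
    return list(keyword_hits) + list(semantic_hits)


def duplicate_byte_fraction(chunks):
    """Fraction of payload bytes that belong to a repeat of an earlier, byte-identical chunk."""
    seen = set()
    total = 0
    duplicate = 0
    for chunk in chunks:
        data = chunk.encode("utf-8")
        total += len(data)
        if data in seen:
            duplicate += len(data)  # this chunk was already in the payload
        else:
            seen.add(data)
    return duplicate / total if total else 0.0


# Made-up sample results: both retrievers return the same first paragraph.
keyword_hits = [
    "Reset the router by holding the button for 10 seconds.",
    "Firmware updates are published monthly.",
]
semantic_hits = [
    "Reset the router by holding the button for 10 seconds.",
    "A factory reset restores default settings.",
]

payload = naive_hybrid_merge(keyword_hits, semantic_hits)
print(len(payload), "chunks in the prompt")
print(f"{duplicate_byte_fraction(payload):.0%} of payload bytes are exact duplicates")
```

The measurement only counts byte-identical repeats, which matches the kind of duplication the transcript describes; near-duplicates with different whitespace or wording would not be counted.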

[00:01:05]This chart shows the composition of a typical prompt payload in an enterprise setting.

[00:01:10]When you analyze corporate setups parsing Wikipedia articles, white papers, and Q&A forums, 24% of the text hitting the model is an exact byte-for-byte duplicate.

[00:01:21]In long multi-turn chat sessions, the waste is even worse because the system repeatedly sends the entire previous conversation history back and forth with every new message. The redundant data balloons to 80% of the total payload.
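
The video does not say how the 80% figure was measured. As a rough illustration only, the toy model below assumes every message has the same size and that request number k resends messages 1 through k. Under those assumptions, the share of transmitted messages that were already sent in an earlier request grows quickly with session length. The function name is hypothetical.

```python
def resend_redundancy(turns):
    """Share of all messages sent over a session that had already been sent before.

    Toy model (an assumption, not the video's method): equal-sized messages,
    and request k resends messages 1..k.
    """
    if turns <= 0:
        return 0.0
    total_sent = turns * (turns + 1) // 2  # 1 + 2 + ... + turns
    unique = turns                         # each message is new exactly once
    return 1 - unique / total_sent


for turns in (1, 3, 9, 20):
    print(turns, "turns:", f"{resend_redundancy(turns):.0%} redundant")
```

In this model the redundant share is 1 - 2/(n + 1) for n turns, so it reaches 80% at nine turns. Real sessions have uneven message sizes, so the actual share will differ.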

[00:01:35]You might think you can fix this by having another smaller AI summarize the prompt first, but running an AI to compress data for another AI introduces heavy delays. Even worse, those learned summarizers occasionally hallucinate or permanently delete critical facts.

[00:01:52]Stripping out this invisible waste requires an entirely different approach.

[00:01:57]We need a filter that operates flawlessly without guessing, running at the raw speed of the server's processor.

[00:02:04]That brings us to the Merlin engine. It is a single stripped-down C++ executable file measuring exactly 3.8 megabytes. It sits directly between the database and the language model. Its only job is deterministic, byte-exact deduplication.

[00:02:20]If two chunks of text match perfectly, down to the final byte, it deletes one of them before the prompt is ever assembled. This diagram maps the engine's architecture. Data moves through five stages, from ingestion to lock-free dispatch. Zero machine learning happens here. By skipping bloated languages like Python, data routes directly through an L2 aligned memory arena, where multiple processor cores sort text simultaneously.
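
The core idea of byte-exact deduplication can be sketched in a few lines of Python. This is not the Merlin engine's implementation, which the video describes as a multi-core C++ program; it is a single-threaded illustration that assumes chunks arrive as strings and that the first occurrence of each chunk is the one to keep. The name `dedupe_exact` is hypothetical.

```python
import hashlib


def dedupe_exact(chunks):
    """Keep the first occurrence of each byte-identical chunk, preserving order."""
    seen = {}   # digest -> list of distinct byte strings that produced it
    kept = []
    for chunk in chunks:
        data = chunk.encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=16).digest()
        bucket = seen.setdefault(digest, [])
        if data in bucket:
            # Same digest AND same bytes: a true duplicate, so drop it.
            continue
        bucket.append(data)
        kept.append(chunk)
    return kept


chunks = [
    "Reset the router.",
    "Firmware updates are monthly.",
    "Reset the router.",    # exact duplicate, removed
    "Reset the router. ",   # trailing space, so different bytes: kept
]
print(dedupe_exact(chunks))
```

Comparing the raw bytes after the hash lookup means a hash collision can never cause a distinct chunk to be dropped, which is what keeps the filter deterministic and lossless with respect to unique content.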

[00:02:45]Processing a standard retrieved payload through the system takes a median time of 1.10 microseconds.

[00:02:52]We have a piece of software so small and so fast, it practically vanishes into the system architecture, scrubbing the data pipeline without consuming meaningful system resources.

[00:03:03]This chart illustrates where that speed fits into an AI pipeline. The top bar shows the engine's 1.1 microsecond processing time. Compared to the 10 to 50 millisecond preparation budget below, the logarithmic scale reveals a massive gap. It runs four orders of magnitude faster.
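
The "four orders of magnitude" claim can be checked from the two figures quoted above. The short Python sketch below takes the 1.1 microsecond median and the 10 to 50 millisecond budget from the transcript and computes the base-10 logarithm of their ratio; the variable names are illustrative.

```python
import math

engine_seconds = 1.1e-6       # 1.1 microseconds, from the transcript
budget_low_seconds = 10e-3    # 10 milliseconds
budget_high_seconds = 50e-3   # 50 milliseconds

low_gap = math.log10(budget_low_seconds / engine_seconds)
high_gap = math.log10(budget_high_seconds / engine_seconds)

print(f"gap vs. 10 ms budget: {low_gap:.2f} orders of magnitude")
print(f"gap vs. 50 ms budget: {high_gap:.2f} orders of magnitude")
```

The ratio is roughly 9,000 at the low end of the budget and roughly 45,000 at the high end, so "four orders of magnitude" is a fair rounding.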

[00:03:20]But violently stripping out 20 to 80% of a prompt's bytes before the AI reads them raises an obvious concern. Does removing that much context damage the model's ability to provide an accurate answer? To find out, researchers ran an empirical evaluation across four top-tier production language models, analyzing 22.2 million passages of public data to test for any drop in intelligence.

[00:03:45]This forest plot tracks the change in output quality across 40 distinct evaluation tests. Every single data point hugs the zero line. The average quality change across the entire multi-vendor sweep was exactly 0.0 percentage points. The statistics confirm the engine is completely lossless. Engineers can shrink the size of their prompts, radically accelerating the entire system with mathematical certainty that the AI's final answer remains pristine.

[00:04:15]In a pipeline utilizing this architecture, the redundancy is gone. AI models no longer read the same paragraph three times over. The data streams arrive purified and the generators process the text substantially faster.

[00:04:29]Because cloud servers charge based on how much data they process in that initial reading phase, cutting the input payload drops the per call computing costs. It also drives down the time to first token, meaning users get their answers much quicker.
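
As a back-of-the-envelope illustration of the cost argument, the Python sketch below assumes input tokens are billed linearly and that the duplicated share of bytes translates into the same share of tokens. The price and token count are hypothetical placeholders, not figures from the video; only the 24% duplicate share comes from the transcript.

```python
def prefill_cost(input_tokens, duplicate_fraction, price_per_million_tokens):
    """Return (cost_before, cost_after) for one call under linear input-token billing."""
    cost_before = input_tokens / 1_000_000 * price_per_million_tokens
    cost_after = cost_before * (1 - duplicate_fraction)
    return cost_before, cost_after


# Hypothetical example: 8,000 input tokens at a made-up price of 3.00 per million tokens.
before, after = prefill_cost(
    input_tokens=8_000,
    duplicate_fraction=0.24,        # duplicate share quoted in the transcript
    price_per_million_tokens=3.00,  # placeholder price, not a real quote
)
print(f"before: {before:.5f}  after: {after:.5f}  saved: {1 - after / before:.0%}")
```

Under linear billing the relative saving equals the duplicate fraction, whatever the actual price is.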

[00:04:43]Right now, the dominant strategy for handling AI workloads is simply buying massive clusters of expensive graphics cards to brute force through whatever bloated text gets scraped together. By moving an exact byte matching filter upstream, we introduce a necessary efficiency.

[00:05:00]These massive probabilistic neural networks are incredibly powerful, but they operate best when they are fed by strict deterministic data engineering.

[00:05:10]We spend billions building the most complex, unpredictable intelligence systems in history, yet optimizing them comes down to a microscopic 3.8 megabyte piece of purp
