AI agents struggle with document parsing because PDFs are designed for display, not semantic interpretation, containing machine instructions for printing characters rather than structured text; this creates challenges with table recognition, reading order, and complex layouts that require specialized OCR and parsing solutions to convert documents into machine-readable formats for agentic workflows.
Deep Dive
AI Dev 26 x SF | Jerry Liu: My Agent Can't Read a PDF?
Hey everyone. Good to be here. I'm Jerry, co-founder and CEO of LlamaIndex. For those of you who might not have heard of us before, today we're focused entirely on building best-in-class agentic document infrastructure. If you have heard of us before, we started off as a pretty popular open source framework. We've done quite a few courses with DeepLearning.AI that you might be able to find if you search for them. What we started off doing was connecting LLMs with various sources of data as an open source framework.
Throughout that journey, we realized that certain types of data, specifically document-based data, were really hard to ingest and get into the right format for AI agents to understand. So we started building deep tech in this area so that AI agents, RAG systems, and agentic workflows can take advantage of the high-quality processing we do, and get high-quality context from the world's most complex PDFs, PowerPoints, and Word docs in a format you can feed into a downstream agentic system.
So today our core company mission is building best-in-class agentic document infrastructure: being able to OCR the most complex docs and put them into a format that your workflows can understand. We have a billion-plus pages processed and 300,000 users on our platform. We still care a lot about open source, more on that in just a bit. But we're focused on creating the right document context so that your favorite AI agent, whether it's something you built on your own or something like Claude Code or OpenClaw, can have access to the highest quality context from your documents.
So I think this sets up the premise. AI agents are inhaling massive quantities of unstructured data. The core alpha of any AI agent system these days is the context that you feed it. But a lot of that data is stuck in massive PDFs, PowerPoints, and Word docs, basically Microsoft Office documents across the enterprise. Whether you're in financial services, legal, or insurance, a good chunk of the working world still deals with a massive amount of paperwork, and it requires a lot of human effort to read these documents, extract information from them, and synthesize new information from that.
To give you a sense of our product stack, we have our core commercial service, which is LlamaParse. That's the core API, and I'll talk a little bit about it in the agentic document processing piece. We'll talk about why this is a hard problem in the first place. Document OCR has been around for something like 20 years, but there have been a lot of issues with some of the historical techniques, and with some of our capabilities we're able to bring it to a really high level of accuracy at low cost. The agentic layer on top is, of course, that once you have the document data, you can use it to automate various workflows. Whether you're a financial analyst, on the legal side, or processing insurance claims, you can use the right agentic system to help automate the end-to-end work over your documents.
So a good chunk of this talk, and the title of this talk, is basically: why is reading a PDF so hard? That's going to be the focus of this section. Before that, some of the downstream use cases of document OCR are feeding it to both agentic workflows and generalized agent systems. If you were building a RAG chatbot back in 2023, a lot of the early GenAI use cases started off with uploading a PDF and giving it to ChatGPT to have it understand the data within. By definition, if you're trying to give a PDF to an AI system, you need some sort of OCR layer, whether that is screenshotting it and feeding it to a VLM, using one of the free and open source tools, or using a specialized parsing tool. A lot of human work also happens, with armies of humans reading massive piles of data, whether it's loan applications, onboarding documents, invoices, or receipts, and validating that the data is correct. There's an entire legacy industry built around that called intelligent document processing, or IDP. These days everything is consolidating and expanding, because agents are able to handle a lot of the repetitive operational work over your documents and also, more generally, be the new knowledge worker within the enterprise. For that to happen, you need the right document infrastructure to process all these docs so that the agents have the right context to make decisions.
Okay, now we get into why document OCR and document processing are hard. Why is reading a PDF so hard? A good chunk of the world's data lies in PDFs, and the main issue with any PDF is that the format is not really written for semantic interpretation by machines. If you look into the internal structure of a PDF, it's a bunch of machine instructions to print characters for display purposes on a page. It basically looks like a bunch of random numbers representing coordinates, plus glyph symbols that you print on the page, and sometimes those map back into character codes. The PostScript format has been around since like the 1990s or so, and ever since the PDF format took off, there have been efforts to read this information from a PDF back into some sort of digitized text. The issue is that LLMs do not know how to read the native PDF binary on their own. So the job of any PDF parser is to convert this format into a clean semantic format that is both human-interpretable and machine-interpretable. We'll trace through the evolution of various approaches, but first I'll walk through some examples of why this is hard. Take a table in a document, say some random table with items and revenue. It could be an invoice, or some sort of SEC filing. You as a human can read this on a document and see it as a table. But within the file format itself, it's drawn as a set of lines and borders, plus values with coordinate positions. The format itself does not do any association or labeling of this entire thing as a table.
You don't know a priori that this cell text sits within this outer border and this row separator line without doing some clustering and heuristics, a little bit of math. That's just one example. If you look at any of these tables drawn within PDFs, they're oftentimes represented as just a bunch of strokes plus character positions.
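To make the "clustering and heuristics" point concrete, here is a minimal sketch of how a parser might recover a table grid from nothing but words and coordinates. The `Word` type, the tolerance values, and the sample data are all hypothetical and chosen for illustration; real parsers also use ruling lines, fonts, and many more rules.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge in page units (hypothetical coordinate system)
    y: float  # vertical position, increasing down the page


def group_rows(words, y_tol=3.0):
    """Cluster words into rows: consecutive words within y_tol share a row."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(rows[-1][-1].y - w.y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return [sorted(r, key=lambda w: w.x) for r in rows]


def column_anchors(words, x_tol=10.0):
    """Cluster left edges into column anchors, one anchor per column."""
    anchors = []
    for x in sorted(w.x for w in words):
        if not anchors or x - anchors[-1] > x_tol:
            anchors.append(x)
    return anchors


def to_grid(words, y_tol=3.0, x_tol=10.0):
    """Assign each word to the nearest column anchor within its row."""
    anchors = column_anchors(words, x_tol)
    grid = []
    for row in group_rows(words, y_tol):
        cells = [""] * len(anchors)
        for w in row:
            col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - w.x))
            cells[col] = (cells[col] + " " + w.text).strip()
        grid.append(cells)
    return grid


# Words arrive in arbitrary order, with slightly noisy coordinates.
SAMPLE = [
    Word("Revenue", 200, 100), Word("Item", 50, 100.5),
    Word("Widget", 50, 120), Word("1,200", 201, 120.4),
    Word("Gadget", 51, 140), Word("950", 203, 140),
]

for row in to_grid(SAMPLE):
    print(row)
```

The fixed tolerances are exactly the kind of hardcoded assumption that makes this style of parsing fragile: a table with tighter row spacing or right-aligned numbers would need different rules.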
Reading order is another thing. Say you have a two-column format, or a newspaper format where you might have multiple columns on a single page. You as a human would be able to read this on a page, and you'd probably read the left column first, then the middle, then the right one. But if you look at how it's stored in a PDF, the sequence in which the characters are stored could be totally arbitrary. Of course, when you print it for display purposes, the coordinates will reconstruct something that you as a human can read. But from a machine's perspective, the ordering of the instructions is not guaranteed to match the order in which you would read the document.
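As a rough illustration of the reading-order problem, the sketch below takes text fragments in arbitrary storage order and sorts them column by column. It assumes the column count is known and the columns are equally wide, which is a simplification; detecting columns is the hard part in practice. The fragment format and function names are hypothetical.

```python
def naive_order(fragments):
    """Top-to-bottom, left-to-right: interleaves the columns."""
    return [text for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0]))]


def reading_order(fragments, page_width, n_columns=2):
    """Sort (x, y, text) fragments by column first, then top to bottom.

    Assumes equal-width columns and y increasing down the page.
    """
    col_width = page_width / n_columns

    def key(fragment):
        x, y, _ = fragment
        col = min(int(x // col_width), n_columns - 1)
        return (col, y, x)

    return [f[2] for f in sorted(fragments, key=key)]


# Fragments as they might be stored in the file: no meaningful order.
FRAGMENTS = [
    (320, 40, "right-1"),
    (30, 60, "left-2"),
    (30, 40, "left-1"),
    (320, 60, "right-2"),
]

print(naive_order(FRAGMENTS))
print(reading_order(FRAGMENTS, page_width=600))
```

The naive sort reads straight across both columns, which is the failure mode a human reader never notices but a downstream model will.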
So the core idea is that document processing is pretty hard. This is an issue that a lot of scientists and researchers have tried to solve over the past 20 years, and it's become an increasing need in the era of AI, because a lot of this data sits in PDFs within enterprise organizations. Let's say you're trying to build a giant knowledge base from your Microsoft SharePoint data. How do you convert the format that it's in into some sort of plain text representation, like Markdown, that both humans and machines can read, so that you can give it to Claude Code or OpenClaw and it can read this data and do the things you want it to do? That's the goal state. Part of the idea is to really innovate on the core OCR layer to translate documents into a format that machines can understand.
Generally tracing through the variety of approaches, there have been a few decades of research and work on document understanding. Pre-machine learning, the way you would compile these types of coordinates into clean text, if you look at the implementation of some of the parsers out there, from a core engine like Tesseract to free and open source parsers like pdftotext or PyPDF, is that you would implement a bunch of heuristics. You would look at the text on a page, do some sort of clustering to associate text together, and use the spacing between characters to recover flowing paragraphs. You would also use OCR models, which are specialized models designed for reading images into text, and then figure out some way to combine all of that together. So you would extract the text, segment the tables, and then apply heuristics to stitch things together into some interpretable format. The issue with any heuristic-based approach is that it's brittle. It's hardcoded, so the moment the document format deviates from what the heuristic can handle, it's going to break.
There's been a recent trend of vision models. Before that, there were small trained models for document OCR too, like Donut and some other ones. The idea is that you have some document distribution, you train a model, and in a one-shot manner you feed a document to the model and it gives you back the reconstructed output. Then came the rise of vision capabilities in large language models. This started with GPT-4 Vision, and now vision capabilities are present in all the frontier models plus a lot of the open source ones too. This leads to baselines where you can feed a page into Opus 4.7: it takes in a screenshot of the entire page and tries to reconstruct it as a set of tokens that you can read.
For us, just to give you a sense, we do a little bit of both. There's a lot of benefit in being able to read really simple text pages quickly. But in the long tail of really complex pages, where you have anything from complex tables to dense charts to handwritten forms with a bunch of scribbles, it helps to use the generalized capabilities of these vision models so you can translate them into an interpretable text format.
Some general experiments from our side. Frontier models obviously have some innate vision capabilities, but they're not super well tuned for document understanding. There's Gemini 3.1 Pro, there's GPT-5.5, there's Opus 4.7. One thing we found across basically all these models is that increased thinking, if you tune the reasoning tokens, generally does not correlate with increased visual understanding accuracy. I think that's because a lot of the post-training for these models is on coding, reasoning, and math, and not necessarily on tuning visual understanding.
In general, using frontier models is a fine baseline, but it's usually pretty expensive if you try to deploy it as an OCR approach on its own. How many of you have used Claude Cowork, for instance? Okay. Or Claude Code? Claude Code, it's like 90%. Okay, cool.
If you've ever used native assistant agents like Claude Cowork or Claude Code, the loop you'll notice is that it first does some sort of free and open source text parsing on the document. Once it identifies a page, it screenshots the page and feeds it into itself. So if it's Opus 4.7, once it finds what it's looking for, it screenshots the page and feeds the page image into itself to try to answer the question at hand. That's fine for these assistive workloads, where if you have a Claude Max plan, the API users are basically subsidizing the token usage for you. But in general, if you're trying to parse a million PDFs at scale, this is not going to work. It's just going to be way too expensive.
There are a lot of different types of data you can see here. The image might be a little blurry. There are these really degenerate tables out there. There are all these charts, like line graphs, if you're in a financial setting. If you look at how a lot of the existing models reconstruct this data, you can see that in the visual grounding and the ability to capture every fine-grained detail, even for some of these frontier models, there are still gaps. It's not at 100% accuracy.
The other thing is that when users build AI agents, they typically want some level of auditability and citations back to the source data, especially from a user experience perspective. Let's say you're uploading a million legal documents, or just a giant repository of data, and the agent is reasoning through it in a data room and giving you back an answer. You typically want at least region-level or line-level citations. So having a good layout and grounding model is also super important. Being able to detect within a page that a certain element is a table, or that there is a piece of text, and to draw bounding boxes around pieces of text, means that when the agent gives you back an answer to help your research or your work, you can click on the citation and trace directly back to the specific region in the source document that it came from.
This is something that doesn't come out of the box if you just call a VLM API, but it's typically required if you're trying to build a functioning agent with the right citations over a set of documents.
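A small sketch of what region-level citations require from the parsing layer: each parsed element keeps its page and bounding box, so a quoted answer can be traced back to a region. The `BBox` and `Element` types and the `cite` helper are hypothetical names for illustration, not any particular product's output format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BBox:
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Element:
    kind: str  # e.g. "text" or "table"
    text: str
    bbox: BBox


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return " ".join(s.lower().split())


def cite(elements: List[Element], quote: str) -> Optional[Element]:
    """Return the first element whose text contains the quote, if any."""
    needle = normalize(quote)
    for el in elements:
        if needle in normalize(el.text):
            return el
    return None


ELEMENTS = [
    Element("text", "Revenue grew in the third quarter.", BBox(2, 72, 100, 540, 130)),
    Element("table", "Item Revenue\nWidget 1,200\nGadget 950", BBox(3, 72, 200, 540, 320)),
]

hit = cite(ELEMENTS, "widget  1,200")
print(hit.kind, hit.bbox.page, (hit.bbox.x0, hit.bbox.y0, hit.bbox.x1, hit.bbox.y1))
```

A plain text dump from a VLM has no `bbox` field to look up, which is why grounding has to be produced by the parsing step itself.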
The other component I want to talk about is benchmarking, which is pretty important for any model. All these frontier models are tuned on Terminal-Bench, SWE-bench, and all that stuff. For document understanding, most of the existing benchmarks are a little bit incomplete. I think it's still a work in progress. There's OmniDocBench, which is a popular choice for a lot of the frontier model providers as well as open source models. It's starting to become a little bit saturated, and it typically uses evaluation metrics that measure exact syntactic accuracy on your tables, so oftentimes it's too rigid for what agents care about. There's also olmOCR-Bench, which typically focuses on academic papers and uses binary methods for evaluation. For a given value, did you parse the cell to the left and place it to the left, versus the cell to the right and place it to the right? It's a binary pass/fail test. It's actually not bad, but the data distribution is still not quite reflective of a lot of enterprise document workloads, and the binary metrics are oftentimes too coarse for deeper document understanding.
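The binary pass/fail style of check described above can be sketched as follows. This is an illustrative toy under assumed names (`left_neighbor_test`, `pass_rate`), not the actual implementation of any benchmark mentioned in the talk.

```python
def find_cell(grid, value):
    """Return (row, col) of the first cell equal to value, or None."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell.strip() == value:
                return r, c
    return None


def left_neighbor_test(grid, value, expected_left):
    """Binary check: is expected_left the cell immediately left of value?"""
    pos = find_cell(grid, value)
    if pos is None:
        return False
    r, c = pos
    return c > 0 and grid[r][c - 1].strip() == expected_left


def pass_rate(grid, tests):
    """Fraction of (value, expected_left) tests that pass."""
    results = [left_neighbor_test(grid, v, left) for v, left in tests]
    return sum(results) / len(results)


TESTS = [("1,200", "Widget"), ("950", "Gadget"), ("Revenue", "Item")]

GOOD = [["Item", "Revenue"], ["Widget", " 1,200"], ["Gadget", "950"]]
SWAPPED = [["Item", "Revenue"], ["Widget", "950"], ["Gadget", "1,200"]]

print(pass_rate(GOOD, TESTS))
print(pass_rate(SWAPPED, TESTS))
```

Note that the good parse passes despite the stray space in one cell, which an exact-match metric would penalize. The flip side is the coarseness the talk mentions: each test yields only pass or fail, with no credit for a nearly correct cell.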
Ever since, we've narrowed a lot of our focus toward this effort of really deeply understanding the capabilities of models and pushing the frontier of document understanding. We recently came out with ParseBench, which is a comprehensive document benchmark for enterprise documents. It contains 2,000 human-verified pages across a variety of sectors.
It includes financial, insurance, and legal documents. We tried to source documents in the public domain that are reflective of real-world examples, and it measures across a variety of elements that people typically care about. This includes extremely dense tables. It includes charts: line graphs, bar graphs, pie graphs. It includes content faithfulness, making sure you're linearizing the document in the right way. It includes semantic formatting: oftentimes there are really weird font errors or translation issues, or words with crossouts or strikethroughs, and you want to make sure the parser represents even those granular stylistic differences the right way. And it includes visual grounding, making sure that you have the right bounding boxes for your document. There's a full white paper that our team submitted to arXiv, and you can go to parsebench.ai if you want to take a look for yourself. It's a fully open benchmark where we of course benchmark our own stuff, but also a lot of the frontier models and open source models out there. So if you want to see whether Gemini, GPT, or Opus wins in terms of document understanding, especially as the models improve, there's a leaderboard for tracking this.
We did try to be as unbiased as possible. Because it's an open leaderboard, we take contributions from any parser. There's a general distribution of various solutions on accuracy and cost. To give you a sense on the frontier models specifically: Gemini Pro is 8 cents plus per page, so it's super expensive. This doesn't have Opus 4.7, but Opus 4.6 is around 53% overall accuracy: pretty good on tables, but not great on visual grounding or charts. GPT-5.4 is maybe a little bit better with respect to cost. Gemini tends to be relatively competitive, especially if you turn thinking mode off, and especially the Gemini 3 Flash models.
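To put the per-page figure in context, here is a back-of-the-envelope calculation. The 8 cents per page and the million-PDF scale come from the talk; the 10 pages per document is an assumption made only for this example.

```python
def ocr_cost_usd(num_docs: int, pages_per_doc: int, cents_per_page: int) -> float:
    """Total cost in dollars, computed from integer cents to avoid float drift."""
    total_pages = num_docs * pages_per_doc
    return total_pages * cents_per_page / 100


# pages_per_doc=10 is a hypothetical average, not a figure from the talk.
cost = ocr_cost_usd(num_docs=1_000_000, pages_per_doc=10, cents_per_page=8)
print(f"${cost:,.0f}")  # $800,000
```

Under these assumptions a single pass over the corpus costs on the order of hundreds of thousands of dollars, before any retries or re-parsing.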
If you're interested, there's a ton of resources that we came out with. There's a Hugging Face leaderboard, it's on Kaggle, there's an arXiv white paper, and there's a website. We're going to come out with expansions of the benchmark too and add more data to it. There are a lot more tasks that we want to cover. We want to make sure it fully covers the entire distribution of enterprise docs. We want to measure handwriting. We want to measure forms and scans. But I just wanted to put this out there as one of the efforts our team is working on.
The other thing I'm pretty excited about is this thing called LiteParse. We released it about a month ago, and unlike our commercial service, which we make a little bit of money on, this is completely free and open source. If you're familiar with our open source tooling, we started off as an open source project with a very broad framework. This represents a continuation of our open source strategy, but focused on providing the best document-based tools to AI agents. Besides being open source, it's also very fast. It extracts text from your documents and doesn't use any VLMs at all. So the comparison points are tools like PyPDF and pdftotext, the free open source parsers that Claude Code or Claude Cowork will typically reach for, because they're in the training data, as a fast initial pass before doing some deeper VLM-based pass. We wanted to create something that performs a little bit better than these other model-free parsers. It needs to be fast, and we're working on a Rust-based implementation right now. It's also one-click installable as an agent skill, with native support for OCR and screenshotting. What this means is that if you install this as an agent skill and ask OpenClaw to do some deep research over a data room of various types of documents stored in a folder, it will use this tool to do a very light and fast OCR pass over your documents, reconstruct them in a human-readable format, and then reason over that and do some deeper VLM-based analysis if needed. The video for some reason isn't loading; otherwise I'd show you what the outputs are like.
However, I have another video. I know the TV screen is not super big, but this shows that with this tool we created a Claude Code skill: given some repository of docs that you want to do deep research over, it will generate a full HTML report for you using Claude Code. It's free too, so you can just use it now. It generates a full report with citations, and if you click on the citations you get back the full cited quotes with highlights in the source document. This is not really something you get with naive RAG-based implementations, and I was actually surprised that it could do this. It uses some of the native bounding boxes from LiteParse, and the agent harness is smart enough to then reconstruct them into word-level citations and bounding boxes.
This is completely free and open source. You can check it out if you want. It's meant to be used as a first pass for your assistant agents, if you're interested in taking a look.
Now, obviously, what are the limitations of this? It doesn't use any models, so it's not going to read some of your more complicated documents. It will do a handoff to OCR or some VLM-based approach, but it's really meant to be the first pass for Claude Code or OpenClaw.
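The "fast first pass, then hand off" pattern can be sketched as a simple per-page router. The signals and thresholds below (`min_chars`, `max_image_coverage`) are hypothetical and not how any specific tool decides; they only illustrate the shape of the decision.

```python
def choose_parser(extracted_text: str, image_coverage: float,
                  min_chars: int = 200, max_image_coverage: float = 0.5) -> str:
    """Pick a parsing route for one page.

    extracted_text: what the fast, model-free pass recovered.
    image_coverage: fraction of the page area covered by images (0.0 to 1.0).
    Returns "fast" to keep the cheap output, or "vlm" to escalate.
    """
    if len(extracted_text.strip()) < min_chars:
        return "vlm"  # little or no text layer, probably a scan
    if image_coverage > max_image_coverage:
        return "vlm"  # mostly charts or figures
    return "fast"


print(choose_parser("x" * 500, image_coverage=0.1))
print(choose_parser("", image_coverage=0.0))
print(choose_parser("x" * 500, image_coverage=0.9))
```

The design goal is that most pages take the cheap route and only the hard ones incur the cost of a vision model.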
Simon Willison also gave it a shout-out last week, which I'm personally very happy about. To give you a sense of what the output looks like: if you have some pretty complicated multi-document format, it will lay it out in a bunch of text. We have an entire blog post explaining how it works. But imagine you read the text on a page: we try to reconstruct the way the text looks on the page with a bunch of tabs and whitespace, so that it's both human-interpretable and agent-interpretable.
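The idea of reconstructing page layout with whitespace can be illustrated with a toy projection of word coordinates onto a character grid. This is not LiteParse's actual algorithm; the `char_width` and `line_height` values and the input format are assumptions for the example.

```python
def render_layout(words, char_width=6.0, line_height=12.0):
    """Render (x, y, text) words as plain text that mimics the page layout.

    Each word is placed at column round(x / char_width) on line
    round(y / line_height), padding with spaces as needed.
    """
    if not words:
        return ""
    lines = {}
    for x, y, text in words:
        row = round(y / line_height)
        col = round(x / char_width)
        lines.setdefault(row, []).append((col, text))
    out = []
    for row in range(min(lines), max(lines) + 1):
        line = ""
        for col, text in sorted(lines.get(row, [])):
            # Keep at least one space between words that would collide.
            pad = max(col - len(line), 1 if line else 0)
            line += " " * pad + text
        out.append(line)
    return "\n".join(out)


WORDS = [(0, 0, "Item"), (120, 0, "Revenue"), (0, 12, "Widget"), (120, 12, "1,200")]
rendered = render_layout(WORDS)
print(rendered)
```

Because both values start at the same character column, a reader or an agent can see that "Revenue" and "1,200" belong to the same table column without any explicit table markup.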
Besides that, I think I talked about this already, but once you move back into the VLM-based world, using visual models to help with document understanding, what are the components you need? You need some core models to read the various types of elements on a page: tables, charts, and all that stuff. Then you want to stitch them together in an agentic harness so that you use certain models for certain tasks, like layout, charts, and tables. You orchestrate all of them together, prompt-tune and fine-tune the models themselves, and then create a final output for the user, complete with all the metadata you need, including citations and bounding boxes.
To give you a quick sense, the thing we've mostly talked about is parsing, which is translation of a document to Markdown. This takes a document to a representation that your agents can natively understand. There are some deeper layers to this too, like document extraction. If you have a structured schema and you want to extract from a million documents at once and output structured JSON that you can ETL into Snowflake, Databricks, and so on, that's also an endpoint that we have, and it's a layer on top of the core OCR. You have some schema, and we'll have an agent that goes in and fills out the outputs according to whatever schema you define through natural language, without any training involved.
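As a sketch of the schema-driven extraction idea, the code below defines a target schema and validates a JSON payload against it before loading it downstream. The `Invoice` schema and `validate` helper are hypothetical; the extraction agent itself is out of scope here and is represented only by its JSON output.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Invoice:
    """Hypothetical target schema for extraction."""
    vendor: str
    invoice_number: str
    total: float


def validate(raw_json: str, schema):
    """Parse extractor output and coerce each field to the schema's type.

    Raises ValueError if a required field is missing or cannot be coerced.
    """
    data = json.loads(raw_json)
    kwargs = {}
    for f in fields(schema):
        if f.name not in data:
            raise ValueError(f"missing field: {f.name}")
        kwargs[f.name] = f.type(data[f.name])
    return schema(**kwargs)


# Stand-in for what an extraction step might return for one document.
raw = '{"vendor": "Acme", "invoice_number": "INV-7", "total": "1200.50"}'
invoice = validate(raw, Invoice)
print(invoice)
```

Validating at this boundary means a malformed extraction fails loudly per document instead of silently landing bad rows in the warehouse.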
Some of our customers: Carlyle was one of our earliest customers, building end-to-end due diligence agents over data rooms of documents. That means helping to automate the process of scanning a bunch of docs within a data room, saving the human analysts time, and helping to pull the financials together and create the financial model.
Cemex is another customer. Since the beginning, they've been creating massive numbers of specialized agents over all their company knowledge in Microsoft SharePoint. That means indexing a lot of this data extremely quickly, parsing it with LlamaParse, storing it in a downstream storage system, and enabling their end users, whether customer support, operations, or internal teams, to get insights a lot quicker from that data, and in a very accurate manner too.
How much time do I have? Okay, that was about 25 minutes. Maybe just some quick notes on the general macro market. We started off as a RAG framework, so we still have a decent amount of insight into the general agent space. Obviously, we've found our focus as the core document infrastructure that plugs into a lot of these agents, but there are some interesting trends in how agents are evolving in general. In 2023, I would say most people were just building basic RAG. I had a course with Andrew Ng in 2023, which was very exciting. It was on advanced RAG techniques, and we had a bunch of stuff in open source that helped people build more advanced ways of retrieving from their docs. As we went into 2024, people started getting interested in agents, and more specifically in really basic agents: agentic workflows that could start automating some initial actions with the right guardrails and a human in the loop. Then in 2025, especially with the rise of coding agent capabilities, we started to see a shift toward full-on agentic harnesses that reason on their own: you prompt them in English and they go off and do increasingly long-horizon tasks. I think that's really what's going to get us to this 50 to 80% plus work automation, where you have an army of agents, whether for coding, which we're already seeing (how many of you are actually writing code by hand anymore?), or for general knowledge work, being able to dispatch an entire army of agents to go and do things for you.
The model harness is defined as basically everything surrounding the core model that allows it to be an agent. Claude Code is a model harness on top of Opus 4.7, and OpenClaw is obviously a model harness too. The core components are things like MCP, skills, tools, and the agentic loop itself. For any builder out there, as a startup, the general model harnesses are getting pretty good. A lot of people could probably get by just wrapping Claude Code and programming skills in English. The thing you provide that gives you the alpha or moat on top of this is really the context and the workflow layer. This includes your organization's business logic, and it includes the context. For us, it's document-based context, but you pair that with other types of context too, whether it's structured databases, web search, or something else. That's the role of any AI engineer these days: figuring out the workflows and the context you provide to these really powerful agent harnesses to automate as many tasks as possible, because that abstraction has started to get a little more centralized.
The ways that you ingest context have gotten more sophisticated: everything from basic RAG about three years ago, to MCP, to skills, to the general idea of having an agent in a sandbox with a bunch of files that it can do file search over. The idea of orchestration has also changed. I do think there's a split. On one side is creating some sort of deterministic, routine workflow, which still has a lot of value for automating low-complexity business processes. Anything that's repeatable you can codify as a workflow with the right guardrails and tasks, and run it repeatably at scale for any new data.
And then, from 2025 and 2026 onwards, an increasing portion of general tasks will just be handled by prompting generalized agents. Related to what I was saying, instead of writing code to create an agentic workflow, you write English and provide the right context. Your goal is to write English and provide the right context for as many tasks as you can, so that these general agents can solve as many tasks for you as possible. We're probably going to see an increasing shift toward generalized agents, especially as task horizons get longer, meaning the time it takes to complete things. Agents were capable of solving tasks up to about 30 minutes long. Now they can solve tasks up to about five hours long, and this is the year they can start solving things in an ongoing, forever fashion. I wouldn't be surprised to see everybody managing an entire army of agents by the end of the year, if you aren't already.