In modern AI development, data is the most underestimated critical factor, often overshadowed by compute and talent. As AI models become more powerful and more black-box, upstream data problems become hidden behind layers of abstraction, yet they fundamentally determine model performance. The more powerful AI becomes, the more critical it is to understand that data quality, curation, and context, rather than algorithmic tweaks, are the key ingredients for success. Every enterprise will need to run its own data-centric loop of measuring with data, finding gaps, and building more data to fill them. This is particularly important because benchmarks are getting 'benchmaxed' before they are useful, and understanding where AI fails is now a safety question, not just a capability question.
Deep Dive
GAEA Talks Live from HumanX - Why Data Decides Who Wins AI with Snorkel CEO Alex Ratner
The more powerful, the more black-box and easy to use the AI becomes, the more these often dumb-seeming upstream problems get hidden.
Look, there are some really fascinating emergent properties of the models that we can't explain, but a lot can be traced back to what data you put in.
So, they thought the challenging things were going to be complex mathematics and long-form logistics and chess and things like that.
And they thought things like seeing or touching or speaking were going to be trivial. So they literally gave all of computer vision, which is still a multi-billion-dollar industry of productizing and researching, to one intern for a summer project.
Live from San Francisco and HumanX, Alex Ratner from Snorkel.
How are you doing?
>> Good. Thanks so much for having me here. I guess I can introduce myself quickly.
>> Yeah, that would be fantastic.
>> Give the spiel. So, Alex. I'm one of the co-founders and the CEO at Snorkel AI.
We're a company that does data development for AI, or we call ourselves a frontier data lab for AI. Basically, we put together the datasets and, if you're deep in the weeds here, what are called the environments that help to evaluate and then to tune or train AI. A really simple way to think of that, which isn't that far from the truth of what it actually looks like: imagine you're trying to teach a student something or test a new employee on something. You might give them a bunch of test questions to evaluate them, and then you might give them a bunch more practice questions where they had gaps. And that is one of the most critical development cycles happening behind the scenes of pushing AI forward.
There are three legs of the stool that people talk about today if you think about frontier AI, but increasingly we think it's going to be every company building its own specialized AI and agents. We can talk about that later. Wherever it is, there's compute, there's talent, and there's data.
Taking a step back, my other quick intro is that I was on the academic side. I'm on affiliate status right now, but still a little bit active at the University of Washington, where I had a lab, and prior to that I was at Stanford, which we spun out of and where one of our co-founders is a professor. So the company's been around about 6 years, but I've been working in AI data for about 15 years, counting Stanford and U Dub. So I'm dating myself as a real dinosaur. They kick people out of this conference for saying things like that, so you've got to protect me if someone overheard.
But really, 15 years ago, when we started the work that led into Snorkel, no one in AI, certainly no one in the Stanford AI lab, was thinking about data. Everyone knew that data, and we could talk about this, is the key ingredient of AI, or what we called machine learning back then. But no one really thought of it as the problem of an AI researcher, an AI company, an AI person to solve. It was someone else's problem. It was upstream, janitorial.
And we started talking to real users, forward-deployed-engineer style, on the academic side, and realized that no one really cared about our fancy algorithms and fancy new models. They were stuck on the data.
And then we saw some of the trends of deep learning, which then became LLMs and everything we see today. We got it totally wrong in guessing the pace at which AI would accelerate like it has today. But we thought that if this continued, these things would become so big and black-box that how you develop them would be a lot more about the data you put in than how you tweaked and tuned them.
Just like how you onboard an employee or teach a student is a lot more about what data and information you feed them, not performing brain surgery and rearranging neurons. So that's me in a nutshell: company and academic work all around this idea of data-centric AI, the idea that the data you feed in and how you develop that data, what data you pick, how you curate it, how you use both humans and automation to get the best quality data at the pace of the frontier, is the key ingredient and the key development at the center of AI in many ways. Longer than it should be for a quick intro, but hopefully it gives a gist of what we could talk about today.
>> I think that from my perspective, and from the numerous conversations and experience I've had over the last decade or so working with enterprises and all manner of different companies, I am yet to find a company, especially the bigger it gets, that has its data together. Yep. No one does, but it's assumed at every single stage that it'll be fine.
I mean, as an example, we worked with a very large company here in the US, and the issue was sitting with emergency management. A big hurricane comes by, and we had to mass communicate to 10,000 people, and 5,000 people got the messages.
And the client says, "Well, we're not very happy that 5,000 didn't go out." I said, "Right, well."
That seems like a little bit of a slip-up in emergency management.
Exactly. But it turned out to be a data problem, because they had adopted an HR system which was opt-in. There was no field recognition and there were no field rules. So emails didn't need an at sign. Phone numbers could be words. And a lot of them were pretty much "piss off" or something at you.com. They just made stuff up, or there was nothing. But no one checked. So when that data flowed on and was given to the emergency management department, they did try to send emails to addresses that don't exist and ring phone numbers which were not even numerical characters. And it took such a long time to say, "Well, how do you fix the root cause of the data problem?" And that means, "Well, we need to look at how we manage HR, how legal deals with things, how compliance and onboarding happen." It grew into this massive issue, but something absolutely fundamental wasn't taken care of. That was the most obvious example.
>> No, but it's a great example. A lot of those examples are obvious. But the more powerful, the more black-box and easy to use the AI becomes, the more these often dumb-seeming upstream problems get hidden behind layers of abstraction. And speaking about data, there's a particular type of data curation and labeling and generation that we do in the tool chain. There's a broader universe of data, from the data lake where the stuff sits, to the integration of different data feeds and databases, to the kind of operations we do. But just talking about data as a whole, one of the phrases that people are increasingly using these days is context. And so I'm making a connection that not everyone makes, but it's really...
>> I like that. Can I just say, I see there are two contexts.
There's the technical term of context within AI, and there's context from the human perspective of observational intelligence, relating to the context of where we are, here and now. The context is that we're doing a podcast.
>> They blend. It's a good distinction. The technical term, as far as these terms are standardized, for a large language model, for a chat AI like ChatGPT or Claude or an open source model today, is: the context is what you put into the prompt.
It's sometimes dangerous to anthropomorphize terms, because AI is very different from humans: smarter in some ways, way dumber in other ways, and unpredictable.
People call this the jagged frontier of intelligence. But it is a decent analog to context for a human. Think about the smartest possible human being.
You drop them in and you don't give them any context. You don't give them the database of names of who to email. You don't give them the constraints or the company brand guidelines. You don't give them the objectives of what they're trying to do.
They can't possibly succeed.
So, for an LLM, the context, yes, means what you drop into the prompt.
Or now there are skills files if you use Claude, or whatever it might be, but it all just gets shoved into the prompt that you feed into the model.
But the broader sense of the word, like you pointed to, is the same thing as for humans: what input are you given?
And now I'm stretching the most common usage a bit, but I would say that this context comes in at all phases, from the first initial training of a model all the way to when you're trying to use it at your computer screen.
And I think all of that is critical, and it all has to do with data, getting data into the model. Models start with pre-training, which you can imagine as the infant stage, just soaking up lots of grammar and instances of people talking. For LLMs, it's soaking up the entirety of the internet. But some of the research that students I've gotten to work with have done in the open source recently, from U Dub and Stanford and elsewhere, has been about how the mix of data you curate, what context you give the model when it's first learning, is actually critical to performance. Pretty intuitive, right? If you put in more finance documents than, I don't know, cinema documents, or more angry versus happy language, it's going to tilt the model. It's different context for the model. And then any company that's building or tuning an AI model or an agent is putting in more data at all stages. There's a post-training stage, which you can think of as like going to college or grad school.
There's reinforcement learning, which is a type of that and which people are very excited about right now, but again it's just more teaching, giving the model context on how to do certain things.
And then there's the context that you give your chatbot or agent right when you're asking it to do something, either in the prompt or piped in from some knowledge base in your enterprise. In all these stages, it's the same simple idea: you're trying to give context, or data, to the model so that it knows what it should do.
>> And that's the key part. This is really important for, I think, a lot of people who may think they understand AI, and may well do from a siloed perspective. I've had so many conversations with the AI researchers I work with about that data collection or data creation, then the pre-training, then the training, then the post-training, the idea of distillation, the idea of pruning.
But all of that has to do with context. Context being functionality, purpose. Where's it going to be used? How do you create something which could then be entirely scalable by horizontal, independent deployment at the edge,
but which still has some relationship with a thin sliver of wider cloud-based interconnectivity?
Of course, if you're trained primarily to be a pole vaulter, that's probably going to make you a good pole vaulter.
But if you say, "Oh, well, you're an athlete, so now run a marathon," and you've never trained for a marathon, why would you expect a particular outcome?
But you can have a team, and I suppose that leads to the agentic idea.
Different skill sets, different expertise.
And the efficiency gains are astronomical. But all of that comes from knowing all the points and purposes, and understanding that training isn't just one thing and data is not one thing. It's a process.
>> There are so many interesting points to branch off on there. First of all, there's this idea of generalist versus specialist, which we think about with humans, right? There are trade-offs, and for humans there's a notion of ability to generalize. Sometimes you want someone who's an all-around athlete, and you figure, "I would rather take the best all-around athlete and teach them to pole vault" versus someone who's only ever pole vaulted but is less strong. There are trade-offs. And in most areas of human specialization, we still try to educate someone in a well-rounded way, with a spike. So we're following those kinds of paradigms on the LLM or AI side. I'd still say it's very early days. Right now, we're seeing such tremendous gains from just continuing to improve the generalist that has context on everything and data for everything. And I'm not against that. That's very good for companies like Snorkel, to be very clear. And it's very exciting for any of us in AI, with all the positive benefits that are to come with that. But I think we're going to see a swing, inevitably, back towards some generalist bases.
Utility players are always important, and generalists who have, so to speak, gone to college as generalists and then train into specialists, whether through tuning or distillation or some mix of the two, are going to remain a paradigm.
So it's not generalist or specialist. It's more about both. In this world, you have your super-generalist utility-player model that maybe takes the first call in from the customer, or sits next to you and triages as a co-pilot chatbot, and then you have all these specialists. And I very much believe there are going to be many, many specialist models, agents, or whatever you call them. The key to navigating that mixture is all about the data and the context that you give them. So I think it's going to be both. I think it's going to be a very interesting, unique world. And it also happens to be the version of the world where the unique data and the unique context that every organization and every individual has actually gives them that unique specialization edge, versus just, let's say, three models ruling everything.
>> Yeah. So, where did you get the inspiration from as a kid?
I'm always fascinated to understand this. When you're a kid, you say, "Right, I want to be an astronaut. I want to play ball," whatever.
What was yours?
>> Good question. The earliest concrete ambition I remember was being a chef, fireman, and doctor, which I've heard is a hard trade-off to balance. And then I wanted to be the CEO of a rocket ship company, which I also didn't hit. So that's now four failures, I guess, depending on how you group the counting. But I was always a nerdy, sci-fi-oriented kid, which I guess is fairly generic for this area of SF. I was fascinated by the idea of AI and of automation, and of how you systematize things. How do you take all the craziness of the world and find out how to put structure and order around it? Which is one way of phrasing automation,
especially if you believe it's not there to replace people, but to standardize and accelerate things.
So I think that was a kernel of interest, and then I got into AI.
I was a physics and math undergrad. I just thought: fun, tough problems.
Solving fun, tough problems. That was the full extent of my career planning. I worked super hard, but I wouldn't say that I had a good trajectory in mind of what career that led into.
I just thought, solving cool...
>> But problem-solving is an instinctive, almost obsessive thing. Not in a negative sense, but...
>> No, obsessive is definitely the right word, yeah.
>> ...a fascination and an inquisitive mindset that asks, "Why? What's happening here? How can I make it better? How can I improve things?" That would be the same for Einstein, Da Vinci, all sorts of people who tend to be... generic is the wrong word, because I've put myself in a similar category, but curious in a general way. Agnostic of anything in particular, and in many ways removing the labels to see from the physics perspective, because language is very much a human construct. And if you can take the labels off, then you have the right to put them back where appropriate, in a systematized manner.
>> Yeah, very germane. I like that a lot. I will definitely not liken myself to any of those figures, but there is that sense of general curiosity. One of the stories, and I'm going to butcher this and should probably just pause and look it up on Wikipedia instead of plowing on, but I will plow on, because we're in the middle of an interview. There's a famous example of Richard Feynman. He was stuck on some problem, and he decided to let himself just... you know, he was an eccentric character. I think it was him. I could be getting the figure wrong.
>> Richard Feynman is my most favorite human ever to have lived, by the way.
>> Then you'll correct me if I'm completely mixing up the historical persons.
This was definitely a scientist. I think it was Feynman, and it sounds like him.
He was stuck for a while, and then he let himself just mess around and work on what he called unimportant problems. And I think there was something where he was in the cafeteria and he was just spinning a plate on his finger.
And this got him thinking about how to think about spins and orbits, and it actually led into his Nobel Prize-winning work.
>> Wow.
>> Again, I could be totally wrong. But I think that's the general curiosity you're talking about, and the problem-solving drive.
>> And that does resonate as well. I mentioned this on a podcast previously, but there's a documentary where he was interviewed by the BBC. You may have seen it. It was to do with him talking about being with his dad, and his dad saying, "Tell me about this bird." And his dad told him every name of every bird. He says, "Now tell me what you know about birds." And he said, "Oh, I know this one." He says, "You know nothing. You don't know anything about how a bird works, how it flies, what the aerodynamics are. You know their names in English, in Japanese, and all these other languages, but you know nothing about the bird."
And I think it was that mentality. He was saying it in relation to just observing, as with the O-ring when he was part of figuring out the space shuttle disaster: the plastic being too cold, becoming rigid, and how he demonstrated it with just a glass of ice.
And sometimes life is that simple.
To intentionally say, "Okay, don't force the issue. Do something else." And if you can do something else, play guitar. He played bongos. Just go play the bongos, and you allow that space in your mind. It's such a simple, fascinating thing. And we should look at it as human beings and say, "That's a fascinating insight." If we're looking at AI, how we build AI, and what data is there, there is a huge amount of information which is subconscious.
Like the temperature right now. You're probably pulling in sensory input and calculating a vast amount, 99% of which you're not even cognizant of.
But there is a contextual element.
>> Yeah, it's back to that idea of context, right? The idea that there's so much context baked into every decision we make. I'm taking only one part of what you were saying there, but there's so much rich context, both learned over the years, which would be our equivalent of pre-training, and coming in from sensors, from things on our phone, from all those feeds. As the equivalent of that, a lot of the companies here at HumanX, whether they're a vertical company or a horizontal toolkit, a sizable percentage of those are about bringing the right domain-specific, organization-specific context into the model. And once you have all of that, in some sense it might just be a simple insight that's required to unlock the problem or the answer.
And to your point, and the point of some of these Feynman stories, it seems like this simple little lightbulb moment. But it's built on this mountain of context.
And to shamelessly segue back to AI data, that is what happens in training these models and agents and then connecting them to the right data once they're deployed in a certain specialized setting: there's this very large mountain of information that goes in before you get this last-mile tweaking or tuning or interaction.
And a lot of that is not seen. But more human hours of instruction in any given subject you can think of, by token count, by hour of human effort, have gone into a modern AI model than any human has received in 10,000 lifetimes, which is staggering.
And I guess, to complete the trajectory, I'm fascinated by that now. Originally I got into AI 15 years ago really just from an applied perspective, because it was not as hypey back then. Or it was getting hypey, but not at this whole other level. It was still called machine learning,
which technically is the correct name for what we're doing, because it's AI learned from data. Everyone is so fascinated with that subset that they just smoosh the names together. I'll cement my status as an old dinosaur by saying that all of this is still machine learning, technically. Or I guess we're doing more test-time inference now, but all of it's built on machine learning. Anyway, for me it was just a problem, a very simple problem. It's back to this theme of a really mundane problem leading into interesting stuff.
I remember I was working in this consulting job. I had done math and physics, and I almost went to grad school for materials science, decided I didn't want to, and went and worked in finance for a little bit. And it was a dumb problem. I was trying to look up some patent records.
And I tried to write a script to basically just normalize all the different ways that IBM was referred to.
There were maybe 40 or 50 different ways to expand IBM, or have punctuation, or have subsidiaries. Now this would not be that hard, but back then it was not quite trivial, not just a heuristic pattern match. So I got sucked into this rabbit hole of, oh my god, you have this objectively fascinating thing, the patent corpus, which even back 15 or 17 years ago, whenever this was, you could put on a thumb drive, at least if you stripped out the images. That's everything that anyone ever considered patent-worthy, in all of modern human invention, on a thumb drive.
Fascinating.
But even just looking up all of the records of a single company was not solvable by just coding something up, let alone getting useful information from it.
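To make the name-normalization problem concrete, here is a minimal sketch of the kind of heuristic script described above. It is not the script from the story: the suffix list, the alias table, and the sample strings are all hypothetical, chosen only to show why hand-written rules cover some variants and silently miss others.

```python
import re

# Hypothetical legal suffixes and aliases, chosen for illustration only.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}
ALIASES = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "intl business machines": "IBM",
}


def normalize_assignee(raw: str) -> str:
    """Map a raw assignee string to a canonical name, or return it cleaned."""
    # Lowercase, spell out ampersands, then drop every character that is
    # not a lowercase letter, a digit, or a space.
    text = re.sub(r"[^a-z0-9 ]", "", raw.lower().replace("&", " and "))
    # Remove legal suffixes such as "inc" or "corp".
    tokens = [t for t in text.split() if t not in LEGAL_SUFFIXES]
    key = " ".join(tokens)
    # Known aliases map to the canonical name; anything else falls through.
    return ALIASES.get(key, key)


variants = [
    "IBM",
    "I.B.M. Corp.",
    "International Business Machines Corporation",
    "Intl. Business Machines, Inc.",
    "IBM Research GmbH",  # subsidiary: not covered by the rules above
]
for v in variants:
    print(f"{v!r} -> {normalize_assignee(v)!r}")
```

The first four variants resolve to the same canonical name, while the subsidiary falls through untouched. Every newly discovered variant needs another rule, which is the rabbit hole being described.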
So I fell into the AI hole just from a very dumb-sounding problem: how do you even look up basic information in one of the many incredible stores of data you have? And it turns out that with just writing code like we used to, software 1.0, it was very difficult to pull anything useful out of unstructured data.
>> But this is where... I wouldn't say the simple problems are solved.
I would say often important and compelling problems are solved. We had a similar thing.
We were observing how a virus behaves during the pandemic, as far as transmission goes.
And that helped us understand how to map space and time in four dimensions, so that you could fluidly understand the relationships of cause and effect across space and time.
You wouldn't think that you would figure anything like that out by asking, "Well, how does a virus work?" But the rules around transmission are actually describable in relatively simple terms, because a virus doesn't care about your religion, your race, or anything else.
>> Yeah.
There are variables which can be described, and we began to look at that. It led to something completely unrelated, but it was the ingredient. And a few years before, we'd been looking at, "How do you describe someone saying hello?"
Now that sounds silly.
But if you're in Newcastle in England, you'll say one thing. If you're in Liverpool, like the Beatles,
they'll say hello like "Watcha." Across the United States it's different. Australia, "G'day." So before you know it, even just in English, there are thousands of ways of saying hello, which also describe, geo-temporally, probably a location within the UK.
You can listen to a handful of words and probably tell where someone is from within 10 miles.
>> Which is mental.
>> That is really interesting. It's these little things that end up, not always, but often, leading into these much more interesting problems.
>> And it's data capture. It's data relationships, data understanding. I always think about where the Beatles are from. If you trained a model on a Scouser, a Liverpool person, or Oasis-type guys from Manchester, or a Cockney from East London. And you can have someone from San Francisco or someone from the Jersey Shore.
And you train them specifically in these areas: what are the properties of that model, and what would the differences be? It would be fascinating, funny, humorous, but...
>> It would be fascinating. This is another topic that we think a lot about at Snorkel, in our role supporting pretty much all of the major frontier labs, but also a lot of the folks who are building specialized models, both vertical AI companies and enterprises. And it goes back to this theme of specialization.
Something we think about there, and also on the academic side, is, and I guess it's the same generic point, just how much the data that you put in has these far-reaching effects on the outcome. To your point, if you train a model just on Jersey Shore versus Liverpool dialect, you will not only get a model that sounds different, especially if this is an audio model too.
You will get so many subtle differences in what it knows, in how it thinks about things, and so on. Again, this is such an obvious thing to say out loud, but one thing to emphasize is that we work in this area both academically and commercially.
And maybe I sound like I'm knocking ourselves by saying this, but we are so early and immature in figuring out the science of what we should bake into these models. Right now there's a huge amount of effort going into this, and we've done research on this front, on how to get the mixture of data right at all stages to get the type of outcome you want. But it's so subtle, right?
It's not just Jersey Shore versus Liverpool. Think about the analog being that you collect data with a 10% bias to this corner of the internet versus that corner of the internet. That can have very, very subtle impacts on how the model thinks and reasons and speaks, and what it knows downstream. So look, there are some really fascinating emergent properties of the models that we can't explain. But a lot can be traced back to what data you put in. And this is still such an understudied area. We think it's fascinating, and we're trying to put more structure around it. A lot of our work, academically and at the company, is on how you use both expert humans and automation to be more precise and transparent about how you structure and create and curate the right mix of data.
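The point about a 10% tilt toward one corner of the internet can be illustrated with a small sampling sketch. Everything in it is an assumption made for illustration: the two source names, the placeholder documents, the weights, and the `sample_mixture` helper are hypothetical and do not describe any lab's actual data recipe.

```python
import random
from collections import Counter


def sample_mixture(sources, weights, n, seed=0):
    """Draw n (source_name, document) pairs according to mixture weights."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    names = list(sources)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(n):
        # Pick a source in proportion to its weight, then a document from it.
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append((name, rng.choice(sources[name])))
    return batch


# Hypothetical sources with placeholder documents.
sources = {
    "finance": ["10-K filing excerpt", "earnings call excerpt"],
    "cinema": ["film review excerpt", "screenplay excerpt"],
}
baseline = {"finance": 0.5, "cinema": 0.5}
tilted = {"finance": 0.6, "cinema": 0.4}  # a 10-point tilt toward finance

for label, weights in [("baseline", baseline), ("tilted", tilted)]:
    batch = sample_mixture(sources, weights, 10_000)
    counts = Counter(name for name, _ in batch)
    print(label, dict(counts))
```

The sketch only shows how a mixture weight changes what the model sees. How such a shift changes what the model knows or how it reasons downstream is the open research question raised above.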
But it's still so early.
>> It's also a matter of time as well. You could look at any time when something new happened.
>> Yes.
>> And look at the industrial revolutions and um even like the invention of the steam engine.
And I somebody was having a a chat the other day about why the lunar lander had its legs a certain distance apart.
And that was all to do with well, the train system that it was carried on and the the shape size of the rockets were determined by what the train could carry. And you could root it back to Stephenson's rocket in effect. And the choice of well, how do we choose whatever 6-ft something >> for the train gauge. And then you think, well, what's this got to do with that?
There has to be a period of time that you watch everything unfold.
And there will be positive and negative consequences.
But we have to keep observing. I mean, we're really so fresh into this new, we'll call it the common AI wave. It's AI applied in the public psyche.
And from a social media perspective, the intent would have been for one thing, and there have been peculiar consequences. There's been good, there's been bad.
In the same way, you could hold a knife and use that to be a Michelin star chef. I could take the same knife and do some horrible things with it. It's only after enough time that you can observe and then say, ah, we need to change the ingredients. We see these iterative changes, and it's reversing back into that pipeline of pre-training, training, post-training.
>> I mean, we do have to have some awareness that that's going to take time. At the same time, I think everyone in the world right now, and I'm saying that coming off like an SF bubble inhabitant, right? But I think it is fair to say there is a lot of urgency to figure that out sooner, given the rate of increase and the scale of impacts. And I'm not someone personally who thinks AI is going to be done and humans obsolete next year.
I'm the AI person, so I'm obviously a believer, but the classic phrase is that people tend to overestimate short-term impacts and underestimate long-term impacts. So, I'm probably in that camp. My day job, both on the research side and at Snorkel, and I say this without sharing anything proprietary about the models or their data recipes, is that literally 90% of the data we build has to be hard for the model to get right. So, if you go back to the analogy of creating and curating all these test questions, right? We don't get paid to get test questions that the model can answer in its sleep, because that's not additive. We get paid to build data, whether it's an agentic coding task or a medical question or a financial presentation-building task, where the model struggles. So, I'm sorry, this is a bit of a detour, but just before going back to the pace question: literally my day job and our day job at Snorkel is to probe where the models have gaps and build data there. And so, there are lots of gaps, and there's an ever-expanding surface area in training these models that will do literally everything.
And I believe in continued human innovation, so I think that notion of everything is always expanding because of not just AI, but human innovation.
Arguably, it accelerates. So, I think it's going to be quite a long venture to find all those gaps, let alone fill them in, as the world is itself changing and advancing and AI is making knowledge increase even faster. So, I do think the world is going to change quite radically. I don't think it's going to be next year. I think we have some time.
All that being said, I agree with the general sentiment that we have to figure out very quickly whether that knife is going to be used for cooking or stabbing. And so, a big part of what we do, and actually I can briefly show a new kind of grants program we put out there.
A lot of what we do every day, commercially and then on the open source and academic side, is try to build benchmarks, which are basically data sets to figure out where models are working and not working in new areas.
And there's a big bulk of our work for a lot of the frontier labs: build a benchmark of, I'll simplify it, 10,000 coding questions to see where my coding agent is working and where it's not working. And then when it's not working, okay, let's buy more data or environments there to improve it. We also do this on the open source side. We just launched what we call open benchmark grant, open benchmarks grants. Sorry, double plural. It always confuses me. So, we've committed 3 million initially. I think we're going to grow it quite a bit even in the next couple of months, to basically fund open source academic teams, anyone who's contributing something that is open on new benchmarks to measure new kinds of capabilities. It's not all just about safety. There are many ways to inspect these models, but one is to build little tests or exams for them and figure out what they can do safely and what they cannot do. This is for safety, but it's also just for capabilities.
>> This is like progressive unit tests on a perpetual basis.
>> Exactly. Yeah.
>> Because I've always thought, I know, I did a podcast prediction video, which often can be a stupid thing to do. But I predicted that tests will be obsolete within a short period of time, specific to the measurement that they're measuring, given the rate of evolution. So, if you're looking at the benchmark and saying, this is the benchmark: oh yeah, there's no "the benchmark" anymore. It's constantly evolving. It should be tested and it should be understood, but of course, if it's new, then take that into consideration. It goes back to the context you're talking about.
>> Yeah. And I guess I smiled for a second, because as someone who has a commercial stake in building benchmarks, you should take my claim that there's always going to be a rising need for benchmarks with a slight grain of salt. Although, in a lot of the work we're doing, we're actually just trying to fund academic teams. And with my academic hat still on, I think this is one of many, many great areas for academic and open source teams to contribute. There's a lot of pessimism I hear sometimes from current or prospective grad students and from others: what can academia do without a billion dollars of GPUs? Well, there's quite a bit, including literally setting the guideposts for the field with new benchmarks that both expose and define areas where things don't work yet, right? So, I think there is no more "the benchmark" because things move so quickly.
And we do see benchmarks get what's called saturated very quickly.
They get saturated sometimes in ways that are counterproductive or deceiving, in that people overfit to them, as it's called, right? Now, I think the terminally online term is benchmaxing.
This idea that there will be 10 new benchmarks, and you see them on the model cards that come out with the models, and the cynical view is, well, teams will run towards them and just see how they can get the highest score possible, but not in a way that generalizes. Like someone who's cramming for a test without actually learning the material.
>> And that that does distort reality there, doesn't it?
>> It's happening, 100%. So, what do we do?
Well, it's not to say that we shouldn't measure models anymore. We need benchmarks. It's just to say we need to be a lot more clever and dynamic about how we build these benchmarks, so that they are harder to game and they're more dynamic and evolving. But I think that is one of those things that matters both from a safety perspective, the usage of the knife, as well as just understanding capabilities. People talk about this jaggedy edge because it's somewhat counterintuitive where models are good versus bad, where they're performant and not. That is also related to safety, because if you overhype AI's capabilities and it's actually really bad somewhere, and then a lawyer uses an AI to write a court defense and it hallucinates citations, or a patient asks an AI a medical question, and we're doing a lot in healthcare, and gets a very dangerous answer in response. Not just knowing whether the model is going to threaten us with a knife, but knowing where it is actually ready to deploy and use, and where it is not, is very critical, not just for improving and filling the gaps, but for safety.
So, we're big believers in this kind of benchmarking. And since we're talking about historical anecdotes, I'll just add one thing there. This jaggedy edge phrase kind of captures it, but there's a lot of misunderstanding of where AI is good versus bad. And one of my favorite anecdotes, there's a name for this, Moravec's paradox, which is a nice fancy-sounding title that I looked up on Wikipedia once. But the anecdote behind this is that back in the '80s, they had one of the first summer sessions for AI. I guess this is before every grad student went off to do a Google internship every summer, and people did summer sessions and summer brainstorming camps in academia, which sounds awesome. And they convened a bunch of professors and grad students, some of the earliest people working on AI.
And they assigned focus areas based on what they thought would be easy or hard for AI, based on what was easy or hard for humans.
So, they thought the challenging things were going to be complex mathematics and long-form logistics and chess and things like that.
And they thought things like seeing or touching or speaking were going to be trivial.
So, they literally gave all of computer vision, which is still a multi-billion dollar industry of productizing and researching, to one intern for a summer project.
Because they figured, "Oh, well, we can see since we're babies. It's easy for us. It must be easy for AI." And so, this idea that there's often a conflation of what's easy and hard for a human and what's easy and hard for AI persists to this day. You look at, for example, some of the hype around some of the earliest benchmarks, which were and have been on mathematics and coding challenge problems.
And AI has blown them away.
And this is shocking, because most of us, especially those of us who are nerdy enough to try those competitions, know how hard those are, right? And it's insane.
But the counterintuitive thing is, even if you just take one area like coding, a longer task that looks like a much more mundane software engineer's job, but that has messy input context and long, shifting objectives and different personas and pieces of the stack that depend on it, and a messier, less verifiable outcome, like a product feature, not just an answer that can be checked with a unit test: AI still falls down there.
So, you have this thing that seems like the pinnacle of intelligence in computer science or math or something like that.
But actually, because it's locally contained and easily verifiable, AI can learn to do it really well.
And then you have something that seems really mundane, but it's got all this messy input, messy output, messy context, and that is actually really hard for coding agents still. So there's an interesting point about what makes things easy versus hard, but the broader point I'm just trying to illustrate is that we are still facing this phenomenon of conflating what's hard for humans with what's hard for AI, and vice versa.
And all this bleeds into not having a great understanding yet of where we can trust AI and where we can't.
>> I mean, I would add to that, and I'd love to just get, as the final point, your view on it.
Connecting benchmarks and capabilities and information and training, surely the benchmark should actually be, I'm trying to think, direct contact. So, we worked on an augmented human.
And we've been speaking to people and developing things about voices, and numerous other modalities.
But, something which is a real benchmark, if if you could see a version of you that you didn't know was you or your wife or your kids Yep.
>> it shouldn't be against what it looks like a human. Yep. We have such nuances and it highlights all the information Yep. which we're not taking into consideration but are actually really valuable because they can help us understand um reality should be the benchmark. Yes.
Yeah. I mean, that's a great example. That's kind of like a multimodal Turing test, and it highlights how difficult it is to create truly realistic benchmarks. And also, look, benchmarks are just one piece of the puzzle. Maybe this wraps up a little bit of what we think about from our corner, as a data partner to so many of the AI labs and companies, and what's interesting commercially and academically in our neck of the woods. There is really this whole set of tools, there's a number of them and they're subtle, that you have to employ together to first use data to measure where AI is even working or not, and then to improve. And measurement is tough. You have benchmarks that you put out there to help inform. But if you focus on them too much, that's also a trap. You have more complex tests like this that you could call a benchmark, but that also often require some real-world human judgment and testing. Those I kind of think of as akin to in-the-field evals: could you deploy an agent in an enterprise setting, and could a co-worker there who's really experienced actually believe it's competent as a co-worker, versus calling it out? So, it's all these different ways of measuring. And the common theme that we like to think about is that they all require data. You need to create data.
You create it for measurement, and then you find the gaps and you create it for improvement. And this is something that we do, and a lot of our work today is with the big labs that are training the generalist do-everything models.
But I think we're headed into a reality where everyone in the world, every enterprise, every organization is doing some version of this loop.
They're coming up with their unit tests, with their real-world field tests, with their mini benchmarks to help them understand: is AI ready for usage in my life, my enterprise, in this setting? They're then using that, and I think it's going to become easier and easier to do this, to tune and specialize their models. And I think that data-centric loop of measuring and then improving with data is going to be something that everyone does, not just the big labs. As someone who's worked on that for years, it's exciting, and I think it's just exciting for a more diverse landscape of AI. And I guess my last little marketing quip, not even for Snorkel, just for data, is: if you are interested in that, if you're trying to get there, kind of back to where we started, don't forget the context, don't forget the data. That's the most important ingredient to get right.
Well, on that note, I appreciate it.
Thank you so much.
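The measure, find-gaps, build-data loop described toward the end of the conversation can be sketched in a few lines of Python. This is only an illustrative toy, not Snorkel's method: the names `Item`, `measure`, `find_gaps`, `hard_fraction`, and `toy_model`, the exact-match scoring, and the 0.8 threshold are all assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Item:
    """One hypothetical benchmark question with a single expected answer."""
    category: str
    prompt: str
    expected: str


def measure(model: Callable[[str], str], benchmark: list[Item]) -> dict[str, float]:
    """Step 1 (measure): per-category accuracy using exact-match scoring."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in benchmark:
        total[item.category] += 1
        if model(item.prompt).strip() == item.expected:
            correct[item.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


def find_gaps(scores: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Step 2 (find gaps): categories scoring below an assumed threshold."""
    return sorted(cat for cat, score in scores.items() if score < threshold)


def hard_fraction(model: Callable[[str], str], items: list[Item]) -> float:
    """Share of items the model gets wrong, i.e. how 'hard' a data set is."""
    if not items:
        return 0.0
    wrong = sum(model(i.prompt).strip() != i.expected for i in items)
    return wrong / len(items)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model: a lookup table with deliberate holes."""
    answers = {"2+2": "4", "3*3": "9", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


benchmark = [
    Item("math", "2+2", "4"),
    Item("math", "3*3", "9"),
    Item("geography", "capital of France", "Paris"),
    Item("geography", "capital of Peru", "Lima"),
    Item("units", "km in a mile", "1.609"),
]

scores = measure(toy_model, benchmark)
gaps = find_gaps(scores)
print("per-category accuracy:", scores)
print("categories needing new data (step 3):", gaps)
print("fraction of benchmark the model fails:", hard_fraction(toy_model, benchmark))
```

In a real setting, step 3 would mean building new training or evaluation data in the flagged categories and then re-running the measurement; `hard_fraction` mirrors the point made earlier in the conversation that new data is only additive if the model tends to get it wrong.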