Reinforcement learning (RL) is the essential algorithm for bringing GenAI models to production because it provides a systematic, mathematical way to integrate business metrics, client feedback, and environmental rewards, which instruction fine-tuning and prompting cannot do systematically. RL enables smaller, faster, and cheaper models and gives enterprises ownership of their data and their solution. It also fits agent training naturally: an environment with a reward signal doubles as a synthetic data pipeline. The speaker argues that this makes RL the only algorithm that can industrialize the model lifecycle from MVP to production and beyond.
Deep Dive
Lessons from Trillion Token Deployments at Fortune 500s - Alessandro Cappelli, Adaptive ML
Hello everyone.
I am Alessandro Cappelli. I'm co-founder and chief customer officer at Adaptive ML.
At Adaptive ML we build an RL Ops platform, as in reinforcement learning operations, that allows large enterprises like AT&T, Manulife, and CCS to build, evaluate, and serve in production their own specialized large language models.
I'm here to show you that reinforcement learning is not just another algorithm for post-training, but the algorithm that, at its core, will bring models to production. Around three years ago I was part of the team that trained Falcon, which at the time was one of the most widely adopted open-source models. That team, which is now the core founding team of Adaptive ML, realized that the gap between bringing an open-source model to production and the proprietary models of frontier labs like OpenAI was reinforcement learning.
95% of GenAI pilots fail to reach production.
Why is that the case? I believe it comes down to what we call the myth of the last mile.
Here you see a timeline of what it supposedly takes to get to production, which I believe is false. The idea is that the hard part is getting to an MVP, coming up with a demo that looks nice in front of stakeholders and colleagues, and that after that there is just a last mile to get the model into production. The issue is that most MVPs are built on top of proprietary models, or on top of open-source models using instruction fine-tuning.
Neither of these approaches lets you systematically improve your solution.
They don't integrate, in a systematic and mathematical way, the defects you find on the journey to production. I will give you an example.
Say you test your solution and find some defects. If you're using a proprietary model, all you can do is change the system prompt. But when you change the system prompt in one direction, other defects may appear, and there is no scientific, systematic way to improve that prompt in a way you can easily monitor.
The same goes for instruction fine-tuning.
The best you can do is iterate on the dataset, which can be expensive. And what about after production? Will you keep creating a new dataset every single week?
This is what I believe is a more realistic view of what getting to production and beyond actually looks like. Getting to an MVP is not easy, but it's just the first mile.
The real journey, the real marathon, is getting from an MVP to production and beyond.
And the secret to doing that is to accelerate the model life cycle: to be able to integrate every piece of feedback you get, from a variety of sources, to keep improving your solution. This continuous retraining, refinement, and improvement, driven by real client feedback, business metrics, and environmental reward, is unlocked in a systematic way only by reinforcement learning.
As I said before, my entire point is that reinforcement learning, almost by design, lets you integrate feedback in an almost mathematical way.
But reinforcement learning is not just that.
Other techniques for steering behavior, like prompting and instruction fine-tuning, aim at the same goal, which is to steer a model's behavior, but they are not equally effective.
Reinforcement learning is disproportionately more effective than instruction fine-tuning, and likewise than prompting.
Reinforcement learning unlocks outsized performance. What does that mean? As you can see in the plot, with RL you can match the performance of SFT with a much smaller model. And that unlocks things that actually help you get to production.
RL enables adoption at scale. What do I mean by that? Smaller models are cheaper to serve at scale, so the tokenomics of your use case will eventually make sense.
When you are a big enterprise like AT&T, any feature you want to make a commodity, whether for internal employees or for customers, will cost you millions of dollars at scale.
As an example, AT&T summarizes every single transcript of a conversation between a customer and an agent. Just summarizing those costs them millions of dollars.
If you can train a model that is much smaller than, say, ChatGPT or Sonnet, you will save money.
Another thing you unlock is speed: smaller models are faster. Not all use cases require speed, but many have a latency threshold that is not a nice-to-have but a constraint that will keep you out of production. Say you have a customer support model powering a speech-to-speech system. You can't go above half a second, and I would say half a second is already weird. When you're talking to someone and you have to wait half a second, that's already weird. Ideally, it should be a third of a second. And a third of a second is something you will never get with large language models. You need to use small models.
Maybe the latest Gemma, the latest Mistral, the latest Qwen, in that 10B family, but you can't use much larger models. The last thing you unlock is ownership. You own the data you give to the model, because the model is trained on your own business data, and you own the solution. So you don't need to worry about the latest model update shifting performance underneath your feet.
Everything I've said so far is true for any use case you can think of.
I mentioned summarization, but it could have been classification, OCR, or anything else.
And reinforcement learning is already the better choice there. But now we are in the era of agents, and agents make everything more complicated. Agents require more tokens and bring more complexity.
There is less room for error, because agents now have access to data and will change things in databases connected to your internal employees or your clients.
All of that raises the bar for what can be brought into production, and it raises further questions about whether the tokenomics of an agent make sense. I mentioned that for a summarization use case alone you might spend millions. Imagine scaling to an agent that uses 10x the number of tokens.
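The scaling argument here is simple arithmetic, sketched below. Every volume and price in this snippet is a made-up placeholder chosen for illustration, not a figure from the talk or from any provider's price list, and `serving_cost` is a hypothetical helper.

```python
def serving_cost(requests: int, tokens_per_request: int, usd_per_million_tokens: float) -> float:
    """Total serving cost in USD for a given request volume."""
    total_tokens = requests * tokens_per_request
    return total_tokens * usd_per_million_tokens / 1_000_000


# Placeholder numbers (assumptions, not real prices or volumes).
REQUESTS = 1_000_000
TOKENS_PER_SUMMARY = 2_000
LARGE_MODEL_PRICE = 10.0  # USD per million tokens, hypothetical
SMALL_MODEL_PRICE = 1.0   # USD per million tokens, hypothetical

summary_large = serving_cost(REQUESTS, TOKENS_PER_SUMMARY, LARGE_MODEL_PRICE)
# An agent that consumes 10x the tokens of the summarization task.
agent_large = serving_cost(REQUESTS, 10 * TOKENS_PER_SUMMARY, LARGE_MODEL_PRICE)
agent_small = serving_cost(REQUESTS, 10 * TOKENS_PER_SUMMARY, SMALL_MODEL_PRICE)

print(f"summarization, large model: ${summary_large:,.0f}")
print(f"agent (10x tokens), large model: ${agent_large:,.0f}")
print(f"agent (10x tokens), small model: ${agent_small:,.0f}")
```

Cost is linear in tokens, so a 10x token budget means a 10x bill at the same price per token. Under these placeholder prices, the smaller model brings the agent back to the cost of the original summarization workload.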
The advantage RL already had only widens when it comes to training agents, because RL at its core was made to train robots, to train agents to live in an environment. And environments are where agents actually act. So RL fits naturally when you want to train a model to be a good agent. Now, there are two scenarios. In the first, you already have an agent in place.
As an example, we work with Manulife, and Manulife already had agents. They already have an actual workflow in place, so we don't need to recreate it on our side.
You can directly plug in the model you want to train.
It can be, say, the latest Qwen 3.5.
And you can train the model directly on an environment that already exists.
If no such environment exists, it can still be built. You can mock the tools, and if the specific case needs one, you can mock the user too.
For instance, if you want to create a chatbot that has access to tools, the mock user might be an LLM.
What about the reward? The reward can be any business outcome, any KPI, or any LLM-as-a-judge that defines what success looks like to you.
Like was the agent helpful?
Was the agent useful? Did the agent use a tone and a vocabulary that follow business guidelines?
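The pieces described above (mocked tools, a mock user, and a reward) can be sketched in a few lines. This is a toy illustration, not the Adaptive Engine API: `lookup_order`, `MockUser`, `toy_agent`, and `reward` are all hypothetical names, the mock user is a fixed script standing in for an LLM, and the agent is a hard-coded placeholder for the model being trained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


def lookup_order(order_id: str) -> str:
    """Hypothetical mocked tool standing in for a real backend call."""
    return {"A17": "shipped"}.get(order_id, "not found")


TOOLS: Dict[str, Callable[[str], str]] = {"lookup_order": lookup_order}


@dataclass
class MockUser:
    """Scripted stand-in for a mock user; in practice this could be an LLM."""
    script: List[str]
    position: int = 0

    def speak(self) -> Optional[str]:
        if self.position >= len(self.script):
            return None  # the user has nothing more to say
        message = self.script[self.position]
        self.position += 1
        return message


def toy_agent(message: str) -> Tuple[str, Optional[Tuple[str, str]]]:
    """Placeholder policy: returns a reply and an optional (tool, argument) call."""
    if "order" in message:
        return "Let me check that for you.", ("lookup_order", "A17")
    return "Happy to help. Anything else?", None


def run_episode(user: MockUser, agent=toy_agent) -> List[dict]:
    """Roll the agent out against the mock user and record the trajectory."""
    trajectory = []
    while (message := user.speak()) is not None:
        reply, call = agent(message)
        result = TOOLS[call[0]](call[1]) if call else None
        trajectory.append({"user": message, "agent": reply, "tool_result": result})
    return trajectory


def reward(trajectory: List[dict]) -> float:
    """Toy outcome reward: 1.0 if the order status was retrieved, else 0.0."""
    return 1.0 if any(step["tool_result"] == "shipped" for step in trajectory) else 0.0


episode = run_episode(MockUser(["Where is my order?", "Thanks!"]))
print(reward(episode))
```

In a real setup the reward function would be replaced by a business KPI or an LLM judge, and the trajectory would be fed back to the training algorithm.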
On this topic, two colleagues of mine, Letizia and Joao, recorded a workshop, which you can find on our engineering website, that shows exactly how to train a model by plugging into an existing environment.
When I talk to clients, one of the main reasons they doubt they will ever get to an MVP or to production is that they don't have the data. Data was already an issue before agents. With agents it is even more of an issue, because agent training data doesn't exist in the wild. There is no dataset you can scrape from the web where an agent is using tools.
You don't have such a dataset. The nice thing is that when you train a model with reinforcement learning and you have an environment and a reward in place, you have created, as a byproduct, a synthetic data pipeline. With an environment you can generate trajectories, and the reward tells you which ones are good and which are not. So you can do rejection sampling and create a dataset to bootstrap the first training of a model. And even though many companies don't have the exact data required to train agents, they still have a lot of datasets that can be leveraged to improve the whole experience in the environment. Such data might be real transcripts between customers and agents, which can be given to the mock user. The mock user can even be trained on them to behave like a realistic person, one who might be annoying or ask the same thing three times in a row. We work with customers like a medical supply company, where people who call might be in a panic. There the correct behavior might be "I will escalate you to a human agent" or "I will call 911 for you." That kind of messy, real conversation is something that can easily be mocked using proprietary datasets.
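As a rough illustration of the rejection sampling step mentioned above: sample several candidate responses per prompt, score each with the reward, and keep only the best one if it clears a threshold. The function names, the threshold, and the toy generator and reward are assumptions made for this sketch. A real pipeline would sample full agent trajectories from the environment.

```python
import random
from typing import Callable, List, Tuple


def rejection_sample(
    prompts: List[str],
    generate: Callable[[str, random.Random], str],
    reward: Callable[[str, str], float],
    samples_per_prompt: int = 8,
    threshold: float = 0.5,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Keep the best-scoring sample per prompt, if it clears the threshold."""
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt, rng) for _ in range(samples_per_prompt)]
        scored = [(reward(prompt, candidate), candidate) for candidate in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            kept.append((prompt, best))
    return kept


# Toy stand-ins (hypothetical): a random "policy" and a keyword reward.
def toy_generate(prompt: str, rng: random.Random) -> str:
    return rng.choice(["I will escalate you to a human agent.", "Please hold."])


def toy_reward(prompt: str, response: str) -> float:
    return 1.0 if "escalate" in response else 0.0


dataset = rejection_sample(["I need help now!", "My device stopped working."],
                           toy_generate, toy_reward)
for prompt, response in dataset:
    print(prompt, "->", response)
```

The resulting `(prompt, response)` pairs form the bootstrap dataset for a first round of supervised training.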
Where is the human in the loop?
RL became famous in the LLM world thanks to ChatGPT, because OpenAI published a blog post saying they did RLHF, reinforcement learning from human feedback.
But what sometimes hides behind that human in the loop, which is nice to hear, is an expensive annotation campaign. In my experience, nobody wants to run an annotation campaign. It is either expensive or useless, because the reality is that people don't want to do it. But you still want to keep a human in the loop. So where does the human come into the picture I just showed you?
When you train with RL, the most important thing is to build a reward signal. A reward signal can come from different sources. It can be a systematic reward: does the code run? Is the syntax correct? It can come directly from KPIs or business outcomes. One of our clients, CCS, the medical supply company I mentioned before, has a customer support system, and like any other customer support system, what it tries to maximize is containment rate: how many calls are handled end-to-end by the model. That percentage of calls handled end-to-end is something you can maximize directly.
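Two of the reward sources named above are easy to make concrete. The sketch below assumes the generated code is Python and uses the standard-library `ast` module for the syntax check. `syntax_reward` and `containment_rate` are hypothetical helper names, and the call outcomes are invented for the example.

```python
import ast
from typing import List


def syntax_reward(source: str) -> float:
    """Systematic reward: 1.0 if the text parses as Python, else 0.0."""
    try:
        ast.parse(source)
        return 1.0
    except SyntaxError:
        return 0.0


def containment_rate(handled_end_to_end: List[bool]) -> float:
    """Business KPI: fraction of calls the model handled without escalation."""
    if not handled_end_to_end:
        return 0.0
    return sum(handled_end_to_end) / len(handled_end_to_end)


print(syntax_reward("x = 1"))
print(syntax_reward("x = = 1"))
print(containment_rate([True, True, False, True]))
```

The first is computed per response, while the second is an aggregate over many calls, so in practice it would be attributed back to individual episodes (for example, 1.0 for a contained call and 0.0 otherwise).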
Many other things, like whether the tone was correct or the business requirements were followed, are too open-ended for a systematic reward, but that issue can be solved with LLMs as judges. The human in the loop helps just by defining the rubrics, defining the system prompts for these LLM judges, and defining the scenarios, that is, making sure the judges are aligned with what the humans see. This activity will take the human from a few minutes to a few hours, but it will not take weeks, and you don't have to repeat it dozens of times.
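A minimal sketch of how a human-written rubric might be turned into a scalar reward. The rubric entries, weights, prompt template, and `stub_judge` are all illustrative assumptions. In a real system the `judge` callable would wrap a call to an LLM, which is deliberately left out here.

```python
from typing import Callable, Dict, List

# Hypothetical rubric that a domain expert would write and maintain.
RUBRIC: List[Dict] = [
    {"name": "tone", "question": "Is the tone polite and professional?", "weight": 1.0},
    {"name": "policy", "question": "Does the reply follow the business guidelines?", "weight": 2.0},
]


def build_judge_prompt(criterion: Dict, conversation: str) -> str:
    """Format one rubric criterion as a yes/no question for the judge."""
    return (
        f"Criterion: {criterion['question']}\n"
        f"Conversation:\n{conversation}\n"
        "Answer yes or no."
    )


def judge_reward(conversation: str, judge: Callable[[str], bool],
                 rubric: List[Dict] = RUBRIC) -> float:
    """Weighted fraction of rubric criteria the judge marks as satisfied."""
    total = sum(criterion["weight"] for criterion in rubric)
    earned = sum(
        criterion["weight"]
        for criterion in rubric
        if judge(build_judge_prompt(criterion, conversation))
    )
    return earned / total


def stub_judge(prompt: str) -> bool:
    """Placeholder for an LLM call: approves only the tone criterion."""
    return "tone" in prompt.lower()


score = judge_reward("Agent: I will escalate you to a human agent.", stub_judge)
print(round(score, 3))
```

The human effort sits entirely in `RUBRIC` and the prompt template. Scoring individual conversations is delegated to the judge.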
For all these reasons, RL is not just one algorithm but the one algorithm that industrializes bringing models into production. Over the last two years at Adaptive, we built the Adaptive Engine, an RL Ops platform to evaluate, tune, and serve the best LLMs for your business.
The Adaptive Engine is a holistic platform where you can observe, train, and serve in one place.
When I said at the very beginning that the goal is to accelerate the life cycle, that doesn't mean accelerating training per se. You also want to evaluate, to be sure the model is actually behaving, and you want a systematic way to find defects pre- and post-production and to act accordingly. This is something you can do only with a systematic, holistic approach.
Our models are built on top of the best open-source models. Almost any open-source model you can think of, like Gemma 4 that you heard about a few days ago, the latest Mistral, or the latest Qwen, is available to your company, and depending on the model you prefer, you can start building on top of it.
And finally, what's the catch with RL? The only catch is that reinforcement learning is actually hard. It is not as easy as changing a system prompt, and it's not as easy as building a dataset for instruction fine-tuning. One of the most famous RL algorithms, PPO, requires orchestrating not one but four large language models at the same time.
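The talk does not name the four models. In the common RLHF setup they are the policy being trained, a frozen reference copy, a reward model, and a value model (critic), and that is the assumption behind this sketch. It shows, with plain scalars, where each one enters the PPO update. The one-step advantage and the coefficients are simplifications chosen for illustration (real implementations typically use GAE over token sequences).

```python
import math


def shaped_reward(reward_model_score: float, logp_policy: float,
                  logp_reference: float, kl_coef: float = 0.1) -> float:
    """Reward model score minus a penalty for drifting from the reference model."""
    return reward_model_score - kl_coef * (logp_policy - logp_reference)


def one_step_advantage(reward: float, value_estimate: float) -> float:
    """Simplified advantage: reward minus the value model's prediction."""
    return reward - value_estimate


def ppo_clipped_term(logp_new: float, logp_old: float,
                     advantage: float, clip: float = 0.2) -> float:
    """PPO clipped surrogate objective for a single action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1.0 + clip), 1.0 - clip)
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy numbers: each input below would come from a different model.
r = shaped_reward(reward_model_score=1.0, logp_policy=-1.0, logp_reference=-1.5)
adv = one_step_advantage(r, value_estimate=0.5)
objective = ppo_clipped_term(logp_new=-0.9, logp_old=-1.0, advantage=adv)
print(r, adv, objective)
```

Keeping four models in memory, in sync, and fed with rollouts is the orchestration burden the speaker is referring to.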
That is where the Adaptive Engine shines: we let you define the rubrics and the rest, and we take care of the complexity of reinforcement learning by exposing a series of pre-built recipes. You don't need to implement the latest algorithm, say GSPO, and you don't need to build the training recipe to run an actual training. So, once again, RL is the one algorithm that will let you bring models into production in a systematic and industrialized way, and all of that is possible with the Adaptive Engine.
Thank you very much for your attention.
I have a question about some of the human feedback portions that can get incorporated into RL.
For example, last year Cursor had a blog post where they outlined how they take human feedback from production data, such as whether or not the top completion is accepted, and constantly use it to update the model. In a more traditional LLM RL setup, you would do several rollouts per prompt, pick the ones that work, and train on those few samples.
When it's human feedback and there is a single signal, do you do replays to get many variations of outputs for that problem and train on them, or is it effective to train on a single implicit feedback signal from production as the reward?
So what you're asking is how we leverage human feedback. I would say there are two scenarios. In the first, the human feedback doesn't come from production; it comes from, say, 10 to 20 pieces of feedback. In that scenario, we usually use it to improve the LLM judges: this is good, this is bad, and how does it fit the current description of what you are trying to do. At the very beginning, the judge is just a prompted, really big large language model, say Qwen 235B. And the nice thing is that as you go to production, you will have thousands of such pieces of feedback.
As we move to production, we have so much data that, rather than using it for LLM judges, we use it to train reward models, so that we can scale that human feedback and use it to actively train the LLM.
As for the question about when the feedback is implicit rather than explicit, I think it really depends on the specific use case, and we can build a reward model accordingly.
Because we already have the data, we can run the two different scenarios and see which kind of training gives the best performance on a given evaluation.
You're welcome.
Okay. Then, thank you very much.