拡張機能をインストールして、あらゆる動画内を即座に検索しましょう

RLHF and Post-training Overview | RLHF Book Course, Lecture 1
追加: 2026-05-05

525 回視聴4346:10natolambert元のリリース: 2026-04-14

Lambert masterfully deconstructs the complexity of RLHF, turning a nebulous alignment process into a clear, structured engineering discipline. It is an essential foundational lecture that bridges the gap between raw compute and human-centric AI utility.

[00:00:00]Okay, let's do this. This is lecture one of what I'm describing as the RLHF course. I've been working on the RLHF book.

[00:00:10]Plenty of links will be below. Most people already know this and I hope that the viewership of this will grow over time. I will do accompanying videos with questions and context for this course.

[00:00:22]Everything will be at rlhfbook.com.

[00:00:25]This first lecture is an overview of RLHF post-training, how we got where we are with language models, and we'll have a very different tone than most of the lectures at least until maybe later on.

[00:00:37]This is all about what RLHF is, reinforcement learning from human feedback, how it became post-training, how I view the world, and where the exciting things are now. And you can troll me in the comments for it still being the reinforcement learning from human feedback book. It will be that and as we go through the lectures you will understand why. So let's get into it.

[00:01:00]To start, this will start broad as broad about as broad audiences as I will get through the book. Some of this will be technical and I will take questions. We have to start with like what is an actual language model? And language models apply applied probabilities to pieces of text and in reality the texts are just a representation called a token, which is a chunk of word. It is an internal representation of the model and these models are auto-regressive. So given the previous inputs, these models will predict a distribution over the next output. And you repeat this over time and they're called auto-regressive and this is what is done to decode a sequence of text like a sequence of tokens into words when you're chatting to a language model and gives you a longer response. On the right is a famous diagram which is soon almost a decade old which shows the original transformer architecture with attention interleaved with dense layers. It was an encoder-decoder model which is a specific architecture that's been simplified. But this is where everything started and it's unbelievable that it's almost 10 years ago now.

[00:02:04]I think language models have been changing a lot. I describe it as big tech is industrializing this technology.

[00:02:11]It is scaling it up in every way it possible to extract performance.

[00:02:16]Now they have billions to trillions of parameters. The best ones are going to be in this size range for quite some time. They are downstream of the transformer architecture which has remained remarkably steady with for popularizing its use of the self-attention mechanism in this language model architecture. The transformer did not invent attention, but they popularized it with a really scalable architecture. And today the models are predicting much more than just text. I think it's very notable that Gemini and the GPT models power a very wide range of multimodal projects products from images, audio, video, and they're all changing very fast.

[00:02:55]Claude is distinctive to not have as many of these modalities, but this is because Anthropic is making a much narrower bet and over time these models will work with all sorts of different types of data.

[00:03:07]So if we take a timeline starting with when the transformer was born in 2017, we'll kind of go year by year and talk about how the core ideas of what a language model is was changing over time. So in 2018 there was a series of both like mostly academic in open research on establishing the foundations of language models with different representations and different approaches.

[00:03:33]GPT-1 was OpenAI's approach of starting to scale web scale text.

[00:03:38]ELMo was from Allen Institute for AI where I still currently work and BERT was from Google. ELMo was known for learning a specific type of representation that became popular and BERT was a series of classifier models which is one of the most used architectures of all time to date in any open model adoption mechanism metric.

[00:03:58]These were the foundations of modern language models. It's just getting the shape right and the data pipelines right. And 2019 is when the early scientific ideas of scaling laws really became more well-known. I think that GPT-2 was a manifestation of this where they just started scaling up compute and the empirical science of the idea that training on more compute gets a predictable power law relationship to test loss. This is a major scientific breakthrough that all progress is still relying on today.

[00:04:29]And I think then we go into GPT-3 which is when people first really started to get surprised by the capabilities of language models. GPT-3 was known for this these behaviors of one-shot and few-shot learning especially where the model when given examples of an out-of-domain task, something the model hasn't been fine-tuned on, it could then learn to extrapolate and like perform reasonably well in this new task. At this time it's good to note that language models were really fine-tuned for a specific set of tasks and this sort of generalization wasn't assumed to be the case in all of the models. So it's wild to hear today in 2026, but people were very alarmed and totally changed their career trajectories based on the performance of GPT-3. It was very surprising and very unex- it's just unexpected for people. This was still before I was working on language models, so I was out of this loop.

[00:05:22]In 2021 there was a paper called On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

[00:05:28]I put this in to show that there was still a major question over what the value of language models were and how this was going to be scaled. I think some directional arguments of this paper are likely still true which is like what is the cost of building this giant technology. But you can look at it by the name Stochastic Parrots. And there is a population of people that think the language models are just this and they're innately assuming a lack of value in this framing.

[00:05:59]Um I won't go into this too much, but I would be on the side that this the core thesis of this paper has largely been refuted and that was just like an era in between capability jumps. Because after 2021 comes ChatGPT in 2022.

[00:06:18]When ChatGPT was released I was at Hugging Face and this is when I started working with language models.

[00:06:23]Obviously everybody here will know about it. The thing that maybe forgotten was that reinforcement learning from human feedback was cited as the key capability that helped push GPT-3.5 over the edge to be very useful for users. And this key insight of like RLHF being linked to ChatGPT's viability was why I and many other people got very excited about RLHF back in the day.

[00:06:46]But we continue and 2023 was about scaling models, getting things set up.

[00:06:53]GPT-4 was the ultimate YOLO run. On the right here is a famous image of Jensen Huang at a GTC conference where he supposedly leaked the parameter size of the original GPT-4 model about being about 1.8 trillion parameters. And 2023 was an interesting year. Like you had like Google try to launch Bard and failed then, but it was really just starting to grab resources.

[00:07:16]And the language models under the hood looked fairly similar. Come 2024 the biggest step was that we changed from these shorter answer models to these reasoning models that think extensively before answering and they unlock inference time compute. I'll look at I'll revisit this plot later in this talk, but there's a major change in how much demand and just what the capabilities for models were. The models immediately became way more useful and used more tools and it just pushed more momentum onto the fire. And I would describe 2025 as being when agents really started to work. We had Claude Code, we had O3, we had the search models and so on. And by the end of 2025 is when the model Claude Opus 4.6 was released which was a major change. And that leaves us at today where I could put 2026 is like is this can we're start going to start to see the impact of agents on building more AI. That is a more exploratory topic that is outside the scope of a a lecture like this which I hope to be timeless, but I will happily talk about this at another time.

[00:08:18]So if we go back into the language models, like what is a language model?

[00:08:22]We need to at least start with what pre-training is doing because this is the foundation of language models.

[00:08:28]GPT-1, GPT-2, GPT-3, when people are looking at these models they were not thinking about they were not interacting with a largely fine-tuned model most of the time. They were taking a base model and seeing wild capabilities.

[00:08:40]And pre-training today is at extremely large scale. So the models are trained on 5 to 50 trillion to hundreds of trillions of tokens. I think the the variance there is based on you all have a unique number of tokens and then the labs will consider training on multiple epochs to get the most out of their model. For reference these a trillion of text is about 3 to 5 terabytes of data and labs will gather 10 to 20 x that amount of data by crawling in order to then subsample and take the final data. So the raw data that you consider to train on is petabytes and all of these labs have their own data sets of these size and it's just like the scale of the web text that people are using is very very large. It's much more like obviously it's much more than the actual model weights will take to store. But these are incredibly flexible models and they're all trained with this next token prediction idea where you have a a next token in the sequence and a correct label and you are going to you're going to modify the probabilities of the model so that it's more likely to predict the next token that is correct and it downweights all the other tokens.

[00:09:50]When you put this a lot of these examples into a batch, you get a very useful, valuable learning signal that gave us all the magic of language models that we use today and post train and everything in between.

[00:10:05]So, when you look at this, if you were to take a base model just pre-trained on this web text, what you will get is like an auto-complete agent. So, if you pass in a a chunk of a phrase or anything to the models like the president of the United States in 2006 was, these base models, this example is from Lama, oops, um would continue the text. They said the president was George W. Bush, the governor of Florida until do this kind of somewhat repetitive thing and it kind of looks like you're reading something on the web. And this is where post training comes in. All of post training, whether it's instruction fine-tuning, RLHF, etc. is built on the foundations of this chatbot or assistant persona. The structure of what we do is that we reformat how the information is presented to the user.

[00:10:51]And now it's like this question-answer format. So, a fine-tuned model answer like in a complete sentence or a phrase and it's like George W. Bush was the president of the United States in 2006.

[00:11:01]He served two terms in office from January 2000, blah blah blah. And this Tulu model is fine-tuned directly from this base model. So, they still have an intellectual like lineage to it.

[00:11:13]And what we see over time, like this was a this model is a few years old at this point and what we have seen is that the style of these responses changes a lot over time.

[00:11:23]So, we have this um Sorry, that we'll come back to that. And like what why the styles change is cuz we kind of have different training stages and what people the tools people are using and what people are optimizing for changes. So, it to kind of build the foundation for the intuitions of the training stages that we'll talk about in this book is like obviously pre-training it builds the models' world knowledge, it builds fluency, it builds a lot of honestly unexpected links and capabilities. I mean unbelievable the amount of models have.

[00:11:55]And then there's a variety of types of optimization that we can do in post training. We can do what is called instruction tuning in SFT, which is training the model on specific examples.

[00:12:05]We can do preference tuning in RLHF, which is when you are figuring out ways to show the model a comparison of samples and kind of extract the subtle signals between them.

[00:12:14]And we have this reinforcement learning with verifiable rewards, the RLBR, which is all the recent thing where you're training models with RL on verifiable problems like map and code.

[00:12:24]How I think about these is that instruction is really great for kind of instilling specific features and formats in the model. So, if your model likes to repeat a phrase like as a language model trained by blah blah blah, that normally means your instruction tuning data has too much of the specific phrase that you don't like and you can remove it. Where preference tuning in RLHF is a much softer optimizer but still very useful for the intelligence of the model because it's it's not using this auto regressive loss that instruction tuning and pre-training are. Where it's obvious that auto regressive loss functions are not aligned to what users want from the model. They are effective but this kind of like RLHF loss looks at everything together based on a label from a human judge, an AI judge and then it adjusts all the probabilities in the sequence at lunch to kind of nudge the model into more useful directions. In RLBR we're seeing today with scaling is just fundamentally a useful way while it's lower it uses less information than pre-training which is every token but it just gives a supervision signal that can let the model learn to solve harder tasks with a lot of compute in post training.

[00:13:33]We'll come back to all of these so I don't need to spend too much time on this slide.

[00:13:37]And this what this means is that like we need to use SFT as a foundation and it's still super effective in tons of domains because if you know the distribution you're targeting instruction fine-tuning is the easiest way. It's efficient, it's simple to create high quality data but RL post training is really richer, it's response level, it's where the future is of development and like we even see things in models that we've trained where RL will kind of like unbreak the model a bit. So, we've like done some things and the data was messy before but these reinforcement learning optimizers are just really nice for rounding out the model and making all the it's which is a changing the probability internals of the model to make better answers.

[00:14:21]And when we go back to this example of like what are the different types of things that the models are completing and like what they look like. Like ChatGPT was when RLHF model made the models easier to use and pushing this direction it kind of adds format, style, it starts to make the model feel like more of a product and you see features like uh markdown text. So, you have bold, you have lists, you have clever turns of phrase that the models like to use.

[00:14:50]Unfortunately, you also have sickophancy which is a trait where these AI systems like to tell people will tell people what they like to hear and this is because people like it. I think there's there's fundamental truth in this even if it has downsides in terms of societal consequences and kind of questioning the model.

[00:15:11]So, this is where I come to the core of the lecture which is like what is reinforcement learning from human feedback? What is RLHF? What was the early literature on it?

[00:15:21]So, to do this we needed to cover some reinforcement learning. If you are very new to RL, the famous book is the Sutton and Barto book which you can read. It is excellent. But really a reinforcement learning problem is often format formatted as a Markov decision process which is kind of a trial and error learning where you have a state space, an action, transition dynamics which are determined how the environment unfolds, a reward function which shows how good you are, a discount factor which weighs current versus future rewards.

[00:15:51]And you kind of solve this objective to maximize the rewards within this setup.

[00:15:56]And it's known as trial and error learning, you balance exploration and exploitation.

[00:16:01]Um you have the state space, the actual like all these things you define.

[00:16:05]There's this mathematical notation but really it's this top right which is the agent acts in the environment, the environment determines what the next state is that the agent has to act in and the reward. And the agent is trying to maximize reward through these many interactions and they learn largely the literature used to learn from scratch where language models are a bit different.

[00:16:25]So, a simple example would be like a thermostat where you have the dynamics of the room, you have your target temperature and you could learn a policy that can turn the heat or AC on and off depending on what the reads of the current temperature is. And really the policy would be like how that thermostat is getting turned on and off. You can actually formulate the math as in the bottom which is like how do you predict the probability of different trajectories based on that you have a given agent taking actions. And this ends up being kind of a a multiplicative probability of like what the state probability transitions are which is like what are the environment day environment dynamics and what would the agent do in every step. So, pi here is the policy and then you have these probabilities of state dynamics and various things.

[00:17:09]This is obviously very contrived.

[00:17:11]What would be a standard task in RL is something called a cart pole which is a cart where you can move the bottom and you are trying to balance this wobbling pole at the top. You determine the state which could be like the position, the velocity, the angular position, the angular velocity and these actions and rewards. And this is a very common thing that the the simplest RL algorithms can learn I think. You will be able to ask Claude to build a cart pole example and an RL algorithm to solve it and you'll be able to run it on your laptop and see it improve over time. Like it is very low compute and you can do this and it's kind of a canonical example.

[00:17:46]You would could actually like what you would do in that case is Claude would determine these dynamics and write a like a approximate dynamics as a differential equation and it would do this one step dynamics and actually solve this. I think what is fundamental here is you're seeing that these dynamics evolve over time and you would have to take many actions to move the cart each step which is like classically RL is a multi-step control problem.

[00:18:09]We're trying to solve for trajectories of agents in an environment trying to solve a specific goal.

[00:18:15]I think there's a question of like why did people transition from this classical RL into RLHF? In classical RL the environment is defining a reward function as part of the world. The reward function is what is defined as being good. Cart pole has this kind of thing where the model wants to have the cart pole be balanced and then you get reward. That's like a time-based reward.

[00:18:38]But for many tasks these rewards are unknown. When you think about the real world and what we do every day, we do not really know what the reward function that would reduce our jobs to a scalar is. Examples for human is that it's it's easy to judge which and there's other things like it's easy to judge which poem is better but it's hard to write a good poem. That's not directly reward but it's it's showing us like what types of supervision could make learning better. If it's easy to judge something, maybe we can use that feedback to as like a proxy for what reward would be and rather than trying to learn from scratch. And it's like uh where pre-training optimizes this next token prediction where like oh it's going to need different things going on even though we can't always specify the exact reward which is useful in the downstream world. Sorry, the text on the slide is a little muddled but we're going to keep going.

[00:19:28]So, in the history of RLHF I would ask you which is a better backflip? If I present you with two options, which is better? And this is encoded in text, it's not a reward function but this is something that you could actually do.

[00:19:43]I would choose the one on the left but you choose your own. I think I asked this because this is the sort of task that was used to study RLHF before language models. There's some work from 2008 and the the GIFs I showed you are from a 2017 paper, the seminal paper that's like Cristiano et al. It sound It's like learning to something with human preferences. And they use a pretty simple setup, which is a standard RL algorithm setup, but um they train a reward predictor to learn the reward from the environment and they use human feedback to actually do this.

[00:20:19]And what they did was like this was a data training sample where they or this wasn't a training sample. This was an example where on the left is the final agent that was trained with human feedback and the right was when the human wrote a reward function for what a good backflip is. And it's clear that this human feedback worked better. The training data would have looked like a comparison between this.

[00:20:41]And it's pretty remarkable that this works at such a simple experiment compared to what we're doing today.

[00:20:47]I think there is a lot of work that tran- transitions us from this Cristiano et al. work through language models through the modern era. I think that an underrated paper is the Ziegler et al. 2019 paper. I can see if my I can click through, which is like fine-tuning language models with human preferences, which is like a work where a lot of the terms of art for what was being done today, like preference models, how you fine-tune was actually done, but it was much less known. The work that popularized RLHF was this InstructGPT paper and then some work from Anthropic on Constitutional AI. And this right column is like the sort of modern transition from RLHF on the left and then into it becoming post-training and we're going to cover all of this, but these are the types of things you should go read. You can click through the slides and get all these references on like building the history of what became modern RLHF.

[00:21:46]So like here's a bit of a summary of more of them. There's more that I didn't have on this figure. Like WebGPT is a popular paper where they trained a model to browse the web and answer questions using RLHF. That sounds awfully timely now. DeepMind had WebGPT was from OpenAI as well. DeepMind had Gopher site and Sparrow, which are works that look really similar to OpenAI's RLHF work and and thing and Constitutional AI even.

[00:22:09]They just weren't as popular, but the the seminal work ended up being InstructGPT, which defined this three-step recipe for RLHF in early post-training that that powered ChatGPT's first version.

[00:22:22]So I talked about this comparison data between agents, but like it's good to think about like what would RLHF data actually look like?

[00:22:31]So what traditionally happened is you take a prompt, you generate two responses either from the same model or similar models and then a human would decide which response is better. It could look like this, which is on the screen. There's many different interfaces and then you use that to train some sort of model, which can capture human preferences and you can kind of make it more like the better responses.

[00:22:54]I think when we compare classical RL to RLHF, I made a big point of classical RL being very time-varying, having rewards at every step and all of this, but RLHF is pretty contrived because there's no environment. The agent is the language model. It generates a completion and it just does that and then it stops. That completion is rated by a reward model, which is a proxy rather than a ground truth reward.

[00:23:18]And then it's like kind of just one reward per sequence. So it's much more granular. It's It's definitely pretty contrived relative to RL, but the magic is that the methods we designed for reinforcement learning algorithms worked when we kind of prune the setup down here. And there's another KL penalty, which we'll talk about in the class, which is just like how to constrain the model because we're starting from a strong foundation of language and we want to keep the strong priors of the language model rather than learning from scratch in traditional RL.

[00:23:48]So we'll see a lot of these diagrams throughout as we go through. And this is one of the most famous diagrams in RLHF's history, which is like the classic three-step RLHF recipe, which is step one, collect demonstration data and train a supervised policy. Step two, collect comparison data and train a reward model. Step three, do the actual RL, optimize the policy against the reward model using RL using RL.

[00:24:14]This is all a bit outdated, but it is shows the simplest possible way to get to RLHF. So you need to do instruction tuning and you need to collect data in order to actually do this RL. And this is what um InstructGPT showed.

[00:24:28]We're going to go into each of the steps.

[00:24:31]So the step one instruction tuning is really about getting the highest quality completion data.

[00:24:36]Back in InstructGPT's time, all the completions were written by humans. So we'd have humans write things like um writing that a poem about goldfish to get high quality training data to demonstrations. And it's really about getting desired agent behavior. We trained with supervised learning with the same kind of setup as pre-training. You have very different settings like batch size, learning rate, etc. So it's like the batch size ends up being much smaller.

[00:25:03]Um you're trying to learn the specific behavior, but not overfit to it. So the learning rate ends up also being a bit smaller. And it's just like a much more constrained optimization than pre-training where you're worried about overfitting too much to the specific style of answers, but you need to learn a much more specific format than pre-training where pre-training was much more about generalization.

[00:25:26]So this is really the foundation. You can probably find whole courses on instruction tuning out there. There are a lot of useful resources. I think I intentionally am not going to go into it too much in the course. There's a bit more slides in the next next lecture, but it's pretty minimal. And then once you have this instruction tuned model, you have to actually figure out how to train a reward model. In the RL setup, the reward model is like what's going to be the guide for RL. And we collect comparisons between two models on the same prompt and then you can train a reward model with something called a Bradley-Terry model, which essentially is predicting the probability that the chosen answer will be given a higher score than the rejected one. Um kind of based on the language modeling background you have.

[00:26:07]It's trained with this log-likelihood loss, which is somewhat simple to to implement and it's trying to pull across these two examples. You give it a good example and a bad example and it's and the same model is learning from both and it's trying to separate them in the model's representation.

[00:26:22]So it's like the the RLHF is do- RLHF reward model is doing an interesting thing where the model is predicting the probability that a given piece of text you showed at inference time looks like one of the chosen responses. And it's kind of an interesting thing that falls out of the math where you train it on this pairwise data and you run inference on just one piece of text that you're trying to rate.

[00:26:44]And this is important to setting up for the next step, which is RL against the reward model because RL needs the scalar reward. And there's a lot going on here.

[00:26:52]This diagram is very complicated. I'll talk about it more in multi- probably multiple dedicated RL lectures.

[00:26:58]But it is just that you are sampling from the model and then you're using this reward model to rate the samples.

[00:27:08]And then you are taking gradient steps with RL. It sounds very simple that it all works. There's a lot of setup to get it right. Like you need to have a reference model stored with your trained model and the reward model. Like the system side is very hard.

[00:27:22]But looking back today relative to the other parts of language modeling, the high level is fairly simple. And that's very good that it's just like the core idea is that this RL can learn from the compression of the human preference data in the reward model.

[00:27:36]Actually looking at the RLHF objective is very important. So what we're trying to do is we're trying to learn a policy pi that's maximizing this kind of expected reward. The R phi here means that the it's a learned reward model, but we don't want to change the model too much. And to do this, we have this KL penalty, which is the distance between the initial model. The initial model is normally the SFT checkpoint that we trained earlier in the three-step recipe and where the new model is with RL. Pi ref is the reference model in this case.

[00:28:06]And it's just like we want to improve the behavior, but we don't want to change the model too much. In early RLHF work, when you change the model too much, the model degrades into total yapping. It'll repeat characters like JavaScript. You might have language inconsistency and and all sorts of problems.

[00:28:24]One of the biggest breakthroughs in the last few years of RLHF kind of asked the question theoretically of what if we can optimize this more directly? The direct preference optimization paper is essentially a fun bit of theory that derives a direct gradient to this equation above for an optimal solution pi star. And the gradient looks at the internals of the language model. I have a missing parenthesis on this slide. And um you can submit PRs to the slides on GitHub.

[00:28:52]Fun fact, if I don't fix any of this stuff by now. And you actually don't need to train an extra reward model.

[00:28:58]This will all be covered in a separate lecture. And it is a wonderful paper and this made RLHF and post-training so much more accessible. It's far simpler Oh, it's a ren- It was a rendering bug. Fun.

[00:29:10]Fun.

[00:29:10]DPO is far simpler to implement. It's far cheaper and you can achieve a lot of the final performance. I like to ask leaders in industry like am I killing myself by training our models like too slowly with DPO? And no one has proof that DPO is actually bad for the models even though if it is so much simpler and seems like a bit of a shortcut.

[00:29:32]As I said earlier in this like there are methods that are used but aren't well documented. One of these is rejection sampling.

[00:29:39]Um a formal procedure is on the page, You generate completions from your model.

[00:29:46]You use the reward model keep the best completions and then you do more supervised fine-tuning, more SFT, more instruction fine-tuning on the model you already trained just on those best on policy completions.

[00:29:58]So, people have used this, it works really well. I personally haven't used it in practice and gotten it to work. I think there's some complexity in the reward modeling stage.

[00:30:07]So, if we look at this, this is why I describe preference tuning as a kind of holistic area. I think that there's this rejection sampling idea where you train a reward model and use it a bit more simply.

[00:30:19]There's this complicated online RL example with proximal policy optimization, this algorithm, which is kind of the state-of-the-art and always has been the goal. And there's these methods like direct preference optimization, which still use this contrastive data. You don't have to go through training a reward model and they work pretty well. So, there's a big distribution of things you can do, it'll only grow over time and post-training is like how do you choose which of these many methods to get the most out of your model given your specific evaluation goals.

[00:30:49]When doing this actual RLHF, we have to remind ourselves that we don't explicitly have a reward function.

[00:30:56]So, overfitting to our kind of learned proxy, we learn a proxy through the data in the form of a reward model, it's like only at its best correlated with user satisfaction and preference. You could read the Goodhart's law quote, but in reality there's this idea of reward model over-optimization. You can get very verbose models, you can get sycophancy, and I think that the technical literature of over-optimization is the place to start and kind of making formal understanding of ways that RLHF could go wrong on the product side. I think there's relatively little work here versus its importance because so much of this is productized and behind closed doors.

[00:31:39]Over time, the training recipes used for models have evolved for a lot. When I started doing RLHF in early 2023, the idea was we need to buy 10,000 instruction tuning data points with human answers, we need to get 100,000 preference points from somebody like Scale AI, and we need to train on those prompts after we had a reward model.

[00:31:57]This has all scaled up a lot. Um for standard recipes, they're using millions of prompts at every stage. I'll talk about how the prompt is kind of the core unit of post-training because it with a prompt, you can generate a completion, you can do other things, and it defines the distribution of your model. When we go to DeepSeekR1, the types of recipes that are being done as RL has scaled up so much, I put NA here cuz it's kind of just like it's just different. Um the recipes again have become more diverse. It's like 223 is kind of like a mature version of the InstructGPT recipe, but there's a lot of variety in emerging and how you can use these different methods, and this is kind of a transition from when like when RLHF became something more like post-training.

[00:32:40]So, if you look at this InstructGPT recipe visualized, you take a base model, you do instruction tuning, you get SFT, which helps your reward model.

[00:32:47]You put these two paths together and you get this kind of final RLHF optimization to this. It's like this is uh probably directionally like what the original ChatGPT was trained with, but over time has gotten much more complicated. And this is why I say like from RLHF to post-training, where now there's like many model stages. You could have ton You could have five, you could have probably have 20 intermediate model versions between getting this final model. And post-training is this really complicated process of using all the tools in your toolbox to get the best possible model out, and RLHF is kind of just one of these tools. Like preference tuning is important, but it is no longer the single definitive thing that people are using models on. Part of the reason why this is still the RLHF book is because RLHF is not evolving as quickly, and this is a useful snapshot in time for what the methods and kind of fundamental problems are. Writing a post-training book would be much harder because far more industrialized, far more opaque, a kind of much more rapidly evolving in ways that aren't as like academically studied.

[00:33:50]So, if you kind of look at this over time, you can summarize it again year by year. It's like 2023, people were making simple SFT recipes for better chatbots or things like Alpaca. 2024, DPO became very popular and training stages started to get more complicated. But then in 2025, we have like RLVR, we have agents and things like this. So, I kind of say 2024 is the year when it stopped being more about RLHF and started being more about post-training. And that hasn't really changed. I'm this book has been a multi-year project and I am impressed by the longevity of the things that I've put into it, which kind of goes to capture that this kind of core RLHF area is fairly protected, and then the post-training is some of the like more complex industrial stuff built on top of it.

[00:34:37]I like to talk about one of RLHF's main critiques that I've dealt with for multiple years. RLHF was often dismissed as just being style transfer, and people were like, "Oh, pre-training is everything you need. Instruction tuning is everything you need." And there was papers on this like LIMA, less is more for alignment, that I think were kind of fundamentally misguided. You can talk about the bitter lesson, and it's saying you don't need a lot of data to extract the performance from the models. Which like when is ever anything in deep learning been better with less data and less compute? That kind of made me think that people had a fundamental misunderstanding. And I I think that it's just like like how does post-training interact with the scale? Like why is why was RLHF viewed just as format? And I think that like formatting is underappreciated where where format is value. I think I use the example of the Sapiens book where um Yuval Noah Harari, God I I don't know his last name, Harari I think. Um he rewrote human history in such a compelling way that it was one of the best-selling books of all time. And RLHF is manipulating the information that the model outputs in a way that is fundamentally way more useful. It makes the model get better benchmark scores, but what people see is mostly that the shape of the text changes a lot.

[00:35:54]And like an intuition an intuition that's brought me through the first few years of post-training is that excuse me, you can take the same base model and change your post-training recipes and get wildly different performances. These are two real Omo E models that we trained at AI2 and released. And it's like base models determine the ceiling and post-training's research job is to reach it. But the base model ceiling is actually really, really high and but base models are getting better really fast. So, it's like doing post-training well is kind of always this like fast art of getting the most performance out of it. But it does not really feel like a like a light touch. It's not easy to extract it. I think that it's like there's some tension in the things that I'm saying between the less data and the extraction, but I call this kind of taking pulling performance out of the model, the elicitation theory of post-training. And it links the stuff I've been saying where it's like this auto-regressive prediction is not the metric that matters to users.

[00:36:51]And RLHF is kind of pulling the best parts out of the model in this new format where it's a bit more useful.

[00:36:58]I think that this is I call it the elicitation theory. That I posted this sometime last year where I described it to a like an F1 chassis where you start the year with this chassis and you iteratively involve evolve around it and make these recipes, and you can get so much out of this base model, but it's not an easy process.

[00:37:16]Here's an interesting recent paper from 2026 that kind of revisited and and tested the specific theory of LIMA, which I thought was fun in making these lectures.

[00:37:27]Where this takes us is that post-training has a really new defined frontier. I think that frontier is obviously reinforcement learning with verifiable rewards.

[00:37:36]Um fun fact, I was on I was leading the paper of that named RLVR, which was 223, which was written way well before DeepSeekR1. And part of the fun of doing open research is that you get to name these things. You get to be on the ground when it is happening. But RLVR is much simpler than RLHF, where it removes the reward model and replaces it with a verifiable reward. And this has enabled things like scaling RL compute, it unlocked inference time scaling, which is spending more compute at generation time per problem, and you get a whole 'nother scaling law that can emerge. And this was just a transformative breakthrough mostly through OpenAI's O1 model and DeepSeekR1. And you can see the system diagram is very similar. You just take the reward model and you change it to a reward function again, which is almost better suited to the fundamentals of RL.

[00:38:25]I think when you compare these, there's a lot that can be said. I talked a lot about classic RL. It's like very granular, it's multi-step. The environment is very deeply embedded in the problem, whereas like RLHF really removed the environment, and RLVR is starting to add it back in with the notions of agentic models acting. This course doesn't really talk about agentic models, but we've kind of go through these transitions where old ideas become new again, and they all have very different like canonical problems on on what they solve. And I think there's going to be a lot great educational material that compares these, and there's a lot of depth in what is going on and what it says about the fundamental field of reinforcement learning.

[00:39:13]Okay, we're back. Brief pause. This is at the transition to kind of where like some of the elicitation theory breaks down and my intuition seemed a little um outdated is where we're transitioning post-training into the scaling RL era.

[00:39:27]And I think this is the exciting part of the field and one that is honestly not captured that in that much in the book because it is evolving so fast. But it is important color for anyone learning about the topic and thinking about diving in that like there is an area that is rapidly evolving and it's where the excitement is.

[00:39:48]So, I think opening eyes seminal plot with when they released 01 preview which was the first reasoning model publicly available um is as influential as the original scaling laws plot. But sometimes it's a little bit misunderstood because there's a lot left to the reader to understand.

[00:40:05]I think we have it has two plots. Let's break them down each individually.

[00:40:09]The first is that there is test time scaling. The x-axis is test time compute log scale and the y-axis is performance.

[00:40:16]Test time compute can be thought of like how many tokens are spent on the problem. You take a specific model size and you get it to spend more tokens and as it spends more tokens you can kind of bucket each bin each response and see what the performance would be there.

[00:40:30]This is a fundamental new property that was really unlocked with RLVR. Inference time scaling is not only done with this kind of long chain of thought but you could also do it by spotting multiple agents in parallel and so on. But this is what most people took away from the 01 plot is this inference time scaling thing which is a log linear relationship between more compute on the x-axis the log scale and the performance goes up linearly.

[00:40:55]This unlocked a lot of new behaviors in coding and agents and tool use was just that like it created this way to dump more compute on hard problems.

[00:41:04]The thing that is often forgotten forgotten is that the training time compute here is with reinforcement learning. So, it showed that you can scale reinforcement learning compute the x-axis and you will also get better performance over time.

[00:41:19]And this is obviously correlated with a longer sequence length but we've seen multiple examples in the literature where I don't think it is only sequence length and is that you can fundamentally learn things by doing a lot of RL compute on the model and this is where the most excitement on the frontier models of today is is that you can scale reinforcement learning compute have the model learn to use tools and fundamentally improve the models in a new domain of training time compute a fixed time cost. So, both pre-training and post-training now have very high um fundamental compute costs.

[00:41:54]I think there are other examples of this publicly. For one recent one cursor has been training their composer models with scaled reinforcement learning on on agentic coding tasks. And they showed one of these plots where you have an x-axis log compute and a y-axis performance. They showed that they released their original cursor one model with the scaling RL uh composer one and composer 1.5 was just what is this probably about 10x more RL compute to get even more performance. It looks like it's going up continually but the performance is increasing on a log scale. This is something we've seen many times as well.

[00:42:30]The famous plot is from DeepSeekR 1 where they showed RL training steps on the x-axis and performance on the y-axis on the math about AMY or sequence length during this RL zero phase which was directly on a base model. There are very few plots like this that show compute or log compute on the x-axis. This is a linear x-axis and then performance on the y-axis.

[00:42:52]The one that I partook in was Olmo 3.1 think 32B where we originally released our Olmo 3 think not 3.1 after about a week of RL training on 200 GPUs. And we were like, well, what the heck? We don't really know what's going to happen.

[00:43:07]Let's keep training and we left it going for another 3 weeks and the scores were just going up and up. This is all downstream of valves just getting better with more RL compute.

[00:43:17]This model was not done going up but again the log scale really starts to bite. So, this is just like a transition where post-training used to be fairly cheap in compute but now it is a substantial amount of time. In 28 days um you would need more compute to do so but in a month you can pre-train a model. I think that like post-training taking this long is really changing the release cycle how people think about doing research and imposing restrictions on the need for new scientific methods to understand how to best to improve your model and spend time compute and focus.

[00:43:52]Like where this leaves us is that it seems like post-training RLHF changing faster than ever before. I would say that the specific direction with agents tool use and scaling RL is so exciting. It'll mostly be a separate book. But it's like the problems of human preferences and it's like how do you best capture a lot of information in a comparison to two things and capture the intent of a human rater in something that is as weird and messy while also deep and nuanced as language. So, the human aspect of this and like capturing these preferences is always going to be in these models forever and there's kind of fundamental problems. And the book is is about these fundamental problems that we're going to study. It is about measuring rewards. It is about over optimization and it gives you the horizontal capture of data of domains of how you evaluate models of what people think about in products to kind of understand the full picture of how this field of RLHF emerged.

[00:44:54]So, this is lecture one. It's overview.

[00:44:56]It covers the first three chapters in the few in the future lectures I will um highlight which chapters I'm covering at the beginning. I just thought this was a nice overview of RLHF and post-training without getting too technical.

[00:45:10]And I will break the lectures up in intuitive ways. Some lectures will be shorter. Some like this reinforcement learning chapter will probably be multiple lectures all on its own um in order to segment the information best to users. So, please get in touch with me on social media at on GitHub for RLHF book issues and any else any other minor things you see in the process of this and I'm very excited to work through this. I think there's more links here. It's built with this tool colloquium that I will talk about when I am marketing it. But really uh thanks for listening and I hope this is helping you. Bye-bye.

[00:45:51]Before I go, I want to add that in the slide tool I built I'm having the references for everything here. So, obviously you go to the book but if you just want to use the slides and find out where to link more look more I have all of this in here and all the other lectures well as well. So, trying to make everything helpful.

[00:46:07]Great.

[00:46:08]Bye.

関連おすすめ

OpenHuman VS Hermes AI: Who Wins?

JulianGoldieSEO

285 views•2026-05-29

Long-Running Agents — Build an Agent That Never Forgets with Google ADK

suryakunju

142 views•2026-05-30

This computer is made from real human brain cells. And you can buy it.

Talktmsmedia

3K views•2026-05-28

BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2

aimmediahouse

122 views•2026-06-03

I Made the Same Anime Fight Scene in Every AI Video Generator

NobleGooseAnime

295 views•2026-05-30

Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S

cnnnews18

3K views•2026-06-01

I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)

AICodingDaily

298 views•2026-05-29

3D Platformer Update - NO CAPES

SolarLune

294 views•2026-05-30

トレンド

The Casino Had Us Guessing All Day

VegasMatt

157K views•2026-06-03

The Dancing Plague...

HoodieGuyStories

1730K views•2026-05-30

The Fastest Way To Board A Plane 😮

zackdfilms

6504K views•2026-05-29

DOOM Runs On Everything...except Neo Geo

ModernVintageGamer

143K views•2026-06-01