The Hitchhiker's Guide to Frontier Reinforcement Learning
Added to this site: 2026-05-05

1,037 views · 75 · 28:45 · rishabhagarwal7540 · Original video published: 2026-04-28

Agarwal masterfully bridges the gap between theoretical RL and scientific reality by treating nature as the ultimate reward signal. This guide correctly identifies that at the trillion-parameter scale, engineering pragmatism and asynchronous efficiency are far more vital than algorithmic complexity.

[00:00:03]Hi everyone, thanks for having me. I am Rishabh Agarwal from Periodic Labs, and today I'm going to talk about what it takes to do reinforcement learning at the frontier: not on benchmarks, but on scientific problems with trillion-plus-parameter language models. Think of this as a practical guide. I'll talk about some of the things we have learned and also some of the infrastructure we have built to make reinforcement learning work at this kind of scale. All right, let's dive in. Before I start the talk, I would say that I believe accelerating science is one of the most valuable uses of AI. The question is: are we actually directing compute towards accelerating or advancing science, or are we mostly scaling RL on tasks such as math and code and hoping that science will eventually benefit? At Periodic we took a slightly different approach than others.

[00:01:02]So instead of waiting for, let's say, general-purpose reasoning to trickle down, we asked what happens if we just point reinforcement learning compute directly at real scientific problems, with experimental data serving as a way to give us some sort of reward. Our core insight is actually very simple: to create an AI scientist, you actually have to do science. Progress has historically been fastest in verifiable environments, and in our setup nature itself is the RL environment. You can imagine the AI agent sitting at the center, connected to real physical laboratories, things like a physics lab or a chemistry lab. The agent proposes hypotheses, designs experiments, and gets a reward signal back from the actual experimental results, so the reward depends on the experiment it proposed and what happened in it. Let's make it a bit more concrete. Here I'm just showing you the scientific method, nothing really new: you form a hypothesis, you run experiments, you collect observations, you analyze these results, and then you refine the hypothesis. This is the loop of what typically happens in most of the empirical sciences, and this is also true for the physical sciences. Now there are two distinct bottlenecks in this loop.

[00:02:25]On the right-hand side you see hypothesis generation and analysis of existing observations.

[00:02:31]There you are actually bottlenecked by intelligence, and this is where large language models and reasoning models can come in and help a lot. I think this is where AI has the most promise. But there's an equally important bottleneck on the left-hand side, which is actually running experiments. Doing so might be cheap in certain domains and much more expensive in others. For example, in the physical sciences you need high-throughput autonomous laboratories that can execute experiments at something close to the pace at which an AI agent wants to iterate. Without that, you really have a brilliant reasoner with nothing to reason about, in some sense. The key insight for us at Periodic is that intelligence alone is necessary but not sufficient. You don't create new scientific knowledge just by thinking really hard; you create it when ideas are tested against reality. That's why at Periodic we are building both of these sides: the AI scientist as well as the autonomous laboratories for it to operate in. All right, so what does it look like in practice? On the left-hand side you're seeing an image from an autonomous laboratory. This is actually the lab we have in Menlo Park, where robotic systems can actually synthesize materials.

[00:04:04]And notice that this is not a simulation. This is also not some sort of benchmark. This is a real lab producing actual experimental data and trying to make new materials.

[00:04:14]And then here's the observation step, which is also known as materials characterization. Here you're trying to figure out what actually happened in the material synthesis. One of the primary techniques we use is X-ray diffraction, or XRD.

[00:04:31]You can see the diffractometer in the image here on the bottom, where I'm moving my cursor, and a typical diffraction pattern it produces, which is the squiggly graph on the plot I'm hovering over. XRD is how you identify what phases, or what crystal structures, are present in the material you have synthesized. It is basically a measurement of the underlying ground-truth state of the world. So the lab makes a sample, the diffractometer measures it, and now you have real experimental data that tells you whether your synthesis actually produced what you intended, or what else happened in the synthesis. One thing you need for analyzing this sort of data is phase identification, which is taking the diffraction patterns from the characterization step and analyzing them to figure out which crystalline phases or materials were actually synthesized in the process.

[00:05:36]And this is where we actually start, because this is both an important bottleneck and weakly verifiable. What you do is take a diffraction pattern from the instrument and search against the structure databases. The model then predicts which crystalline phases are present and with what probabilities, and you confirm this with complex tools such as Rietveld refinement, which is a physics-based fitting procedure that validates whether the predicted phases actually explain the observed pattern. Now this is a really hard problem. A single diffraction pattern can contain overlapping peaks from multiple phases, and the model needs to reason about crystal symmetry, peak positions, and relative intensities, all while accounting for instrument noise.

[00:06:25]This is also the task where we apply RL: the model gets better at phase identification by training on lots and lots of data from our lab, with the refinement step providing some sort of ground-truth signal.

[00:06:38]There are a couple of reasons why we start at this analysis step. First, it's actually an important bottleneck. Today, this phase identification step is largely done manually with complex tools by expert crystallographers. If we want to run thousands of experiments autonomously, we need this step to be automated and reliable. Second, and this is critical from an ML perspective, this step is weakly verifiable: we can check the model's prediction against the observed XRD measurements from the characterization step. So for the rest of this talk I'll focus on how we use RL to train models for this scientific analysis task and what we have learned along the way. There's of course the harder step of hypothesis generation; I'll talk a bit about that later. All right, so what does it look like? This is what happens when you scale reinforcement learning compute on real scientific tasks. Of course, underlying this plot is also the amount of data we are using, which is produced by the lab. What's remarkable here is that these curves look a lot like the scaling curves you see for math and code training. So in a sense we can say that reinforcement learning compute scaling works for science too, even when your reward signal comes from physical lab data instead of some sort of unit tests. More importantly, there's no synthetic data here and no textbook examples. These are actual measurements from the instruments in the lab, with all the noise and complexity that entails. So I think this is a really cool result. The other thing is that RL actually expands the underlying LLM's capabilities here, and these plots show what RL is really doing. 
So the red lines here are the best-of-N performance, which is the best answer the model produces across multiple samples, and the black line is the mean. You can see that at the start of training the best-of-N is already around 63%,

[00:08:47]meaning the base model can produce correct answers; it just doesn't do so reliably. As reinforcement learning progresses, the mean catches up to where the best-of-N performance started. So RL is essentially making the model's average performance match what was previously its best-case performance. But here's what's even more interesting: the best-of-N performance is also climbing, as you can see from these plots. RL isn't just concentrating probability on the best known solutions; it is also expanding coverage, meaning it's teaching the model to solve problems it couldn't solve at all before. And here's the same scaling chart I showed you, but with an additional reference point. The dotted line I added is Kimi K2.6, which is one of the strongest open-weights models available, run with a context length of 230K. This was just released last week, and what I'm showing is that our previous RL runs with worse open-weights models already surpass Kimi K2.6 fairly early in RL training. What's exciting is not the fact that we beat it, but the fact that we can now use this new model as a base for our future RL runs.
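
The mean versus best-of-N comparison from the plots can be made concrete with a small sketch. The data layout, the function name `mean_and_best_of_n`, and the toy scores are all made up for illustration; they are not the speaker's evaluation code or results.

```python
from typing import Dict, List


def mean_and_best_of_n(scores_per_prompt: Dict[str, List[float]]) -> Dict[str, float]:
    """Average score and best-of-N score over a set of prompts.

    scores_per_prompt maps a prompt id to the scores of N sampled answers
    (hypothetical layout: 1.0 = correct, 0.0 = incorrect).
    """
    means = []
    bests = []
    for scores in scores_per_prompt.values():
        means.append(sum(scores) / len(scores))  # what a single random sample gets
        bests.append(max(scores))                # what the best of the N samples gets
    n_prompts = len(scores_per_prompt)
    return {
        "mean": sum(means) / n_prompts,
        "best_of_n": sum(bests) / n_prompts,
    }


# Toy numbers, invented purely to illustrate the two effects.
before_rl = {"p1": [1, 0, 0, 0], "p2": [0, 0, 0, 0], "p3": [0, 1, 0, 0]}
after_rl = {"p1": [1, 1, 1, 0], "p2": [0, 1, 0, 0], "p3": [1, 1, 1, 1]}

stats_before = mean_and_best_of_n(before_rl)
stats_after = mean_and_best_of_n(after_rl)
print("before:", stats_before)
print("after: ", stats_after)
```

In this toy data the mean after RL reaches the best-of-N level from before RL (sharpening), and the best-of-N itself rises because prompt `p2` becomes solvable (expanded coverage).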

[00:10:00]So in a sense the open-source community can provide us the base to start from, and RL can further raise the ceiling on the kind of tasks we care about. All right, so this talk is going to be more about scaling RL at the frontier. This is a bit of a cheeky slide before the technical section, to capture some of the discourse around RL for LLMs. On the left there's a simple intuition: if any change in the RL stack does better, we should do more of it, and if it does worse, we shouldn't do it anymore. And I guess we reach the same conclusion on the right, once we develop a deep understanding of how RL works or doesn't. But in the middle there's a lot of discourse, on Twitter and in the general community, around what we should be doing. Everyone's debating things like PPO versus GRPO versus DAPO, asymmetric clipping, whether you need FP32, and a lot of other things. The truth is that the core algorithm is actually simple. The hard part is the engineering to make it work reliably at scale, with trillion-plus-parameter models, messy data, and all these things. This is what the rest of the talk is going to be about: some of the practical lessons we have learned scaling RL. More concretely, here's the roadmap for this talk. I'm going to talk not about the data aspect of things but about stable RL infra for large MoEs with 1 trillion plus parameters, which Kimi K2 is, and what that entails. Some of the things I will be touching on are train-inference discrepancy, async RL, and easy-to-tune, stable algorithms.

[00:11:47]Before I get into our specific findings, let me set up the framework we use to think about RL scaling. This comes from the Art of Scaling Reinforcement Learning Compute paper, which is presented as an oral at this conference. The key insight is that RL training generally follows a sigmoidal scaling law if you're measuring bounded metrics like accuracy or reward, and you can fit that with the formula you see on the slide. There are two important things going on: different design choices can affect the asymptotic ceiling you reach for a given model and data, as well as how fast you reach that ceiling, which is the compute-efficiency term. We generally use this framework to think about all of our RL runs, and it lets us predict whether more compute will help or whether we have hit a ceiling and need to change our approach. One of the main findings from that paper is that different recipes have different ceilings as well as different efficiencies. So if you pick the wrong recipe, you can train as long as you want, but it may not reach the ceiling of some other, correctly picked recipe. Not every recipe is equally good, and it's not just a matter of efficiency: some changes might lead to a different ceiling itself.
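
The formula on the slide is not captured in the transcript, so the sketch below assumes one common sigmoidal parameterization with a starting value, an asymptotic ceiling, an efficiency exponent, and a midpoint compute. The parameter names and the two "recipes" are hypothetical; they only illustrate how a faster recipe with a lower ceiling can lead early and still lose at large compute.

```python
def sigmoid_scaling(compute: float, r0: float, ceiling: float,
                    slope: float, c_mid: float) -> float:
    """Predicted bounded metric (e.g. accuracy) after `compute` units of RL.

    Assumed form: R(C) = r0 + (ceiling - r0) / (1 + (c_mid / C) ** slope)
      r0      : performance before RL
      ceiling : asymptotic performance the recipe can reach
      slope   : efficiency exponent (larger means a sharper rise)
      c_mid   : compute at which half of the total gain is reached
    """
    if compute <= 0:
        return r0
    return r0 + (ceiling - r0) / (1.0 + (c_mid / compute) ** slope)


# Two made-up recipes: A is efficient but has a low ceiling,
# B is slower but has a higher ceiling.
recipe_a = dict(r0=0.30, ceiling=0.60, slope=1.0, c_mid=100.0)
recipe_b = dict(r0=0.30, ceiling=0.75, slope=1.0, c_mid=1000.0)

for c in (100.0, 1_000.0, 100_000.0):
    a = sigmoid_scaling(c, **recipe_a)
    b = sigmoid_scaling(c, **recipe_b)
    print(f"compute={c:>9.0f}  recipe A={a:.3f}  recipe B={b:.3f}")
```

Fitting such a curve to the early part of a run is what lets you ask whether more compute will help or whether the run is already near its ceiling.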

[00:13:30]One of the things that can really affect the ceiling is train-inference mismatch in on-policy RL. Let's say you're doing on-policy RL, meaning you generate rollouts and train using exactly the same weights. This is still not exactly on-policy, and the reason is that you actually have two separate engines when doing any sort of RL for LLMs. You have the sampler, typically an engine like vLLM or SGLang, which generates the rollouts, and then you have the learner engine, something like Megatron, where we compute the policy gradient, the loss, and the weight update. Policy gradient typically assumes that these are the same model: that you're sampling actions from some policy and updating exactly that policy. But in reality, if you do on-policy RL with a fixed set of weights, you end up sampling from the sampler policy and training on the learner policy. Even though they have the same underlying weights, there is a mismatch between the log probabilities they assign to different tokens, because they use different kernels, maybe even different numerical precisions, and a bunch of other things, and all of these compound together. The scatter plot here shows you the mismatch. If you take a policy and run it through the sampler and the training engine, you would expect all the tokens to lie on the y = x line, meaning all the token probabilities are exactly the same between the training and sampler engines. But you actually see a very visible spread.
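
A minimal sketch of how such a mismatch could be quantified, assuming you have already collected the log-probability each engine assigns to every sampled token. The function name `mismatch_stats` and the toy probabilities are hypothetical; no real vLLM, SGLang, or Megatron API is used here.

```python
import math
from typing import Dict, List


def mismatch_stats(sampler_logprobs: List[float],
                   learner_logprobs: List[float]) -> Dict[str, float]:
    """Compare per-token log-probs of the sampled tokens from the two engines."""
    if len(sampler_logprobs) != len(learner_logprobs):
        raise ValueError("both engines must score the same tokens")
    n = len(sampler_logprobs)
    pairs = list(zip(sampler_logprobs, learner_logprobs))
    # Distance from the y = x line in probability space, per token.
    prob_diffs = [abs(math.exp(s) - math.exp(l)) for s, l in pairs]
    # Tokens were drawn from the sampler, so the mean of
    # (log p_sampler - log p_learner) is a Monte Carlo estimate of
    # KL(sampler || learner).
    kl_estimate = sum(s - l for s, l in pairs) / n
    return {
        "max_prob_diff": max(prob_diffs),
        "mean_prob_diff": sum(prob_diffs) / n,
        "kl_estimate": kl_estimate,
    }


# Toy probabilities for three sampled tokens (made up for illustration).
sampler = [math.log(p) for p in (0.9, 0.5, 0.2)]
learner = [math.log(p) for p in (0.9, 0.1, 0.25)]

stats = mismatch_stats(sampler, learner)
print(stats)
```

With identical engines every statistic would be zero; tracking these over training is one way to notice a run drifting before it collapses.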
Some tokens have probability estimates that differ by as much as 0.4, which is a pretty large difference in probability. So there is an underlying mismatch in your training objective itself, even when you're doing on-policy RL, and this mismatch is actually worse for larger MoE models. If you train with this sort of mismatch, what we generally see is that training works fine until it doesn't. It's almost like a silent killer of your runs: your RL runs would look really fine. The plot here is for a large MoE from this paper, where they keep everything else fixed and just reduce the amount of mismatch between trainer and inference by fixing the discrepancy sources one by one. You can see that once they fix enough of these sources, you get stable learning. This is what I mean when I say RL really wants to work for small LMs, because the mismatch is not that pronounced there, but you actually have to make it work for larger MoEs, models like Kimi K2 or the other frontier open-weights models. All right. One of the common tricks we use to mitigate this train-inference discrepancy is called router replay. In a mixture-of-experts model, the router decides which experts process each token, and the problem is that the inference engine and the training engine might route tokens to different experts, even with identical weights, due to the mismatch I just talked about. So the fix is basically simple.

[00:17:15]During rollout you record all the routing decisions, that is, which experts were selected for each token, and then during the training forward pass, instead of rerunning the router, you replay these saved decisions. This guarantees that the same experts are activated, so that the log probabilities actually match. What we see is that there is a substantial reduction in the effective sample size if we don't do router replay, but once we add router replay, the effective sample size, or ESS for short, improves. Generally, higher ESS is better; it is basically a diagnostic that tells you how stable your learning is, and we see a sizable impact from doing this. Now of course router replay isn't free; there are multiple costs associated with it. The first one is infra complexity: you need to plumb these routing decisions from the inference engine through to the training engine, which is a non-trivial amount of systems engineering. The second is overhead, because we are actually saving and replaying these decisions, so it adds cost to both training and inference. If you use the default implementation from vLLM, it degrades your throughput much more, while our in-house implementation is about 60% faster; this is done by optimizing how we capture and transfer the routing data. And here's a fun one: sometimes the mismatch isn't even in your code. It's in the GPU kernels themselves. 
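
The ESS diagnostic mentioned above can be sketched as follows, assuming ESS is the usual importance-sampling quantity computed from per-token ratios between learner and sampler probabilities, and normalized so that 1.0 means no mismatch. The function name `normalized_ess` and the toy numbers are hypothetical.

```python
import math
from typing import List


def normalized_ess(learner_logprobs: List[float],
                   sampler_logprobs: List[float]) -> float:
    """Normalized ESS in (0, 1]: (sum w)^2 / (n * sum w^2).

    w is the per-token importance weight p_learner / p_sampler.
    A value of 1.0 means all weights are equal (no mismatch); lower
    values mean a few tokens dominate the weighted update.
    """
    weights = [math.exp(l - s) for l, s in zip(learner_logprobs, sampler_logprobs)]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (len(weights) * total_sq)


# Toy example: four tokens, one of which the learner scores 4x higher.
sampler_lp = [math.log(p) for p in (0.5, 0.5, 0.5, 0.1)]
learner_lp = [math.log(p) for p in (0.5, 0.5, 0.5, 0.4)]

ess_matched = normalized_ess(sampler_lp, sampler_lp)
ess_mismatched = normalized_ess(learner_lp, sampler_lp)
print(f"matched engines:    {ess_matched:.3f}")
print(f"mismatched engines: {ess_mismatched:.3f}")
```

In this toy case a single badly mismatched token pulls the normalized ESS well below 1.0, which is the kind of drop the diagnostic is meant to surface.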
One fun bug we discovered came up when we wanted to do sequence packing, which is a very common optimization: rather than padding, you pack multiple sequences together so that you use as much of your max context length as possible, and you change your attention mask so that training still optimizes the same loss. This is what is shown on the right-hand side. Once we switched to that, we started seeing a much larger discrepancy between train and inference, and on digging further, we found that the causal and the padding-causal kernels actually produce different attention outputs for the same inputs. We typically use the causal kernel during autoregressive generation, but with sequence packing we use the padding-causal kernel, and that has this huge mismatch: it was apparently four times larger between these two kernels in terms of KL divergence. So when you're working at the frontier you can also run into these kinds of kernel bugs, which can create a huge amount of mismatch. The nice thing is that Nvidia is currently working on an upstream fix, so that everyone can use it. All right, switching gears. The second major challenge in large-scale RL is that it has to be asynchronous. What I mean is that the default synchronous RL, where you alternate between generating a bunch of rollouts and then training on them, is actually very wasteful. The reason is that rollouts take a variable amount of time to finish, and your GPUs are sitting idle waiting for the longest rollout to finish. 
So you can see how wasteful this is: for all the time that the longest rollout hasn't finished, the other GPUs are just sitting idle. Clearly there is something better, and this better thing is called async RL, where we overlap some of the rollout generation with training so there's not as much waiting time. As soon as a rollout is generated, it goes into a buffer, while the trainer pulls batches from that buffer. So while the trainer starts working on the first batch, the generator is already generating the second batch, which is what I'm showing here. Now there's an even better way of doing async RL, which is called PipelineRL. If you go back to the default picture, there is still some idle time: once we have finished generating almost all the rollouts except some of the longer ones, the GPUs are sitting idle. Instead, what PipelineRL says is that inference runs continuously at a near-constant batch size: as soon as one rollout finishes, a new prompt immediately takes its slot. So you're always running inference. 
Now during this time the trainer is also training, which means you need to update the weights of the inference engine, and this is where the fun stuff happens: you pause inference temporarily, do the weight update, and then continue generating. This has a huge impact on computational efficiency. As you can see, the default way of doing off-policy RL, which is what we call PPO-off-policy, is substantially less efficient in terms of the scaling fits we got in the ScaleRL work. What we have done is enable pause and resume during weight updates, where the inference requests are not drained, and this is a very important piece for us in squeezing efficiency out of our infrastructure.
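
To see why waiting on the longest rollout is wasteful, here is a toy simulation comparing lock-step batches with continuous slot refilling. It is a sketch under strong assumptions: rollout durations are known up front, weight updates cost nothing, and the function names and numbers are invented for illustration. It does not model the pause-and-resume weight update itself.

```python
import heapq
from typing import List


def sync_utilization(durations: List[float], slots: int) -> float:
    """Lock-step generation: each batch waits for its slowest rollout."""
    total_time = 0.0
    for i in range(0, len(durations), slots):
        total_time += max(durations[i:i + slots])
    return sum(durations) / (slots * total_time)


def continuous_utilization(durations: List[float], slots: int) -> float:
    """Continuous generation: a freed slot immediately takes the next prompt."""
    free_at = [0.0] * slots          # time at which each slot becomes free
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)       # earliest free slot
        heapq.heappush(free_at, start + d)   # busy until the rollout ends
    makespan = max(free_at)
    return sum(durations) / (slots * makespan)


# Mostly short rollouts with an occasional long one (made-up durations).
rollout_durations = [1, 1, 1, 8, 1, 1, 1, 8]

sync_util = sync_utilization(rollout_durations, slots=4)
cont_util = continuous_utilization(rollout_durations, slots=4)
print(f"synchronous batches: {sync_util:.1%} of slot time busy")
print(f"continuous refill:   {cont_util:.1%} of slot time busy")
```

Even in this tiny example the lock-step schedule leaves most slot time idle, and the gap widens as rollout lengths become more variable.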

[00:23:09]Now one thing you'll notice is that async RL uses stale data, meaning the policy that generated the data may not be exactly the policy that is being trained right now. This is commonly known as off-policy RL training. A common way to deal with this is importance sampling. The RL objective is to maximize expected reward under the current policy, but we're sampling from an old policy, and importance sampling corrects for this by reweighting each sample by the ratio of the current policy's probability to the old policy's probability. This is actually very simple to do, and there are different variations that people use, which brings me to this slide: not all algorithms for doing so are created equal. Here I'm showing you some results where methods like clipped importance sampling, or CISPO, do much better than others like DAPO. But even if these things were comparable, one thing the ScaleRL people pointed out is that the asymmetric clipping in DAPO is extremely sensitive to hyperparameter tuning, and my claim generally is that hard-to-tune algorithms are just not worth it at larger scale. Also, any new hyperparameter you add is one more knob to tune. So a simple rule I follow is that we should pick the simplest choice that works. For example, we should just drop KL regularization if things work without it. I think that's what we do here at Periodic also. 
Now, the method we have picked, which works well enough for us, is called masked importance sampling REINFORCE. The idea is very straightforward. Even after reducing the train-inference mismatch with things like router replay, kernel fixes, and a bunch of other things, there is still a small number of tokens where the discrepancy between training and inference is large. Usually this is, let's say, less than 0.002% of tokens, or maybe even smaller. So what we do is simply mask these tokens out of the gradient update, and this simple trick works pretty reliably. These results are shown for another Kimi K2 run, where we see that the discrepancy does not increase as we do more and more training. This makes training extremely stable, which makes it very easy to train for really long amounts of time.
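
A rough sketch of the masking idea, in plain Python without autograd. The talk does not give the exact objective, so everything specific here is an assumption: the thresholds `low` and `high`, the normalization by total token count, and the function name are hypothetical. In a real training framework the ratio would be treated as a constant (stop-gradient) and only the learner log-probability would carry gradient.

```python
import math
from typing import List, Tuple


def masked_is_reinforce_loss(learner_logprobs: List[float],
                             sampler_logprobs: List[float],
                             advantages: List[float],
                             low: float = 0.5,
                             high: float = 2.0) -> Tuple[float, List[bool]]:
    """Surrogate loss with importance weights, dropping badly mismatched tokens.

    For each token, ratio = p_learner / p_sampler. Tokens whose ratio falls
    outside [low, high] (assumed thresholds) contribute nothing to the loss.
    Returns the loss value and the per-token keep mask.
    """
    total = 0.0
    kept = []
    for lp, sp, adv in zip(learner_logprobs, sampler_logprobs, advantages):
        ratio = math.exp(lp - sp)          # importance weight for this token
        keep = low <= ratio <= high        # mask out extreme mismatches
        kept.append(keep)
        if keep:
            total += -ratio * adv * lp     # REINFORCE term, reweighted
    return total / len(learner_logprobs), kept


# Three toy tokens; the middle one has a 4x train-inference mismatch.
learner_lp = [math.log(p) for p in (0.5, 0.4, 0.3)]
sampler_lp = [math.log(p) for p in (0.5, 0.1, 0.3)]
token_advantages = [1.0, 1.0, -1.0]

loss, keep_mask = masked_is_reinforce_loss(learner_lp, sampler_lp, token_advantages)
print("keep mask:", keep_mask)
print("loss:", loss)
```

The appeal of this design is that it adds only the mask thresholds as knobs, in line with the preference above for the simplest choice that works.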

[00:25:49]All right, so far I've covered a bunch of things. What's next? One thing I would say is that by focusing directly on advancing science, different sets of machine learning infra and research questions rise to prominence. One thing we think a lot about: say the LLM proposes a material synthesis experiment. RL can then be used with verification at different fidelity levels. For example, you can have neural physics models that can give you a signal in minutes. This is very fast but very approximate. Then you can use physics simulations that can take several minutes, sometimes even hours.

[00:26:28]This is more accurate but still somewhat approximate. The final, real signal comes from actually trying to synthesize the material in the lab, where you get the ground-truth signal from the experimental results themselves, and this will take multiple hours.

[00:26:44]In this example, simulations and physical laboratories as reward functions are very high latency, compared to, say, the neural physics model or a typical reward model people use. There's really no way around this, in the sense that many interesting questions in life aren't instantly answerable, so you will find a lot of domains and questions where you face this kind of thing. One thing needed for this is async stacks that are resilient to very, very stale off-policy samples, which is key for achieving high throughput. The other thing that will come into play is that we have to worry about data efficiency again, because we can only do so many experiments in the lab. It's not going to be at the scale of math or code, where we can run even hundreds of thousands of rollouts and get a reward for any of them. That's just not true in this domain.

[00:27:43]And finally, we started with the analysis step, but the real prize is the hypothesis step. That is the holy grail for LLMs in science: can an AI look at experimental results, synthesize what it has learned, and propose genuinely novel hypotheses, not just interpolating from known materials but actually making some sort of creative leap, like what if we tried this unusual combination of elements at this temperature, and so on. This is where I think RL gets really exciting. Right now our models are mostly learning to analyze data better, but the next step is training them to propose better experiments and to close the full loop autonomously. The reward signal would be things like whether your proposed recipe actually produced an interesting material, and there are other ways to think about this. All right, I think this is a good point to end my talk. Let me know if you have any questions; I'll be available to answer some of them.

[00:28:39]Yeah.
