Episode #490 from 1:37:18
Post-training explained: Exciting new research directions in LLMs
Yeah, there is a sense that we, together as a civilization and each individually, have to find that Goldilocks zone, and, in the programming context, as developers.

Now, we've had this fascinating conversation that started with pre-training and mid-training. Let's get to post-training. There's a lot of fun stuff in post-training. So, what are some of the interesting ideas in post-training?

The biggest one from 2025 is reinforcement learning with verifiable rewards, RLVR. You can scale up the training there, which means running a lot of this kind of iterative generate-grade loop, and that lets the models learn interesting behaviors on both the tool-use and software side. This could be searching, or running commands on their own and seeing the outputs. That training also enables inference-time scaling very nicely. It just turned out that this paradigm was nicely linked: this kind of RL training enables inference-time scaling, but inference-time scaling could have been found in different ways. So it was kind of a perfect storm where the models changed a lot, and the way they're trained is a major factor in that.
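To make the "generate-grade loop" concrete, here is a minimal sketch of one RLVR iteration. Everything in it is illustrative: the `generate`, `verifiable_reward`, and `rlvr_step` names and the toy arithmetic task are hypothetical stand-ins for a real policy (the LLM) and a real programmatic grader (unit tests, a command's exit code, an answer checker), not any lab's actual training code.

```python
# A minimal sketch of one RLVR generate-grade iteration (illustrative only).
import random

def verifiable_reward(prompt: str, completion: str) -> float:
    """Grade a completion with a programmatic check (here: exact arithmetic).
    Because the check is automatic, the loop can be scaled up cheaply."""
    a, b = map(int, prompt.split("+"))
    try:
        return 1.0 if int(completion.strip()) == a + b else 0.0
    except ValueError:
        return 0.0  # unparseable answers get zero reward

def generate(prompt: str, n_samples: int = 4) -> list[str]:
    """Stand-in for sampling n completions from the current policy (the LLM).
    Here we just perturb the true answer to simulate a sometimes-wrong model."""
    a, b = map(int, prompt.split("+"))
    return [str(a + b + random.choice([-1, 0, 0, 1])) for _ in range(n_samples)]

def rlvr_step(prompts: list[str]) -> list[tuple[str, str, float]]:
    """One iteration: generate candidates, grade each one, and collect
    (prompt, completion, reward) tuples. A real system would feed these
    into a policy-gradient update (e.g., PPO- or GRPO-style) and repeat."""
    batch = []
    for p in prompts:
        for completion in generate(p):
            batch.append((p, completion, verifiable_reward(p, completion)))
    return batch

for prompt, completion, reward in rlvr_step(["2+2", "13+8"]):
    print(f"{prompt} -> {completion!r}: reward={reward}")
```

The point of the sketch is the shape of the loop: because the grader is a program rather than a human, you can run this generate-grade cycle at very large scale, which is what lets the model pick up tool-use behaviors and longer reasoning traces.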