Episode #459 from 2:55:23
OpenAI o3-mini vs DeepSeek r1
All right. So, we have fun things happening in real time. This is a good opportunity to talk about the other reasoning models, o1 and o3. Just now, as perhaps expected, OpenAI released o3-mini. What are we expecting from the different flavors? Can you lay out the different flavors of the o models, and the reasoning model from Gemini?

Something I would say about these reasoning models is that we talked a lot about reasoning training on math and code. What is done is that you take the base model, the one we've talked about a lot that is trained on the internet, and you do this large-scale reasoning training with reinforcement learning. Then, and this is what the DeepSeek paper detailed in the R1 paper, which for me answered one of the big open questions, how you actually do this, they did reasoning-heavy but otherwise very standard post-training techniques after the large-scale reasoning RL. So they did the same things: a form of instruction tuning through rejection sampling, which is essentially heavily filtered instruction tuning with some reward models. And then they did RLHF, but they made it math-heavy.
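To make the rejection sampling step concrete, here is a minimal Python sketch of filtered instruction tuning data collection. The `generate` and `reward` callables are hypothetical stand-ins for the policy model and the reward model; this illustrates the general technique, not DeepSeek's actual pipeline.

```python
# Hypothetical sketch of rejection sampling for post-training data curation
# (not DeepSeek's actual code): sample several completions per prompt, score
# them with a reward model, and keep only the best for instruction tuning.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    prompt: str
    completion: str

def rejection_sample(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # policy model: prompt, n -> n samples
    reward: Callable[[str, str], float],        # reward model: (prompt, completion) -> score
    n_samples: int = 16,
    threshold: float = 0.0,
) -> List[Example]:
    """Keep the highest-reward completion per prompt, if it clears a threshold."""
    kept: List[Example] = []
    for prompt in prompts:
        candidates = generate(prompt, n_samples)
        scored = [(reward(prompt, c), c) for c in candidates]
        best_score, best = max(scored)
        if best_score >= threshold:
            kept.append(Example(prompt, best))
    return kept

# The surviving (prompt, completion) pairs become supervised fine-tuning data,
# i.e. "heavily filtered instruction tuning" guided by the reward model.
```

The point of the filter is that the model only ever imitates its own highest-reward outputs, which is why this amounts to standard instruction tuning with the reward model acting as the gatekeeper; the math-heavy RLHF stage then runs on top of the model tuned on this data.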