Episode #447 from 1:48:39
OpenAI o1
Let me ask you about OpenAI o1. What do you think is the role of that kind of test-time compute system in programming?

I think test-time compute is really interesting. There's been the pre-training regime, which, as you scale up the amount of data and the size of your model, gets you better and better performance, both on loss and on downstream benchmarks and general performance when we use it for coding or other tasks. But we're starting to hit a bit of a data wall, meaning it's going to be hard to continue scaling up this regime. So scaling up test-time compute is an interesting alternative: as you increase the number of flops you use at inference time, you get corresponding improvements in the performance of these models. Traditionally, you just had to literally train a bigger model that always used that many more flops, but now you could perhaps use the same size model and run it for longer to get an answer at the quality of a much larger model.
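The idea described above — spending more inference-time flops with the same model rather than training a bigger one — can be illustrated with a simple self-consistency scheme: sample many answers and take a majority vote. This is a minimal sketch with a stubbed "model" (the accuracy numbers and answer strings are invented for illustration; o1's actual mechanism is internal chain-of-thought, not voting).

```python
import random
from collections import Counter

def sample_answer(rng):
    """Stub standing in for one stochastic model call.

    Assumed for illustration: the model answers correctly 70% of the
    time, otherwise returns one of a few wrong answers.
    """
    if rng.random() < 0.7:
        return "42"
    return rng.choice(["41", "43", "44"])

def self_consistency(n_samples, seed=0):
    """Spend more test-time compute: draw n_samples answers, majority-vote.

    A single sample is right only 70% of the time; voting over many
    samples pushes accuracy much higher, trading inference flops for
    quality instead of model size.
    """
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistency(1))    # one sample: unreliable
print(self_consistency(101))  # 101 samples: majority vote is very likely correct
```

The same trade-off shows up in real systems as best-of-N sampling, longer chains of thought, or search over reasoning steps; voting is just the simplest version to write down.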