Episode #447 from 36:54
GPT vs Claude
Well, let me ask the ridiculous question of which LLM is better at coding? GPT, Claude, who wins in the context of programming? And I'm sure the answer is much more nuanced, because it sounds like every single part of this involves a different model.

I think there's no model that Pareto-dominates the others, meaning it is better in all the categories we think matter, the categories being speed, ability to edit code, ability to process lots of code, long context, a couple of other things, and coding capabilities. The one I'd say is just net best right now is Sonnet. I think this is a consensus opinion. o1 is really interesting and it's really good at reasoning, so if you give it really hard, interview-style programming problems or LeetCode problems, it can do quite well on them, but it doesn't feel like it understands your rough intent as well as Sonnet does. With a lot of the other frontier models, one qualm I have is that, and I'm not saying they train on benchmarks, they perform really well on benchmarks relative to everything in between. So if you try them on all these benchmarks, on problems in the distribution they're evaluated on, they'll do really well. But when you push them a little bit outside of that, Sonnet is, I think, the one that best maintains that same capability: it has roughly the same capability on the benchmark as when you instruct it to do anything with coding.
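To make the Pareto-dominance idea concrete, here is a minimal sketch. The categories come from the conversation, but the scores and the dominance check are entirely hypothetical, for illustration only, not real benchmark results:

```python
# Sketch of Pareto dominance between models: model A dominates model B only
# if A is at least as good in every category and strictly better in at least
# one. All scores below are made up purely to illustrate the definition.

CATEGORIES = ["speed", "code_editing", "long_context", "hard_reasoning"]

def pareto_dominates(a: dict, b: dict) -> bool:
    """True if `a` is >= `b` in every category and > `b` in at least one."""
    at_least_as_good = all(a[c] >= b[c] for c in CATEGORIES)
    strictly_better = any(a[c] > b[c] for c in CATEGORIES)
    return at_least_as_good and strictly_better

# Hypothetical 0-10 scores -- NOT real measurements of either model.
scores = {
    "sonnet": {"speed": 8, "code_editing": 9, "long_context": 8, "hard_reasoning": 7},
    "o1":     {"speed": 4, "code_editing": 6, "long_context": 7, "hard_reasoning": 9},
}

for a in scores:
    for b in scores:
        if a != b and pareto_dominates(scores[a], scores[b]):
            print(f"{a} Pareto-dominates {b}")
# With these made-up numbers nothing is printed: each model wins somewhere,
# so neither dominates the other -- which is the speaker's point.
```

The takeaway is that "no Pareto dominator" just means every model loses in at least one category to some other model, so "which is best" only has an answer once you weight the categories.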