Episode #447
Introduction
0:00
The following is a conversation with the founding members of the Cursor team, Michael Truell, Sualeh Asif, Arvid Lunnemark, and Aman Sanger. Cursor is a code editor based on VS Code that adds a lot of powerful features for AI-assisted coding. It has captivated the attention and excitement of the programming and AI communities. So I thought this is an excellent opportunity to dive deep into the role of AI in programming. This is a super technical conversation that is bigger than just about one code editor. It's about the future of programming and in general, the future of human AI collaboration in designing and engineering complicated and powerful systems. This is the Lex Fridman podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Michael, Sualeh, Arvid and Aman.
Code editor basics
0:59
All right, this is awesome. We have Michael, Aman, Sualeh, Arvid here from the Cursor team. First up, big ridiculous question. What's the point of a code editor? So the code editor is largely the place where you build software and today or for a long time, that's meant the place where you text edit a formal programming language. And for people who aren't programmers, the way to think of a code editor is a really souped up word processor for programmers, where the reason it's souped up is code has a lot of structure. And so the "word processor," the code editor can actually do a lot for you that word processors sort of in the writing space haven't been able to do for people editing texts there.
GitHub Copilot
3:09
So for people who don't know, Cursor is this super cool new editor that's a fork of VS Code. It would be interesting to get your explanation of your own journey of editors. I think all of you were big fans of VS Code with Copilot. How did you arrive to VS Code and how did that lead to your journey with Cursor? Yeah, so I think a lot of us... Well, all of us were originally [inaudible 00:03:39] users.
Cursor
10:27
Okay, so can we take it all the way to Cursor. And what is Cursor? It's a fork of VS Code, and VS Code has been one of the most popular editors for a long time. Everybody fell in love with it. Everybody left Vim, I left Emacs for it. Sorry. So it unified the developer community in some fundamental way. And then you look at the space of things, you look at the scaling laws, AI is becoming amazing, and you decided, okay, it's not enough to just write an extension for VS Code, because there are a lot of limitations to that. If AI is going to keep getting better and better and better, we need to really rethink how the AI is going to be part of the editing process. And so you decided to fork VS Code and start to build a lot of the amazing features we'll be able to talk about. But what was that decision like? Because there are a lot of extensions for VS Code, including Copilot, that are doing sort of AI-type stuff. What was the decision like to just fork VS Code? So the decision to do an editor seemed kind of self-evident to us, for at least what we wanted to do and achieve, because when we started working on the editor, the idea was these models are going to get much better, their capabilities are going to improve, and it's going to entirely change how you build software, both in that you will have big productivity gains, but also radically: the act of building software is going to change a lot. And so you're very limited in the control you have over a code editor if you're a plugin to an existing coding environment, and we didn't want to get locked in by those limitations. We wanted to be able to just build the most useful stuff.
Cursor Tab
16:54
One of the things we really wanted was we wanted the model to be able to edit code for us. That was kind of a wish, and we had multiple attempts at it before we had a good model that could edit code for you. Then after we had a good model, I think there's been a lot of effort to make the inference fast for having a good experience, and we've been starting to incorporate... I mean, Michael sort of mentioned this ability to jump to different places, and that jump to different places I think came from a feeling of, once you accept an edit, it's like, man, it should be just really obvious where to go next. It's like, I'd made this change, the model should just know that the next place to go to is 18 lines down. If you're a Vim user, you could press 18jj or whatever, but why am I doing this? The model should just know it. So the idea was, you just press Tab, it would go 18 lines down and then show you the next edit, and you would press Tab, so as long as you could keep pressing Tab. And so the internal competition was, how many Tabs can we make someone press? Once you have the idea, more abstractly, the thing to think about is, how are the edits zero-entropy? So once you've expressed your intent and the edit is... There are no new bits of information to finish your thought, but you still have to type some characters to make the computer understand what you're actually thinking, then maybe the model should just sort of read your mind, and all the zero-entropy bits should just be tabbed away. That was sort of the abstract version.
Code diff
23:08
As we're talking about this, I should mention one of the really cool and noticeable things about Cursor is that there's this whole diff interface situation going on. So the model suggests, with the red and the green, here's how we're going to modify the code, and in the chat window you can apply and it shows you the diff and you can accept the diff. So maybe you can speak to whatever direction of that? We'll probably have four or five different kinds of diffs. So we have optimized the diff for the autocomplete, so that has a different diff interface than when you're reviewing larger blocks of code. And then we're trying to optimize another diff thing for when you're doing multiple different files. And at a high level, the difference is, when you're doing autocomplete, it should be really, really fast to read. Actually, it should be really fast to read in all situations, but in autocomplete your eyes are focused in one area, you can't be in too many... Humans can't look in too many different places.
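For readers who want a concrete picture, a red/green diff view like the one described here is, at bottom, a rendering of line-level diff opcodes. A minimal sketch (purely illustrative, not Cursor's actual renderer) using Python's standard difflib:

```python
import difflib

def line_diff(old: str, new: str):
    """Return (tag, line) pairs: '-' for removed (red), '+' for added (green),
    ' ' for unchanged context lines."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    out = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            out.extend(("-", line) for line in old_lines[i1:i2])
        if tag in ("replace", "insert"):
            out.extend(("+", line) for line in new_lines[j1:j2])
        if tag == "equal":
            out.extend((" ", line) for line in old_lines[i1:i2])
    return out

diff = line_diff("a = 1\nb = 2\n", "a = 1\nb = 3\n")
```

A real diff UI would then color and lay out these pairs; the point is only that the red/green presentation sits on top of a simple line-level structure.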
ML details
31:20
I'm really feeling the AGI with this editor. It feels like there's a lot of machine learning going on underneath. Tell me about some of the ML stuff that makes it all work. Cursor really works via this ensemble of custom models that we've trained alongside the frontier models that are fantastic at the reasoning-intense things. And so Cursor Tab, for example, is a great example of where you can specialize this model to be even better than frontier models if you look at evals on the task we set it at. The other domain, where it's surprising that it requires custom models but it's necessary and works quite well, is in Apply. So I think these models are... The frontier models are quite good at sketching out plans for code and generating rough sketches of the change, but actually creating diffs is quite hard for frontier models. You try to do this with Sonnet, with o1, any frontier model, and it really messes up stupid things like counting line numbers, especially in super, super large files. And so what we've done to alleviate this is we let the model sketch out this rough code block that indicates what the change will be, and we train a model to then apply that change to the file.
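To make the Apply idea concrete: the frontier model emits a rough sketch of the change, and a second step grounds that sketch in the actual file. The toy function below is only illustrative of the shape of that operation; it requires an exact match, whereas the trained apply model described here handles approximate sketches robustly:

```python
def apply_edit(file_text: str, original: str, replacement: str) -> str:
    """Apply a sketched change by locating the original snippet and swapping
    in the replacement. A trained apply model does this even when the sketch
    is rough; this toy version needs the snippet verbatim."""
    if original not in file_text:
        raise ValueError("sketched snippet not found in file")
    return file_text.replace(original, replacement, 1)
```

Notice that the apply step never asks the model to count line numbers, which is exactly the failure mode described above.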
GPT vs Claude
36:54
Well, let me ask the ridiculous question of which LLM is better at coding? GPT, Claude, who wins in the context of programming? And I'm sure the answer is much more nuanced because it sounds like every single part of this involves a different model. I think there's no model that Pareto-dominates others, meaning it is better in all categories that we think matter, the categories being speed, ability to edit code, ability to process lots of code, long context, a couple of other things, and coding capabilities. The one that I'd say right now is just net best is Sonnet. I think this is a consensus opinion. o1's really interesting and it's really good at reasoning. So if you give it really hard programming-interview-style problems or LeetCode problems, it can do quite well on them, but it doesn't feel like it understands your rough intent as well as Sonnet does. If you look at a lot of the other frontier models, one qualm I have is, it feels like they're not necessarily over... I'm not saying they train on benchmarks, but they perform really well on benchmarks relative to everything that's in the middle. So if you try them on all these benchmarks and things that are in the distribution of the benchmarks they're evaluated on, they'll do really well. But when you push them a little bit outside of that, Sonnet is I think the one that does best at maintaining that same capability. It has the same capability on the benchmark as when you try to instruct it to do anything with coding.
Prompt engineering
43:28
What's the role of a good prompt in all of this? We mentioned that benchmarks have really structured, well-formulated prompts. What should a human be doing to maximize success, and what's the importance of what the human does... You wrote a blog post on... You called it Prompt Design. Yeah, I think it depends on which model you're using, and all of them are slightly different and they respond differently to different prompts, but I think the original GPT-4 and the original [inaudible 00:44:07] models last year were quite sensitive to the prompts, and they also had a very small context window. And so we have all of these pieces of information around the code base that would maybe be relevant in the prompt. You have the docs, you have the files that you add, you have the conversation history, and then there's a problem of how you decide what you actually put in the prompt when you have limited space. And even for today's models, even when you have long context, filling out the entire context window means that it's slower. It means that sometimes the model actually gets confused, and some models get more confused than others.
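The problem described here, deciding which pieces of context make it into a limited prompt, can be framed as budgeted selection. A deliberately simplified sketch (the item names, token counts, and priorities below are made up for illustration, not taken from Cursor):

```python
def pack_context(items, budget_tokens):
    """items: list of (name, token_cost, priority). Greedily fill the prompt
    budget with the highest-priority context pieces that still fit."""
    chosen, used = [], 0
    for name, tokens, priority in sorted(items, key=lambda it: -it[2]):
        if used + tokens <= budget_tokens:
            chosen.append(name)
            used += tokens
    return chosen

picked = pack_context(
    [("current_file", 500, 3), ("docs", 800, 1), ("chat_history", 400, 2)],
    budget_tokens=1000,
)
```

A real system would score relevance with retrieval rather than fixed priorities, and, as noted above, might deliberately leave budget unused to keep the model fast and unconfused.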
AI agents
50:54
To what degree do you use agentic approaches? How useful are agents? We think agents are really, really cool.
Running code in background
1:04:51
Arvid, you wrote a blog post Shadow Workspace: Iterating on Code in the Background. So what's going on [inaudible 01:04:59]? So to be clear, we want there to be a lot of stuff happening in the background, and we're experimenting with a lot of things. Right now, we don't have much stuff happening other than the cache warming or figuring out the right context that goes into your command key prompts for example. But the idea is if you can actually spend computation in the background, then you can help the user maybe at a slightly longer time horizon than just predicting the next few lines that you're going to make. But actually in the next 10 minutes, what are you going to make? And by doing it in background, you can spend more computation doing that. And so the idea of the Shadow Workspace that we implemented, and we use it internally for experiments is that to actually get advantage of doing stuff in the background, you want some kind of feedback signal to give back to the model because otherwise you can get higher performance by just letting the model think for longer, and so o1 is a good example of that.
Debugging
1:09:31
That's such an exciting future by the way. It's a bit of a tangent, but to allow a model to change files, it's scary for people, but it's really cool to be able to just let the agent do a set of tasks, and you come back the next day and kind of observe, like it's a colleague or something like that. And I think there may be different versions of runnability where, for the simple things, where you're doing things in the span of a few minutes on behalf of the user as they're programming, it makes sense to make something work locally on their machine. I think for the more aggressive things, where you're making larger changes that take longer periods of time, you'll probably want to do this in some sandboxed remote environment, and that's another incredibly tricky problem: how do you exactly reproduce, or mostly reproduce to the point of it being effectively equivalent for running code, the user's environment with this remote sandbox?
Dangerous code
1:14:58
I mean, but this is hard for humans too, to understand which line of code is important and which is not. I think one of the principles on your website says, if code can do a lot of damage, one should add a comment that says, "This line of code is dangerous." And all caps, repeated 10 times.
Branching file systems
1:26:09
How much interaction is there between the terminal and the code? How much information is gained from running the code in the terminal? Can you do a loop where it runs the code and suggests how to change the code if the code at runtime gets an error? Are they right now completely separate worlds? I know you can do Ctrl+K inside the terminal to help you write the code. You can use terminal context as well, inside of Command K, kind of everything. We don't have the looping part yet, though we suspect something like this could make a lot of sense. There's a question of whether it happens in the foreground too, or if it happens in the background, like what we've been discussing.
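Since the looping part doesn't exist yet, the following is only a sketch of what such a run-and-fix loop might look like, with the model call stubbed out as a hypothetical suggest_fix function:

```python
import subprocess
import sys

def run_and_capture(code: str) -> str:
    """Run a Python snippet in a subprocess and return its stderr, the
    feedback signal a model could use to propose a fix."""
    proc = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )
    return proc.stderr

def fix_loop(code: str, suggest_fix, max_iters: int = 3) -> str:
    """Repeatedly run the code; on error, ask the (hypothetical) model for a
    revised version, feeding it the error text."""
    for _ in range(max_iters):
        err = run_and_capture(code)
        if not err:
            return code  # runs cleanly
        code = suggest_fix(code, err)  # placeholder for a model call
    return code
```

As the discussion notes, such a loop could also run in the background rather than in the user's foreground session.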
Scaling challenges
1:29:20
Okay. Are there some interesting challenges to... You guys are a pretty new startup... to scaling to so many people? Yeah, I think that it has been an interesting journey adding each extra zero to the requests per second. You run into all of these issues where the general components you're using for caching and databases run into problems as you make things bigger and bigger, and now we're at the scale where we get overflows on our tables and things like that. And then there have also been some custom systems that we've built. For instance, our retrieval system for computing a semantic index of your code base and answering questions about a code base has continually, I feel like, been one of the trickier things to scale.
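One common flavor of "overflows on our tables" (assuming here that it refers to fixed-width integer columns, which the speakers don't specify) is an auto-increment ID approaching the signed 32-bit limit; the numbers below are hypothetical:

```python
INT32_MAX = 2**31 - 1  # ~2.1 billion: where a signed 32-bit ID column tops out

def days_until_overflow(current_max_id: int, daily_growth: int) -> float:
    """Days of headroom left on an int32 auto-increment column, assuming
    (hypothetically) linear growth in rows per day."""
    return (INT32_MAX - current_max_id) / daily_growth

days = days_until_overflow(current_max_id=2_000_000_000, daily_growth=5_000_000)
```

The usual fix is migrating the column to a 64-bit type before the counter gets there, which is exactly the kind of issue that only appears after adding a few zeros to the request rate.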
Context
1:43:32
Yeah. On the topic of context, that's actually been a friction for me. When I'm writing code in Python, there's a bunch of stuff imported. You could probably intuit the kind of stuff I would like to include in the context. How hard is it to auto-figure out the context? It's tricky. I think we can do a lot better at computing the context automatically in the future. One thing that's important to note is, there are trade-offs with including automatic context. So the more context you include for these models, first of all, the slower they are and the more expensive those requests are, which means you can then do fewer model calls and do less fancy stuff in the background. Also, for a lot of these models, they get confused if you have a lot of information in the prompt. So the bar for accuracy and for relevance of the context you include should be quite high. Already, we do some automatic context in some places within the product. It's definitely something we want to get a lot better at. I think that there are a lot of cool ideas to try there, both on learning better retrieval systems, like better embedding models and better rerankers.
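The retrieval direction mentioned here, embedding models plus rerankers, boils down to scoring candidate chunks against a query. A toy sketch, with a bag-of-words "embedding" standing in for a learned model and made-up code chunks:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system uses a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank candidate context chunks by similarity to the query; a reranker
    would then re-score this shortlist more carefully."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

top = retrieve(
    "parse json config",
    ["def parse_config(path): ...",
     "def render_html(page): ...",
     "def load_json(path): ..."],
)
```

The trade-off discussed above still applies: even good retrieval must clear a high relevance bar, because every included chunk costs latency and risks confusing the model.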
OpenAI o1
1:48:39
Let me ask you about OpenAI o1. What do you think is the role of that kind of test-time compute system in programming? I think test-time compute is really, really interesting. So there's been the pre-training regime, which will, as you scale up the amount of data and the size of your model, get you better and better performance, both on loss and then on downstream benchmarks and just general performance, whether we use it for coding or other tasks. We're starting to hit a bit of a data wall, meaning it's going to be hard to continue scaling up this regime. So scaling up test-time compute is an interesting way of now increasing the number of inference-time flops that we use, but still getting corresponding improvements in the performance of these models as you increase the number of flops you use at inference time. Traditionally, we just had to literally train a bigger model that always used that many more flops, but now, we could perhaps use the same-size model and run it for longer to be able to get an answer at the quality of a much larger model.
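One simple, well-known form of test-time compute scaling (not necessarily what o1 does internally) is sampling many candidate answers from the same model and taking a majority vote:

```python
from collections import Counter

def majority_vote(sample, prompt, n):
    """Spend more inference-time flops by drawing n candidate answers and
    returning the most common one: a simple best-of-n style scheme."""
    answers = [sample(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Deterministic stand-in for a stochastic model: 5 of 7 samples say "4".
samples = iter(["4", "5", "4", "4", "5", "4", "4"])
answer = majority_vote(lambda prompt: next(samples), "What is 2+2?", n=7)
```

The same-size model is run n times instead of once, trading inference compute for answer quality, which is the trade described in the passage.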
Synthetic data
2:00:01
All right, from that profound answer, let's descend back down to the technical. You mentioned you have a taxonomy of synthetic data.
RLHF vs RLAIF
2:03:48
What about the RL-with-feedback side, RLHF versus RLAIF? What's the role of that in getting better performance out of the models? Yeah. So RLHF is when the reward model you use is trained from labels you've collected from humans giving feedback. I think this works if you have the ability to get a ton of human feedback for the kind of task that you care about.
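For concreteness, reward models of the kind described here are commonly trained with a Bradley-Terry pairwise loss over human preference pairs (a standard RLHF formulation, not something specific to Cursor); a minimal sketch:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for reward modeling: push the reward of the
    human-preferred response above the reward of the rejected one.
    Loss = -log(sigmoid(r_chosen - r_rejected))."""
    return -math.log(sigmoid(r_chosen - r_rejected))

agree = pairwise_loss(2.0, -1.0)     # model agrees with the human label
disagree = pairwise_loss(-1.0, 2.0)  # model disagrees: much higher loss
```

In RLAIF, the same machinery applies but the preference labels come from an AI judge rather than from humans.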
Fields Medal for AI
2:05:34
What's your intuition when you compare generation and verification, or generation and ranking? Is ranking way easier than generation? My intuition would just say, yeah, it should be. This is going back to... Like, if you believe P does not equal NP, then there's this massive class of problems that are much, much easier to verify given a proof than to actually prove.
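A classic illustration of that asymmetry: checking a claimed satisfying assignment for a Boolean formula takes linear time, while finding one is NP-hard.

```python
def verify_sat(clauses, assignment) -> bool:
    """Verify a claimed SAT solution in linear time. Clauses are lists of
    signed ints: [1, -2] means (x1 OR NOT x2). The assignment maps variable
    index to True/False. Finding such an assignment is the NP-hard part."""
    return all(
        any((lit > 0) == assignment[abs(lit)] for lit in clause)
        for clause in clauses
    )

# (x1 OR x2) AND (NOT x1 OR x3)
clauses = [[1, 2], [-1, 3]]
ok = verify_sat(clauses, {1: True, 2: False, 3: True})
```

The verifier only scans each literal once; the generator, in the worst case, searches an exponential space of assignments.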
Scaling laws
2:08:17
Speaking of how fast things have been going, let's talk about scaling laws. So for people who don't know, maybe it's good to talk about this whole idea of scaling laws. What are they, where do things stand, and where do you think things are going? I think it was interesting. The original scaling laws paper by OpenAI was slightly wrong, because of some issues they had with learning rate schedules. And then Chinchilla showed a more correct version. And then from then, people have again deviated from doing the compute-optimal thing, because people are now optimizing more for making the thing work really well given an inference budget.
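To make the compute-optimal idea concrete: with the common approximation that training compute C ≈ 6·N·D flops (N parameters, D tokens) and Chinchilla's finding that the optimal data scales as roughly D ≈ 20·N tokens, the split follows directly from the budget (constants are rough):

```python
def chinchilla_optimal(compute_flops: float):
    """Approximate compute-optimal split from the Chinchilla result:
    C ≈ 6*N*D and D ≈ 20*N, so C ≈ 120*N**2, giving
    N = sqrt(C/120) parameters and D = 20*N tokens."""
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget: 6 * 70e9 params * 1.4e12 tokens ≈ 5.88e23 flops
n, d = chinchilla_optimal(5.88e23)
```

The deviation mentioned above, training smaller models on far more than 20 tokens per parameter, spends extra training compute to buy cheaper inference.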
The future of programming
2:17:06
But also, these big labs like winning. So they're just going wild. Okay, so big question, looking out into the future: You're now at the center of the programming world. How do you think programming, the nature of programming changes in the next few months, in the next year, in the next two years and the next five years, 10 years? I think we're really excited about a future where the programmer is in the driver's seat for a long time. And you've heard us talk about this a little bit, but one that emphasizes speed and agency for the programmer and control. The ability to modify anything you want to modify, the ability to iterate really fast on what you're building. And this is a little different, I think, than where some people are jumping to in the space, where I think one idea that's captivated people, is can you talk to your computer? Can you have it build software for you? As if you're talking to an engineering department or an engineer over Slack. And can it just be this sort of isolated text box? And part of the reason we're not excited about that, is some of the stuff we've talked about with latency, but then a big piece, a reason we're not excited about that, is because that comes with giving up a lot of control.