Episode #416
Introduction
0:00
I see the danger of this concentration of power through proprietary AI systems as a much bigger danger than everything else. What works against this is people who think that for reasons of security, we should keep AI systems under lock and key because it's too dangerous to put them in the hands of everybody. That would lead to a very bad future in which all of our information diet is controlled by a small number of companies through proprietary systems. I believe that people are fundamentally good, and so if AI, especially open source AI, can make them smarter, it just empowers the goodness in humans.
Limits of LLMs
2:18
At this moment of rapid AI development, this happens to be a somewhat controversial position, and so it's been fun seeing Yann get into a lot of intense and fascinating discussions online, as we do in this very conversation. This is the Lex Fridman podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Yann LeCun. You've had some strong statements, technical statements, about the future of artificial intelligence throughout your career actually, but recently as well. You've said that autoregressive LLMs are not the way we're going to make progress towards superhuman intelligence. These are the large language models like GPT-4, like Llama 2, and Llama 3 soon, and so on. How do they work, and why are they not going to take us all the way? For a number of reasons. The first is that there is a number of characteristics of intelligent behavior. For example, the capacity to understand the world, understand the physical world; the ability to remember and retrieve things, persistent memory; the ability to reason; and the ability to plan. Those are four essential characteristics of intelligent systems or entities: humans, animals. LLMs can do none of those, or they can only do them in a very primitive way. They don't really understand the physical world. They don't really have persistent memory. They can't really reason, and they certainly can't plan. And so if you expect a system to become intelligent without having the possibility of doing those things, you're making a mistake. That is not to say that autoregressive LLMs are not useful. They're certainly useful. That's not to say that they're not interesting, or that we can't build a whole ecosystem of applications around them. Of course we can. But as a path towards human-level intelligence, they're missing essential components.
Bilingualism and thinking
13:54
That's called autoregressive prediction, which is why those LLMs should be called autoregressive LLMs, but we just call them LLMs. And there is a difference between this kind of process and a process by which, before producing a word... When you and I talk, you and I are bilingual, we think about what we're going to say, and it's relatively independent of the language in which we're going to say it. When we talk about, I don't know, let's say a mathematical concept or something, the kind of thinking that we're doing and the answer that we're planning to produce is not linked to whether we're going to say it in French or Russian or English. Chomsky just rolled his eyes, but I understand. So you're saying that there's a bigger abstraction that goes before language and maps onto language?
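To make that autoregressive mechanic concrete, here is a minimal decoding-loop sketch in Python. Everything here is illustrative: `model` is assumed to be any network that maps a token sequence to next-token logits, not any specific LLM.

```python
import torch

def generate(model, tokens, n_new, temperature=1.0):
    """Autoregressive decoding: sample one token at a time, each
    conditioned on everything produced so far, then feed it back in."""
    for _ in range(n_new):
        logits = model(tokens)                          # (seq_len, vocab_size)
        probs = torch.softmax(logits[-1] / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token])        # append and repeat
    return tokens
```

The point of the sketch is the loop itself: there is no stage where the system deliberates in an abstract, language-independent space before committing to words.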
Video prediction
17:46
So it really goes to the... I think the fundamental question is, can you build a really complete world model? Not complete, but one that has a deep understanding of the world. Yeah. So can you build this, first of all, by prediction? And the answer is probably yes. Can you build it by predicting words? And the answer is most probably no, because language is very poor in terms of bandwidth, weak or low bandwidth if you want; there's just not enough information there. So building world models means observing the world and understanding why the world is evolving the way it is, and then the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take.
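As a sketch of what that "extra component" means operationally, here is a minimal, hypothetical interface in Python; the names and shapes are mine, purely for illustration, not anything from Meta's code.

```python
from typing import Protocol, Sequence
import torch

class WorldModel(Protocol):
    """Hypothetical interface: predict how an abstract state representation
    evolves as a consequence of an action (not pixel-level prediction)."""
    def predict(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor: ...

def rollout(model: WorldModel, state: torch.Tensor,
            actions: Sequence[torch.Tensor]) -> list[torch.Tensor]:
    """Imagine the consequences of a candidate sequence of actions."""
    states = [state]
    for action in actions:
        state = model.predict(state, action)
        states.append(state)
    return states
```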
JEPA (Joint-Embedding Predictive Architecture)
25:07
What is joint embedding? What are these architectures that you're so excited about? Okay, so now, instead of training a system to encode the image and then training it to reconstruct the full image from a corrupted version, you take the full image, you take the corrupted or transformed version, and you run them both through encoders, which in general are identical, but not necessarily. And then you train a predictor on top of those encoders to predict the representation of the full input from the representation of the corrupted one. So, joint embedding, because you're taking the full input and the corrupted version or transformed version, running them both through encoders, you get a joint embedding. And then you're saying, can I predict the representation of the full one from the representation of the corrupted one?
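A minimal sketch of that objective, assuming PyTorch; the stop-gradient target encoder here is one common anti-collapse choice (for example, an EMA copy of the encoder), not necessarily the exact recipe of any specific JEPA paper.

```python
import torch
import torch.nn.functional as F

def jepa_loss(encoder, target_encoder, predictor, x_full, x_corrupted):
    """Joint-embedding predictive objective: predict the representation
    of the full input from the representation of the corrupted one."""
    with torch.no_grad():
        target = target_encoder(x_full)   # representation of the full input
    z = encoder(x_corrupted)              # representation of the corrupted input
    prediction = predictor(z)             # prediction in representation space
    return F.mse_loss(prediction, target) # compare representations, not pixels
```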
JEPA vs LLMs
28:15
So what is the fundamental difference between joint embedding architectures and LLMs? Can JEPA take us to AGI? Well, I should say that you don't like the term AGI, and we'll probably argue; I think every single time I've talked to you, we've argued about the G in AGI. Yes.
DINO and I-JEPA
37:31
So what kind of data are we talking about here? So there are several scenarios. One scenario is you take an image and corrupt it by changing the cropping, for example, changing the size a little bit, maybe changing the orientation, blurring it, changing the colors, doing all kinds of horrible things to it.
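For concreteness, one plausible corruption pipeline of that kind, using torchvision; the exact transforms and settings are illustrative, not the ones from DINO or any particular paper.

```python
from torchvision import transforms

# Random cropping/resizing, orientation change, blur, and color changes:
# "all kinds of horrible things" done to the image before encoding it.
corrupt = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.3, 1.0)),
    transforms.RandomRotation(degrees=15),
    transforms.GaussianBlur(kernel_size=9),
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),
])
```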
V-JEPA
38:51
So that's the I-JEPA. It doesn't need to know that it's an image, for example, because the only thing it needs to know is how to do this masking. Whereas with DINO, you need to know it's an image, because you need to do things like geometric transformations and blurring and things like that, that are really image-specific. A more recent version of this that we have is called V-JEPA. It's basically the same idea as I-JEPA, except it's applied to video. So now you take a whole video and you mask a whole chunk of it. And what we mask is actually kind of a temporal tube, so a whole segment of each frame in the video over the entire video. And that tube is statically positioned throughout the frames; literally, it's a straight tube.
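A toy sketch of that static tube masking, assuming a video tensor of shape (T, C, H, W); the zero-fill and coordinates are illustrative, since real V-JEPA masking operates on patches and the prediction happens in representation space.

```python
import torch

def tube_mask(video: torch.Tensor, y0: int, x0: int, h: int, w: int) -> torch.Tensor:
    """Mask a static spatiotemporal 'tube': remove the same spatial
    region from every frame. video has shape (T, C, H, W)."""
    masked = video.clone()
    masked[:, :, y0:y0 + h, x0:x0 + w] = 0.0  # same region across all T frames
    return masked
```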
Hierarchical planning
44:22
So yes, for model predictive control. But you also often talk about hierarchical planning. Can hierarchical planning emerge from this somehow? Well, no, you will have to build a specific architecture to allow for hierarchical planning. So hierarchical planning is absolutely necessary if you want to plan complex actions. If I want to go from, let's say, New York to Paris, it's the example I use all the time, and I'm sitting in my office at NYU, my objective that I need to minimize is my distance to Paris, at a high level, a very abstract representation of my location. I would have to decompose this into two subgoals. First one is go to the airport; second one is catch a plane to Paris. Okay, so my subgoal is now going to the airport. My objective function is my distance to the airport. How do I go to the airport? Well, I have to go out in the street and hail a taxi, which you can do in New York.
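A toy rendering of that decomposition in Python; the goal table and the recursion are entirely illustrative, just to show subgoals becoming objectives for a lower-level planner.

```python
def decompose(goal: str) -> list[str]:
    """Hypothetical high-level planner: split an abstract goal into subgoals."""
    table = {
        "be in Paris": ["be at airport", "board plane to Paris"],
        "be at airport": ["hail taxi", "ride taxi to airport"],
    }
    return table.get(goal, [goal])  # primitive goals decompose to themselves

def plan(goal: str) -> list[str]:
    """Recursively expand goals until only primitive actions remain."""
    subgoals = decompose(goal)
    if subgoals == [goal]:
        return [goal]
    steps = []
    for subgoal in subgoals:
        steps.extend(plan(subgoal))
    return steps

print(plan("be in Paris"))
# ['hail taxi', 'ride taxi to airport', 'board plane to Paris']
```

Each subgoal then defines its own objective function for the level below: distance to the airport, and so on.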
Autoregressive LLMs
50:40
I would love to sort of linger on your skepticism around autoregressive LLMs. So one way I would like to test that skepticism is: everything you say makes a lot of sense, but if I apply everything you said today, and in general, to, I don't know, 10 years ago, maybe a little bit less, no, let's say three years ago, I wouldn't be able to predict the success of LLMs. So does it make sense to you that autoregressive LLMs are able to be so damn good? Yes.
AI hallucination
1:06:06
I think in one of your slides, you have this nice plot that is one of the ways you show that LLMs are limited. I wonder if you could talk about hallucinations from your perspective: why hallucinations happen from large language models, and to what degree is that a fundamental flaw of large language models? Right. So, because of the autoregressive prediction, every time an LLM produces a token or a word, there is some level of probability for that word to take you out of the set of reasonable answers. And if you assume, which is a very strong assumption, that those errors are independent across a sequence of tokens being produced, what that means is that every time you produce a token, the probability that you stay within the set of correct answers decreases, and it decreases exponentially.
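In symbols, this is a back-of-the-envelope reading of the argument; the per-token error rate here is an assumed constant, not a number from the episode. If each token independently has probability \( \varepsilon \) of leaving the set of reasonable answers, then

\[ P(\text{still correct after } n \text{ tokens}) = (1 - \varepsilon)^n \approx e^{-\varepsilon n}. \]

For example, even a small \( \varepsilon = 0.01 \) gives \( (0.99)^{500} \approx 0.007 \) after 500 tokens, which is the exponential drift being described.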
Reasoning in AI
1:11:30
The type of reasoning that takes place in an LLM is very, very primitive, and the reason you can tell it's primitive is because the amount of computation that is spent per token produced is constant. So if you ask a question, and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated. It's the size of the prediction network, with its 36 layers or 92 layers or whatever it is, multiplied by the number of tokens. That's it. And so essentially, it doesn't matter if the question being asked is simple to answer, complicated to answer, or impossible to answer because it's undecidable or something; the amount of computation the system will be able to devote to the answer is constant, or is proportional to the number of tokens produced in the answer. This is not the way we work. The way we reason is that when we're faced with a complex problem or a complex question, we spend more time trying to solve it and answer it because it's more difficult. There's a prediction element, there's an iterative element where you're adjusting your understanding of a thing by going over and over and over, there's a hierarchical element, and so on. Does this mean it's a fundamental flaw of LLMs, or does it mean that-
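A tiny illustration of that fixed budget in Python; the numbers are made up, only the scaling relationship matters.

```python
def decode_flops(n_tokens: int, n_layers: int, flops_per_layer_per_token: float) -> float:
    """Compute spent on an answer scales with answer length and model
    size only, never with the difficulty of the question."""
    return n_tokens * n_layers * flops_per_layer_per_token

easy = decode_flops(n_tokens=20, n_layers=92, flops_per_layer_per_token=1e9)
hard = decode_flops(n_tokens=20, n_layers=92, flops_per_layer_per_token=1e9)
assert easy == hard  # a trivial question and an undecidable one get the same budget
```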
Reinforcement learning
1:29:02
And then that classification system works really nicely, okay. Well, so to summarize, you recommend, in a spicy way that only Yann LeCun can, that we abandon generative models in favor of joint embedding architectures? Yes.
Woke AI
1:34:10
Now, a lot of people have been very critical of the recently released Google's Gemini 1.5 for being, essentially, in my words, super woke, in the negative connotation of that word. There are some almost hilariously absurd things that it does, like it modifies history, like generating images of a black George Washington, or, perhaps more seriously, something that you commented on Twitter, which is refusing to comment on or generate images or even descriptions of Tiananmen Square or the Tank Man, one of the most legendary protest images in history. Of course, these images are highly censored by the Chinese government, and therefore everybody started asking questions of, what is the process of designing these LLMs? What is the role of censorship, and all that kind of stuff? So you commented on Twitter saying that open source is the answer. Yeah.
Open source
1:43:48
AI and ideology
1:47:26
The fundamental criticism that Gemini is getting is that, as you point out, it's on the West Coast. Just to clarify, we're currently on the East Coast, where I would suppose Meta AI headquarters would be. So there are strong words about the West Coast. But I guess the issue that happens is, I think it's fair to say that most tech people have a political affiliation with the left wing. They lean left. So the problem that people are criticizing Gemini with is that, in that de-biasing process that you mentioned, the ideological lean becomes obvious. Is this something that could be escaped? You're saying open source is the only way. Yes.
Marc Andreessen
1:49:58
Yeah. Marc Andreessen just tweeted today. Let me do a TL;DR. The conclusion is: only startups and open source can avoid the issue that he's highlighting with big tech. He's asking, "Can Big Tech actually field generative AI products?" (1) Ever-escalating demands from internal activists, employee mobs, crazed executives, broken boards, pressure groups, extremist regulators, government agencies, the press, in quotes, "experts," and everything corrupting the output. (2) Constant risk of generating a bad answer, or drawing a bad picture, or rendering a bad video; who knows what it's going to say or do at any moment. (3) Legal exposure: product liability, slander, election law, many other things, and so on; anything that makes Congress mad. (4) Continuous attempts to tighten the grip on acceptable output, degrading the model, how good it actually is in terms of being usable and pleasant to use and effective and all that kind of stuff. (5) Publicity of bad text, images, video actually puts those examples into the training data for the next version, and so on. So he just highlights how difficult this is from all kinds of people being unhappy. As he says, you can't create a system that makes everybody happy.
Llama 3
1:57:56
Yeah, and Hans Moravec comes to mind once again. Just to linger on LLaMA, Mark announced that LLaMA 3 is coming out eventually. I don't think there's a release date, but what are you most excited about? First of all, LLaMA 2 that's already out there, and maybe the future LLaMA 3, 4, 5, 6, 10, just the future of open source under Meta? Well, a number of things. So there are going to be various versions of LLaMA that are improvements of previous LLaMAs: bigger, better, multimodal, things like that. Then in future generations, systems that are capable of planning, that really understand how the world works, maybe are trained from video so they have some world model, maybe capable of the type of reasoning and planning I was talking about earlier. How long is that going to take? When is the research that is going in that direction going to feed into the product line, if you want, of LLaMA? I don't know. I can't tell you. There are a few breakthroughs that we have to basically go through before we can get there, but you'll be able to monitor our progress because we publish our research. So last week we published the V-JEPA work, which is a first step towards training systems from video.
AGI
2:04:20
You often say that AGI is not coming soon, meaning not this year, not the next few years, potentially farther away. What's your basic intuition behind that? So first of all, it's not going to be an event. The idea, somehow, which is popularized by science fiction and Hollywood, that somehow somebody is going to discover the secret to AGI or human-level AI or AMI, whatever you want to call it, and then turn on a machine and then we have AGI, that's just not going to happen. It's not going to be an event. It's going to be gradual progress. Are we going to have systems that can learn from video how the world works and learn good representations? Yeah. Before we get them to the scale and performance that we observe in humans, it's going to take quite a while. It's not going to happen in one day. Are we going to get systems that can have large amounts of associative memory so they can remember stuff? Yeah, but same, it's not going to happen tomorrow. There are some basic techniques that need to be developed. We have a lot of them, but to get this to work together with a full system is another story.
AI doomers
2:08:48
So you push back against what are called AI doomers a lot. Can you explain their perspective and why you think they're wrong? Okay, so AI doomers imagine all kinds of catastrophe scenarios of how AI could escape our control and basically kill us all, and that relies on a whole bunch of assumptions that are mostly false. So the first assumption is that the emergence of superintelligence is going to be an event, that at some point we're going to figure out the secret and we'll turn on a machine that is superintelligent, and because we'd never done it before, it's going to take over the world and kill us all. That is false. It's not going to be an event. We're going to have systems that are as smart as a cat, that have all the characteristics of human-level intelligence, but their level of intelligence would be like a cat or a parrot maybe, or something. Then we're going to work our way up to make those things more intelligent. As we make them more intelligent, we're also going to put some guardrails in them and learn how to put some guardrails so they behave properly.
Joscha Bach
2:24:38
So let me ask you, on your, like I said, you do get a little bit flavorful on the internet. Joscha Bach tweeted something that you LOL'd at, in reference to HAL 9000. Quote: "I appreciate your argument and I fully understand your frustration, but whether the pod bay doors should be opened or closed is a complex and nuanced issue." So you're at the head of Meta AI. This is something that really worries me, that our AI overlords will speak down to us with corporate speak of this nature, and you resist that with your way of being. Is this something you can just comment on: working at a big company, how you can avoid the over-fearing, I suppose, where through caution you create harm? Yeah. Again, I think the answer to this is open source platforms, and then enabling a widely diverse set of people to build AI assistants that represent the diversity of cultures, opinions, languages, and value systems across the world, so that you're not bound to just be brainwashed by a particular way of thinking because of a single AI entity. So I think it's a really, really important question for society. And the problem I'm seeing is that, which is why I've been so vocal and sometimes a little sardonic about it-
Humanoid robots
2:28:51
Well, it'll be at the very least absurdly comedic. Okay. So since we talked about the physical reality, I'd love to ask your vision of the future with robots in this physical reality. So many of the kinds of intelligence that you've been speaking about would empower robots to be more effective collaborators with us humans. So since Tesla's Optimus team has been showing us some progress on humanoid robots, I think it really reinvigorated the whole industry that I think Boston Dynamics has been leading for a very, very long time. So now there's all kinds of companies: Figure AI, obviously Boston Dynamics, Unitree.
Hope for the future
2:38:00
Yeah, there's a lot involved. It's a super complex task and once again, we take it for granted. What hope do you have for the future of humanity? We're talking about so many exciting technologies, so many exciting possibilities. What gives you hope when you look out over the next 10, 20, 50, a hundred years? If you look at social media, there's wars going on, there's division, there's hatred, all this kind of stuff that's also part of humanity. But amidst all that, what gives you hope? I love that question. We can make humanity smarter with AI. AI basically will amplify human intelligence. It's as if every one of us will have a staff of smart AI assistants. They might be smarter than us. They'll do our bidding, perhaps execute a task in ways that are much better than we could do ourselves, because they'd be smarter than us. And so it's like everyone would be the boss of a staff of super smart virtual people. So we shouldn't feel threatened by this any more than we should feel threatened by being the manager of a group of people, some of whom are more intelligent than us. I certainly have a lot of experience with this, of having people working with me who are smarter than me.