Episode #459 from 4:37:49
DeepSeek-R1 and DeepSeek-V3
3:33
A lot of people are curious to understand China's DeepSeek AI models, so let's lay it out. Nathan, can you describe what DeepSeek-V3 and DeepSeek-R1 are, how they work, how they're trained? Let's look at the big picture, and then we'll zoom in on the details. DeepSeek-V3 is a new mixture-of-experts transformer language model from DeepSeek, which is based in China. They have some new specifics in the model that we'll get into. Largely, this is an open-weight model, and it's an instruction model like what you would use in ChatGPT. They also released what is called the base model, which is from before these post-training techniques are applied. Most people use instruction models today, and those are what's served in all sorts of applications. This was released on, I believe, December 26th, or that week. And then weeks later, on January 20th, DeepSeek released DeepSeek-R1, which is a reasoning model, which really accelerated a lot of this discussion.
Low cost of training
25:07
There are two main techniques they implemented that account for the majority of their efficiency, and then there are a lot of implementation details that maybe we'll gloss over or get into later that contribute to it. Those two main things are: one, they went to a mixture-of-experts model, which we'll define in a second; and two, they invented this new technique called MLA, multi-head latent attention. Both of these are big deals. Mixture of experts is something that's been in the literature for a handful of years, and OpenAI with GPT-4 was the first one to productize a mixture-of-experts model. What this means is, when you look at the common models that most people have been able to interact with that are open, think Llama. Llama is a dense model, i.e., every single parameter or neuron is activated as you're going through the model for every single token you generate.

Now, with a mixture-of-experts model, you don't do that. How does the human brain actually work? Well, my visual cortex is active when I'm thinking about vision tasks, and my amygdala is when I'm scared. These different aspects of your brain are focused on different things. A mixture-of-experts model attempts to approximate this to some extent. It's nowhere close to what a brain architecture actually is, but different portions of the model activate. You'll have a set number of experts in the model and a set number that are activated each time. And this dramatically reduces both your training and inference costs, because if you think about the parameter count as the total embedding space for all of the knowledge you're compressing down during training, then instead of having to activate every single parameter every single time you're training or running inference, you can activate just a subset, and the model will learn which expert to route to for different tasks.
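The routing idea described above can be sketched in a few lines of Python. Everything here is a toy assumption: the tiny linear "experts," the hand-written router weights, and the top-2 choice are stand-ins for DeepSeek-V3's actual configuration, which uses many more experts and a router trained end-to-end.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, router, top_k=2):
    # The router scores every expert, but only the top_k highest-scoring
    # experts are activated; the rest of the parameters stay idle.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    probs = softmax(scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        # Each "expert" here is just a small linear map, a toy stand-in
        # for the feed-forward block in a real transformer layer.
        expert_out = [sum(w * xi for w, xi in zip(row, x)) for row in experts[i]]
        out = [o + (probs[i] / norm) * e for o, e in zip(out, expert_out)]
    return out, chosen
```

The key point the sketch shows is the cost structure: compute scales with `top_k`, not with the total number of experts, which is why the parameter count can grow much faster than the per-token FLOPs.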
DeepSeek compute cluster
51:25
DeepSeek is very interesting. This is worth taking a second to zoom out on who they are, first of all, right? High-Flyer is a hedge fund that has historically done quantitative trading in China as well as elsewhere, and they have always had a significant number of GPUs, right? In the past, a lot of these high-frequency, algorithmic quant traders used FPGAs, but it definitely shifted to GPUs. There's both, but GPUs especially. And High-Flyer is the hedge fund that owns DeepSeek, and everyone who works for DeepSeek is part of High-Flyer to some extent. Same parent company, same owner, same CEO. They had all these resources and infrastructure for trading, and then they devoted a humongous portion of them to training models, both language models and otherwise, because these techniques were heavily AI-influenced.
Export controls on GPUs to China
58:57
Can you gentlemen actually just zoom out and also talk about the Hopper architecture, the Nvidia Hopper GPU architecture, and the difference between the H100 and H800, like you mentioned, the interconnects? Yeah. So Ampere was the A100, and then Hopper is the H100, right? People use them synonymously in the U.S. because really there's just the H100, and now there's the H200, right, but same thing mostly?
AGI timeline
1:09:16
So we're doing a depth-first search here on topics, taking a tangent of a tangent, so let's continue on that depth-first search. You said that you're both feeling the AGI. What's your timeline? Dario's is 2026 for the super-powerful AI that's basically agentic to a degree where it's a real security threat, that level of AGI. What's your timeline? I don't like to attribute specific abilities, because predicting specific abilities and when is very hard. Mostly, if I'm going to say that I'm feeling the AGI, it's that I expect continued, rapid, surprising progress over the next few years. So, something like R1 is less surprising to me from DeepSeek, because I expect there to be new paradigms versus ...
China's manufacturing capacity
1:18:41
And going back to my viewpoint: if you believe we're staying in the sort of stage of economic growth and change that we've been in for the last 20 years, i.e., if you do not believe AI is going to make significant changes to society in the next 10 or 5 years, then the export controls are absolutely guaranteeing that China will win long-term. Five-year timelines are sort of what the executives and such of AI companies, and even big tech companies, believe, but even 10-year timelines are reasonable. Once you get to, hey, these timelines are below that time period, then the only way to create a sizable advantage or disadvantage for America versus China is if you constrain compute, because talent is not really something that's constraining. China arguably has more talent, more STEM graduates, more programmers. The US can draw upon the world's people, which it does. There are tons of foreigners in the AI industry; so many of these AI teams are full of people without a US passport.
Cold war with China
1:26:36
Well, so you're saying that for now, Xi Jinping has not felt the AGI, but it feels like with the DeepSeek moment, there might be meetings going on now where he's going to start wearing the same t-shirt and things are going to escalate. I mean, he may have woken up last week. Liang Wenfeng met the second-in-command guy, and they had a meeting, and then the next day they announced the AI subsidies, which are a trillion RMB.
TSMC and Taiwan
1:31:05
So can you explain the role of TSMC in the story of semiconductors, and maybe also how the United States can break the reliance on TSMC? I don't think it's necessarily breaking the reliance; I think it's getting TSMC to build in the US. So, taking a step back, TSMC produces most of the world's chips, especially on the foundry side. There are a lot of companies that build their own chips: Samsung, Intel, STMicro, Texas Instruments, Analog Devices, NXP, all these kinds of companies build their own chips, but more and more of these companies are outsourcing to TSMC, and have been for multiple decades.
Best GPUs for AI
1:54:44
Can we go back to the specific detail of the different hardware? There's this nice graphic in the export controls of which GPUs are allowed to be exported and which are not. Can you explain the difference? From a technical perspective, are the H20s promising? Yeah. And I think we need to dive really deep into the reasoning aspect and what's going on there. The US has gone through multiple iterations of the export controls. The H800 was at one point allowed, back in '23, but then it got banned, and by then DeepSeek had already built their cluster of, they claim, 2,000 of them. I think they actually have many more, something like 10,000 of those. And now the H20 is the legally allowed chip. Nvidia shipped a million of these last year to China. For context, Nvidia shipped four or five million GPUs total. So the percentage of GPUs that were this China-specific H20 is quite high, roughly 20 to 25 percent.
Why DeepSeek is so cheap
2:09:36
Let's go into DeepSeek again. So, we're in the post-DeepSeek-R1 time, I think, and there are two sides to this market, watching how hard it is to serve it. On one side, we're going to talk about DeepSeek themselves. They now have a chat app that got to number one on the App Store. Disclaimer: number one on the App Store is measured by velocity, so it's not necessarily saying that more people have the DeepSeek app than the ChatGPT app. But it is still remarkable. Claude has never hit number one in the App Store, even though everyone in San Francisco is like, "Oh my god, you've got to use Claude. Don't use ChatGPT." So DeepSeek hit this. They also launched an API product recently, where you can ping their API and get these super long responses for R1 out. At the same time as these are out, we'll get to what's happened to them. Because the model weights for DeepSeek-R1 are openly available and the license is very friendly, the MIT license, which allows commercial use, all of these midsize companies and big companies are trying to be first to serve R1 to their users.
Espionage
2:22:55
And there's an interesting aspect: just because it's open-weights or open-source doesn't mean it can't be subverted, right? There have been many open-source software bugs that have been... For example, there was a Linux bug that was found after 10 years, which was clearly a backdoor, because somebody was like, "Why is this taking half a second to load?" That was the recent one.
Censorship
2:31:57
There's a general concern that models get censored by the companies that deploy them. So, one case where we've seen that, and maybe censorship is one word, alignment, maybe via RLHF or some other way, is another word. We saw that with the black Nazi image generation with Gemini. As you mentioned, we also see that with Chinese models refusing to answer what happened on June 4th, 1989, at Tiananmen Square. So how can this be avoided? And maybe can you just, in general, talk about how this happens and how it can be avoided? You gave multiple examples. There are probably a few things to keep in mind here. One is the Tiananmen Square factual knowledge: how does that get embedded into the models? Two is the Gemini, what you call the black Nazi incident, which is when Gemini as a system had this extra thing put into it that dramatically changed the behavior. And then three is what most people would call general alignment, RLHF post-training. Each of these has a very different scope in how it's applied. If you just look at the model weights, auditing specific facts is extremely hard. You have to comb through the pre-training data, and that's terabytes of files, and look for very specific words or hints of the words-
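The "comb through the pre-training data" step can be illustrated with a naive keyword scan over a directory of text shards. This is an illustrative sketch, not anyone's actual auditing pipeline; at terabyte scale you would shard this work across many machines rather than walk a directory in one loop.

```python
import os
import re

def scan_corpus(root, patterns):
    # Count how often each pattern appears across a directory tree of
    # plain-text pre-training shards. Case-insensitive, so "Tiananmen"
    # and "tiananmen" both count.
    compiled = [(p, re.compile(p, re.IGNORECASE)) for p in patterns]
    counts = {p: 0 for p in patterns}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            for p, rx in compiled:
                counts[p] += len(rx.findall(text))
    return counts
```

Even this naive version shows why auditing from weights alone is hopeless: the facts live (or don't) in the training corpus, and the only direct check is scanning that corpus.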
Andrej Karpathy and magic of RL
2:44:52
This might be a good place to mention the eloquent and insightful tweet of the great and powerful Andrej Karpathy. I think he had a bunch of thoughts, but one of them: "Last thought. Not sure if this is obvious." You know something profound is coming when you're saying you're not sure if it's obvious. "There are two major types of learning, in both children and in deep learning. There's one, imitation learning, watch and repeat, i.e., pre-training, supervised fine-tuning, and two, trial-and-error learning, reinforcement learning. My favorite simple example is AlphaGo. One is learning by imitating expert players. Two is reinforcement learning to win the game. Almost every single shocking result of deep learning, and the source of all magic, is always two."
OpenAI o3-mini vs DeepSeek r1
2:55:23
All right. So, we have fun things happening in real time. This is a good opportunity to talk about other reasoning models, o1, o3; just now, OpenAI, as perhaps expected, released o3-mini. What are we expecting from the different flavors? Can you just lay out the different flavors of the o models, and from Gemini, the reasoning model? Something I would say about these reasoning models is, we talked a lot about reasoning training on math and code. What is done is that you take the base model we've talked about a lot, trained on the internet, and you do this large-scale reasoning training with reinforcement learning. And then, as DeepSeek detailed in this R1 paper, which for me is one of the big open questions on how you do this, they did reasoning-heavy but very standard post-training techniques after the large-scale reasoning RL. So they did a form of instruction tuning through rejection sampling, which is essentially heavily filtered instruction tuning with some reward models. And then they did RLHF, but they made it math-heavy.
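Rejection sampling as described, sample several completions per prompt, score them with a reward model, and keep only the best one for supervised fine-tuning, can be sketched as below. `generate` and `reward` are hypothetical stand-ins for a policy model and a reward model; nothing here reflects DeepSeek's actual filtering thresholds or sample counts.

```python
def rejection_sample(prompts, generate, reward, n_samples=4):
    # For each prompt, draw several candidate completions and keep
    # only the one the reward model scores highest. The kept
    # (prompt, completion) pairs form a heavily filtered
    # instruction-tuning dataset.
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        best = max(candidates, key=lambda c: reward(prompt, c))
        kept.append((prompt, best))
    return kept
```

The design point is that the reward model only gates data; the actual learning step afterward is ordinary supervised fine-tuning on the kept pairs, which is why it counts as "standard post-training."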
NVIDIA
3:14:31
So, it'll get cheaper and cheaper and cheaper. The big DeepSeek-R1 release freaked everybody out because of the cheaper. One of the manifestations of that is that NVIDIA stock plummeted. Can you explain what happened? And also just explain this moment, and whether NVIDIA is going to keep winning. We are both NVIDIA bulls here, I would say. And in some ways, the market response is reasonable. NVIDIA's biggest customers in the US are major tech companies, and they're spending a ton on AI. A simple interpretation of DeepSeek is that you can get really good models without spending as much on AI. So in that capacity, it's like, "Oh, maybe these big tech companies won't need to spend as much on AI," and the stock goes down.
GPU smuggling
3:18:58
The more progress AI makes, or the higher the derivative of AI progress is, especially because NVIDIA's in the best place, the sooner the market's going to be bigger and expanding, and NVIDIA's the only one that does everything reliably right now. Yeah, because it's not like an NVIDIA competitor arose. It's another company that's using NVIDIA-
DeepSeek training on OpenAI data
3:25:36
Yeah. I mean, that's incredibly easy, right? OpenAI publicly stated DeepSeek uses their API, and they say they have evidence, right? This is another element of the training regime: people at OpenAI have claimed that it's a distilled model, i.e., you're taking OpenAI's model, you're generating a lot of output, and then you're training on that output in your own model. And even if that's the case, what DeepSeek did is still amazing, by the way, efficiency-wise. Distillation is standard practice in industry. If you're at a closed lab, where you care about terms of service and IP closely, you distill from your own models. If you're a researcher and you're not building any products, you distill from the OpenAI models-
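Black-box distillation, as described here, reduces to building a supervised dataset out of a teacher's outputs and then training a student on it. The sketch below uses a deliberately silly lookup-table "student" as a stand-in for gradient-based fine-tuning; `teacher` stands in for, say, API calls to a stronger model.

```python
def build_distillation_set(teacher, prompts):
    # Query the (stronger) teacher once per prompt and record its
    # completions as supervised targets for the student.
    return [(p, teacher(p)) for p in prompts]

class LookupStudent:
    # Toy "student" that memorizes the distillation set verbatim.
    # A real student would be gradient-trained on these pairs and
    # would generalize beyond them.
    def __init__(self, dataset):
        self.table = dict(dataset)

    def __call__(self, prompt):
        return self.table.get(prompt, "")
```

The point of the sketch is how little access distillation needs: the student never sees the teacher's weights, only its outputs, which is why terms-of-service clauses are the main barrier rather than anything technical.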
AI megaclusters
3:36:04
Yeah. You have to make sure to close all security vulnerabilities. So, you, Dylan, collect a lot of information about the mega clusters for each of the major AI companies. Can you talk about the buildouts for each one that stand out? Yeah. I think the thing that's really important about these mega-cluster buildouts is that they're completely unprecedented in scale. US data center power consumption has been slowly on the rise, and it's gone up to 2 to 3% of total US power consumption, even through the cloud computing revolution. And that's over decades of data centers, etc. It's been climbing slowly, but now: 2 to 3%.
Who wins the race to AGI?
4:11:26
Let's talk about the broad AI race. Who do you think wins? We talked about Google, Meta. The default leader has been Google because of their infrastructure advantage.
AI agents
4:21:39
Do you think agents are promising? We have to talk about this. This is the excitement of the year, that agents are going to rev... This is the generic hype term that a lot of business folks are using: AI agents are going to revolutionize everything. Okay. So, mostly, the term agent is obviously overblown. We've talked a lot about reinforcement learning as a way to train for verifiable outcomes. Agents should mean something that is open-ended, solving a task independently on its own, and able to adapt to uncertainty. A lot of the time, the term agent is applied to things like Apple Intelligence, which we still don't have after the last WWDC, which is orchestrating between apps, and that type of tool-use thing is something that language models can do really well. Apple Intelligence, I suspect, will come eventually. It's a closed domain: it's your messages app integrating with your photos, with AI in the background. That will work. It has been described as an agent by a lot of software companies to get into the narrative.
Programming and AI
4:30:21
Well, what do you think about the programming context? So, software engineering: that's where I personally, and I know a lot of people, interact with AI the most. There's a lot of fear and angst, too, from current CS students, but that is the area where probably the most AI revenue and productivity gains have come, right? Whether it be Copilot or Cursor or what have you, or just standard ChatGPT. I know very few programmers who don't use ChatGPT, and actually many of them have the $200 tier, because that's what it's so good for. I think in that world we already see it with SWE-bench. If you've looked at the benchmark, made by some Stanford students, I wouldn't say it's really hard, but I wouldn't say it's easy either. I think it takes someone who's been through at least a few years of CS, or a couple of years of programming, to do SWE-bench well, and the models went from 4% to 60% in a year. Where are they going to go next year? It's going to be higher. It probably won't be a hundred percent, because, again, that last nine is really hard to hit, but we're going to get to some point where it saturates, and then we're going to need harder software engineering benchmarks, and so on and so forth.
Open source
4:37:49
Stargate
4:47:01
We didn't really talk about Stargate. I would love to get your opinion on the new administration, the Trump administration: everything that's being done from the American side in supporting AI infrastructure and the efforts of the different AI companies. What do you think about Stargate? What are we supposed to think about Stargate, and does Sam have the money? Yeah, so I think Stargate is an opaque thing. It definitely doesn't have $500 billion; it doesn't even have $100 billion. What they announced is this $500 billion number; Larry Ellison, Sam Altman, and Trump said it. They thanked Trump, and Trump did do some executive actions that do significantly improve the ability for this to be built faster. One of the executive actions is that on federal land, you can basically just build data centers and power, pretty much like that, and the permitting process is basically gone, or you file after the fact. So, again, I had a schizo take earlier; another schizo take: if you've ever been to the Presidio in San Francisco, beautiful area, you could build a power plant and a data center there if you wanted to, because it is federal land. It used to be a military base. Obviously this would piss people off; it's a good bit. Anyways, Trump has made it much easier to do this, right? Generally, Texas has the only unregulated grid in the nation as well.
Future of AI
4:54:30
What are you excited about in these upcoming years, in terms of cluster buildouts, in terms of breakthroughs in AI? The best possible future you can imagine in the next couple of years, two, three, four years: what does that look like? It could be very specific technical things, like breakthroughs in post-training, or it could just be size, big impressive clusters. I really enjoy tracking the supply chain and who's involved in what, I really do. It's really fun to see the numbers, the costs, who's building what capacity, helping companies figure out how much capacity they should build, winning deals, strategic stuff. That's really cool. Technologically, there's a lot around the networking side that really excites me, with optics and electronics getting closer and closer, whether it be co-packaged optics or some new forms of switching.