Episode #459

DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

Dylan Patel is the founder of SemiAnalysis, a research & analysis company specializing in semiconductors, GPUs, CPUs, and AI hardware. Nathan Lambert is a research scientist at the Allen Institute for AI (Ai2) and the author of a blog on AI called Interconnects. Thank you for listening ❤ Check out our sponsors: https://lexfridman.com/sponsors/ep459-sc See below for timestamps, transcript, and to give feedback, submit questions, contact Lex, etc.


Where to start

DeepSeek-R1 and DeepSeek-V3

A lot of people are curious to understand China's DeepSeek AI models, so let's lay it out. Nathan, can you describe what DeepSeek-V3 and DeepSeek-R1 are, how they work, and how they're trained? Let's look at the big picture and then we'll zoom in on the details. DeepSeek-V3 is a new mixture-of-experts transformer language model from DeepSeek, which is based in China. They have some new specifics in the model that we'll get into. Largely, this is an open-weight model, and it's an instruction model like what you would use in ChatGPT. They also released what is called the base model, which is the model before these post-training techniques are applied. Most people use instruction models today, and those are what's served in all sorts of applications. This was released on, I believe, December 26th or that week. And then weeks later, on January 20th, DeepSeek released DeepSeek-R1, which is a reasoning model, which really accelerated a lot of this discussion.

Start at 3:33

Low cost of training

There are two main techniques that they implemented that are probably the majority of their efficiency, and then there are a lot of implementation details that maybe we'll gloss over or get into later that contribute to it. Those two main things are: one, they went to a mixture-of-experts model, which we'll define in a second; and the other is that they invented this new technique called MLA, multi-head latent attention. Both of these are big deals. Mixture of experts is something that's been in the literature for a handful of years, and OpenAI with GPT-4 was the first one to productize a mixture-of-experts model. What this means is, when you look at the common open models that most people have been able to interact with, think Llama. Llama is a dense model, i.e., every single parameter or neuron is activated as you're going through the model for every single token you generate. With a mixture-of-experts model, you don't do that. How does the human brain actually work? Well, my visual cortex is active when I'm thinking about vision tasks, my amygdala is active when I'm scared. Different aspects of your brain are focused on different things. A mixture-of-experts model attempts to approximate this to some extent. It's nowhere close to what a brain architecture is, but different portions of the model activate. You'll have a set number of experts in the model and a set number that are activated each time, and this dramatically reduces both your training and inference costs. If you think about the parameter count as the total embedding space for all of the knowledge you're compressing down during training, then instead of having to activate every single parameter every single time you're training or running inference, you can activate just a subset, and the model will learn which expert to route to for different tasks.
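To make the routing idea described above concrete, here is a minimal sketch of a mixture-of-experts layer in PyTorch. This is not DeepSeek's implementation: the layer sizes, the number of experts, the top-2 routing, and the names (TinyMoELayer, etc.) are illustrative assumptions. It only shows why running a few selected experts per token, rather than one large dense feed-forward block, cuts training and inference compute.

```python
# A minimal mixture-of-experts sketch, not DeepSeek's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router scores every token against every expert.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is a small feed-forward network; in a dense model,
        # every token would instead pass through one big FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, which is where the
        # compute savings described above come from.
        for i, expert in enumerate(self.experts):
            mask = (topk_idx == i)              # (tokens, top_k) bool
            token_mask = mask.any(dim=-1)
            if token_mask.any():
                gate = (topk_w * mask).sum(dim=-1, keepdim=True)[token_mask]
                out[token_mask] += gate * expert(x[token_mask])
        return out

x = torch.randn(16, 512)           # 16 tokens of width 512
print(TinyMoELayer()(x).shape)     # torch.Size([16, 512])
```

With 8 experts and top-2 routing, each token touches only a quarter of the expert parameters on any given forward pass, even though the total parameter count (the "embedding space" for knowledge mentioned above) stays large.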

Start at 25:07

DeepSeek compute cluster

DeepSeek is very interesting. This is worth taking a second to zoom out on who they are, first of all, right? High-Flyer is a hedge fund that has historically done quantitative trading in China as well as elsewhere, and they have always had a significant number of GPUs. In the past, a lot of these high-frequency, algorithmic quant traders used FPGAs, but it has definitely shifted to GPUs; there's both, but GPUs especially. High-Flyer is the hedge fund that owns DeepSeek, and everyone who works for DeepSeek is part of High-Flyer to some extent: same parent company, same owner, same CEO. They had all these resources and infrastructure for trading, and then they devoted a humongous portion of them to training models, both language models and otherwise, because these techniques were heavily AI-influenced.

Start at 51:25

Key takeaways
  • DeepSeek-R1 and DeepSeek-V3
  • Low cost of training
  • DeepSeek compute cluster
  • Export controls on GPUs to China