Episode #416 from 38:51

V-JEPA

So that's I-JEPA. It doesn't need to know that it's an image, for example, because the only thing it needs to know is how to do this masking. Whereas with DINO, you need to know it's an image, because you need to do things like geometric transformations and blurring, things that are really image-specific. A more recent version of this that we have is called V-JEPA. It's basically the same idea as I-JEPA, except it's applied to video. So now you take a whole video and you mask a whole chunk of it. And what we mask is actually kind of a temporal tube: a whole segment of each frame in the video, over the entire video. And that tube is statically positioned throughout the frames; it's literally just a straight tube.
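To make the "temporal tube" concrete, here is a minimal sketch in plain Python. It assumes the video has already been cut into a grid of patches per frame; the function name `tube_mask`, the grid and block sizes, and the uniform choice of block position are all illustrative assumptions, not details given in the conversation or taken from the V-JEPA code.

```python
import random

def tube_mask(num_frames, grid_h, grid_w, block_h, block_w, seed=None):
    """Return mask[t][r][c], True where a patch is hidden.

    Hypothetical helper: one rectangular block of patches is chosen once
    and reused in every frame, so the masked region is a straight tube
    through time.
    """
    rng = random.Random(seed)
    # Top-left corner of the block, picked uniformly (an assumption).
    top = rng.randint(0, grid_h - block_h)    # randint is inclusive
    left = rng.randint(0, grid_w - block_w)
    spatial = [
        [top <= r < top + block_h and left <= c < left + block_w
         for c in range(grid_w)]
        for r in range(grid_h)
    ]
    # Same spatial block in each frame: the tube does not move.
    return [[row[:] for row in spatial] for _ in range(num_frames)]

mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, block_h=6, block_w=6, seed=0)
hidden_per_frame = [sum(sum(row) for row in frame) for frame in mask]
print(hidden_per_frame)
```

Every frame hides the same 6 x 6 block of patches, which is what "statically positioned" means here. Nothing in the function depends on the content being pixels, only on the patch grid, which matches the point that the method only needs to know how to do the masking.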


Yann LeCun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI