Episode #416 from 38:51
V-JEPA
So that's the I-JEPA. It doesn't need to know that it's an image, for example, because the only thing it needs to know is how to do this masking. Whereas with DINO, you need to know it's an image, because you need to do things like geometric transformations and blurring and things like that, which are really image-specific. A more recent version of this that we have is called V-JEPA. So it's basically the same idea as I-JEPA, except it's applied to video. So now you take a whole video and you mask a whole chunk of it. And what we mask is actually kind of a temporal tube, so a whole segment of each frame in the video, over the entire video. And that tube is statically positioned throughout the frames; it's literally just a straight tube.
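The "straight tube" described here can be pictured as one spatial rectangle that is masked at the same position in every frame. What follows is only a minimal sketch of that idea, not the V-JEPA implementation: the function name `tube_mask`, its parameters, and the patch-grid sizes are all hypothetical, chosen for illustration, and the mask is assumed to live on a grid of patches rather than on raw pixels.

```python
import numpy as np


def tube_mask(num_frames, grid_h, grid_w, top, left, height, width):
    """Return a boolean mask of shape (num_frames, grid_h, grid_w).

    True marks a masked patch. All names and sizes are illustrative
    assumptions, not taken from any released V-JEPA code.
    """
    fits_vertically = 0 <= top and top + height <= grid_h
    fits_horizontally = 0 <= left and left + width <= grid_w
    if not (fits_vertically and fits_horizontally):
        raise ValueError("mask rectangle must fit inside the patch grid")

    # Build the mask for a single frame: one rectangle of patches.
    frame_mask = np.zeros((grid_h, grid_w), dtype=bool)
    frame_mask[top:top + height, left:left + width] = True

    # Repeat the identical spatial mask for every frame, so the masked
    # region does not move over time: a straight tube through the video.
    return np.broadcast_to(frame_mask, (num_frames, grid_h, grid_w)).copy()


# Example: 8 frames on a 14x14 patch grid, masking a 6x5 block of patches.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14,
                 top=3, left=4, height=6, width=5)
```

Because the rectangle is the same in every frame, the masked region cannot be filled in by copying the same location from a neighboring frame, which is the property the straight tube is meant to capture in this sketch.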