Episode #490 from 2:44:06
Long context
We haven't really mentioned it much, but implied in this discussion is context length as well. Is there a lot of innovation possible there? I think the colloquially accepted view is that it's a compute and data problem. Sometimes there are small architectural things, like attention variants. We talked about hybrid attention models, which essentially means having what looks like a state space model inside your transformer. Those are better suited because you spend less compute modeling the tokens furthest back in the context. But they aren't free, because they have to be accompanied by a lot of compute or the right data. How many sequences of 100,000 tokens do you have in the world, and where do you get them? It just ends up being pretty expensive to scale them.
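A minimal sketch of the compute argument behind that point, not tied to any particular model: full attention makes each new token look back over the entire context, so per-token cost grows with length, while a state-space-style layer updates a fixed-size state, so its per-token cost stays flat. The widths, state size, and function names here are illustrative assumptions.

```python
# Illustrative only: contrasts per-token cost of full attention with a
# fixed-size state-space / linear-recurrence layer, to show why hybrid
# stacks spend less compute on tokens far back in the context.

d = 64       # model width (assumed for illustration)
state = 16   # state size of the recurrent layer (assumed)

def attention_flops_per_token(context_len, d=d):
    # Each new token attends over the whole context, so cost grows with length.
    return 2 * context_len * d

def ssm_flops_per_token(d=d, state=state):
    # A state-space layer updates a fixed-size state, so cost per token is
    # constant no matter how long the context gets.
    return 2 * d * state

for n in (1_000, 100_000, 1_000_000):
    print(f"context={n:>9,}: attention={attention_flops_per_token(n):>13,} "
          f"flops/token  ssm={ssm_flops_per_token():>6,} flops/token")
```

The gap is the point: at 100,000 tokens the attention layer is doing orders of magnitude more work per new token than the recurrent one, which is why hybrids help, but the recurrent layers still need the right long-sequence data and compute to actually learn long-range behavior.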