Episode #452 from 4:51:16
Monosemanticity
So can you talk about the Towards Monosemanticity paper from October last year? I heard about a lot of nice breakthrough results.

That's very kind of you to describe it that way. Yeah, I mean, this was our first real success using sparse autoencoders. So we took a one-layer model, and it turns out if you go and you do dictionary learning on it, you find all these really nice interpretable features. So the Arabic feature, the Hebrew feature, the Base64 features were some examples that we studied in a lot of depth and really showed that they were what we thought they were. It turns out if you train a model twice, that is, train two different models, and do dictionary learning, you find analogous features in both of them. So that's fun. You find all kinds of different features. So that was really just showing that this works. And I should mention that there was this Cunningham et al. paper that had very similar results around the same time.
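The dictionary learning mentioned here can be sketched with a minimal sparse autoencoder: an overcomplete encoder maps model activations to many non-negative feature activations, a decoder reconstructs the activations, and an L1 penalty pushes the features to be sparse. This is only an illustrative sketch, not the paper's implementation; the dimensions, weights, and `l1_coeff` value are made up, the input is random data standing in for a model's activations, and the training loop is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16  # width of a hypothetical one-layer model's activations
d_dict = 64   # overcomplete dictionary: more features than dimensions

# Randomly initialized weights, purely for illustration
W_enc = rng.normal(0.0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0.0, 0.1, (d_dict, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU yields non-negative feature activations; the L1 term below
    # encourages most of them to be exactly zero (sparsity)
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Each row of W_dec is one dictionary element (a "feature direction")
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)         # reconstruction error
    sparsity = l1_coeff * np.mean(np.abs(f))  # L1 sparsity penalty
    return recon + sparsity

# Stand-in batch of activations from the model being interpreted
x = rng.normal(size=(8, d_model))
loss = sae_loss(x)
```

Training would minimize this loss over a large set of real activations; the interpretable features (Arabic, Hebrew, Base64, and so on) would then correspond to individual dictionary elements.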