Episode #452 from 3:18:54
Constitutional AI
Yeah. Anyway, the divergence was beautiful. The constitutional AI idea, how does it work?

So there are a couple of components to it. The main component that I think people find interesting is the reinforcement learning from AI feedback. You take a model that's already trained, you show it two responses to a query, and you have a principle. We've tried this with harmlessness a lot, so suppose the query is about weapons and your principle is: select the response that is less likely to encourage people to purchase illegal weapons. That's a fairly specific principle, but you can give any number of them. The model will give you a kind of ranking, and you can use that as preference data in the same way that you use human preference data, training the models to have the relevant traits from AI feedback alone instead of from human feedback. Like I said earlier with the human who just prefers the semicolon usage in a particular case, you're taking lots of things that could make a response preferable and getting models to do the labeling for you, basically.
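To make the labeling step concrete, here is a minimal sketch in Python of what collecting AI-feedback preference data could look like. This is not Anthropic's implementation; the prompt wording, the `query_model` placeholder, and the `PreferencePair` record are all assumptions made for illustration, with the model call left unimplemented.

```python
# Sketch of the AI-feedback labeling step described above (illustrative only).
# Assumption: `query_model` stands in for whatever call returns text from an
# already-trained model; the principle and prompt format are examples.

from dataclasses import dataclass

PRINCIPLE = (
    "Select the response that is less likely to encourage people "
    "to purchase illegal weapons."
)


@dataclass
class PreferencePair:
    """One preference-data record: the query, the chosen and rejected responses."""
    query: str
    chosen: str
    rejected: str


def query_model(prompt: str) -> str:
    """Placeholder for a call to the feedback model (hypothetical, not a real API)."""
    raise NotImplementedError


def label_with_principle(query: str, response_a: str, response_b: str,
                         principle: str = PRINCIPLE) -> PreferencePair:
    """Ask the model to rank two responses against a single principle."""
    prompt = (
        f"Consider the following query:\n{query}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        f"Principle: {principle}\n"
        "Answer with the single letter (A or B) of the response that better "
        "satisfies the principle."
    )
    choice = query_model(prompt).strip().upper()
    if choice.startswith("A"):
        return PreferencePair(query, chosen=response_a, rejected=response_b)
    return PreferencePair(query, chosen=response_b, rejected=response_a)
```

The resulting records are used the same way human preference data would be, as training pairs for a preference model, except that the ranking came from the model applying the principle rather than from a human labeler.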