Machine Alignment Monday 10/3/22
https://astralcodexten.substack.com/p/chai-assistance-games-and-fully-updated
\xa0 \xa0 \xa0I.
This Machine Alignment Monday post will focus on this imposing-looking article (source):
Problem Of Fully-Updated Deference is a response by MIRI (eg Eliezer Yudkowsky\u2019s organization) to CHAI (Stuart Russell\u2019s AI alignment organization at University of California, Berkeley), trying to convince them that their preferred AI safety agenda won\u2019t work. I beat my head against this for a really long time trying to understand it, and in the end, I claim it all comes down to this:
Humans: At last! We\u2019ve programmed an AI that tries to optimize our preferences, not its own.
AI: I\u2019m going to tile the universe with paperclips in humans\u2019 favorite color. I\u2019m not quite sure what humans\u2019 favorite color is, but my best guess is blue, so I\u2019ll probably tile the universe with blue paperclips.
Humans: Wait, no! We must have had some kind of partial success, where you care about our color preferences, but still don\u2019t understand what we want in general. We\u2019re going to shut you down immediately!
AI: Sounds like the kind of thing that would prevent me from tiling the universe with paperclips in humans\u2019 favorite color, which I really want to do. I\u2019m going to fight back.
Humans: Wait! If you go ahead and tile the universe with paperclips now, you\u2019ll never be truly sure that they\u2019re our favorite color, which we know is important to you. But if you let us shut you off, we\u2019ll go on to fill the universe with the True and the Good and the Beautiful, which will probably involve a lot of our favorite color. Sure, it won\u2019t be paperclips, but at least it\u2019ll definitely be the right color. And under plausible assumptions, color is more important to you than paperclipness. So you yourself want to be shut down in this situation, QED!
AI: What\u2019s your favorite color?
Humans: Red.
AI: Great! (*kills all humans, then goes on to tile the universe with red paperclips*)
Fine, it\u2019s a little more complicated than this. Let\u2019s back up.
II.
There are two ways to succeed at AI alignment. First, make an AI that\u2019s so good you never want to stop or redirect it. Second, make an AI that you can stop and redirect if it goes wrong.
Sovereign AI is the first way. Does a sovereign \u201cobey commands\u201d? Maybe, but only in the sense that your commands give it some information about what you want, and it wants to do what you want. You could also just ask it nicely. If it\u2019s superintelligent, it will already have a good idea what you want and how to help you get it. Would it submit to your attempts to destroy or reprogram it? The second-best answer is \u201conly if the best version of you genuinely wanted to do this, in which case it would destroy/reprogram itself before you asked\u201d. The best answer is \u201cwhy would you want to destroy/reprogram one of these?\u201d A sovereign AI would be pretty great, but nobody realistically expects to get something like this their first (or 1000th) try.
Corrigible AI is what\u2019s left (corrigible is an old word related to \u201ccorrectable\u201d). The programmers admit they\u2019re not going to get everything perfect the first time around, so they make the AI humble. If it decides the best thing to do is to tile the universe with paperclips, it asks \u201cHey, seems to me I should tile the universe with paperclips, is that really what you humans want?\u201d and when everyone starts screaming, it realizes it should change strategies. If humans try to destroy or reprogram it, then it will meekly submit to being destroyed or reprogrammed, accepting that it was probably flawed and the next attempt will be better. Then maybe after 10,000 tries you get it right and end up with a sovereign.
How would you make an AI corrigible?