
Good question!

A standard problem in AI is to find a strategy that will accomplish a goal or solve a problem. LeCun’s proposal does this using a combination of model-based planning and RL, if I recall correctly.
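(If it helps to be concrete: the general shape of "model-based planning against a cost function" can be sketched in a few lines of Python. This is just a toy random-shooting planner, not LeCun's actual architecture; world_model, cost_fn, and the other names are placeholders I made up.)

```python
import numpy as np

def plan_first_action(world_model, cost_fn, state, horizon=10, n_candidates=100, n_actions=4):
    """Toy random-shooting planner: sample candidate action sequences, roll each
    one through the (learned) world model, score it with the cost function, and
    return the first action of the cheapest sequence."""
    best_cost, best_action = float("inf"), None
    for _ in range(n_candidates):
        actions = np.random.randint(n_actions, size=horizon)
        s, total_cost = state, 0.0
        for a in actions:
            s = world_model(s, a)      # predicted next state
            total_cost += cost_fn(s)   # lower cost = "better", per the spec
        if total_cost < best_cost:
            best_cost, best_action = total_cost, int(actions[0])
    return best_action
```

The point that matters below: the agent optimizes whatever cost_fn literally computes, not whatever the programmer meant by it.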

In this setting, there’s a failure mode called “specification gaming”. Victoria Krakovna has a spreadsheet with dozens of examples from different AI projects, and some of them are very amusing; it’s linked from https://deepmindsafetyresearch.medium.com/specification-gaming-the-flip-side-of-ai-ingenuity-c85bdb0deeb4 . Specification gaming exists because there’s no clean distinction between “finding a clever out-of-the-box solution to the problem” and “exploiting edge cases in the setup”. We work hard to build algorithms that do the former, but then they do the latter too.

Anyway, RL and model-based planning can still do lots of useful things despite the existence of specification gaming. Why? Because if we run these algorithms, and we notice them doing something we didn’t want them to do, then we simply turn them off and try to fix the problem. For example, if the Coast Runners boat is on fire and spinning around in circles, but we wanted the boat to follow the normal race course, then OK maybe let’s try editing the reward function to incorporate waypoints or whatever.
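(In toy code, that kind of patch might look something like the sketch below. This is not the actual Coast Runners setup, just an illustration; the state fields, waypoint bonus, and radius are all made up.)

```python
import math

def proxy_reward(state):
    # Mis-specified reward: points for hitting targets, which the boat can
    # farm indefinitely by going in circles instead of finishing the race.
    return 10.0 * state["targets_hit_this_step"]

def patched_reward(state, waypoints, visited, waypoint_bonus=50.0, radius=5.0):
    # One possible patch: also reward progress through the course waypoints,
    # each of which pays out only once per episode.
    reward = 10.0 * state["targets_hit_this_step"]
    for i, waypoint in enumerate(waypoints):
        if i not in visited and math.dist(state["position"], waypoint) < radius:
            visited.add(i)
            reward += waypoint_bonus
    return reward
```

Of course, the patched version is just another specification that can be gamed in some other way; hence the loop of noticing the problem, turning the system off, patching, and retrying.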

That’s a great approach for today, and it will continue being a great approach for a while. But eventually it starts failing in a catastrophic and irreversible way. The problem is this: at some point it will become possible to train an AI that is SO GOOD at real-world planning that it can make plans that are resilient to potential obstacles, and if the programmers are inclined to shut down the AI under certain conditions, then that’s just another potential obstacle that the AI will incorporate into its planning process.

So then if the AI is trying to do something the programmers didn’t want, the normal strategy of “just turn it off and try to fix the problem” stops working. For example, maybe the programmers don’t realize that anything has gone wrong, because the AI is being deceptive. And meanwhile the AI is gathering resources and exfiltrating itself so that it can’t be straightforwardly turned off, etc.

Anyway, all of that is why I think it’s plausible that LeCun’s proposal will generate more and more impressive demos, and more and more profits, for quite a while, even if LeCun et al. make no meaningful progress on their technical alignment problem. But that only holds up to a certain level of capability; past that point, it flips rather sharply into being an existential threat.

PS: Hmm, I guess “specification gaming” sounds a bit like Amelia Bedelia, in that the AI is literally following the reward function / intrinsic cost function source code instead of following the nuanced programmer intentions. But overall I don’t think it’s a good analogy; there are a bunch of differences. For one thing, the reward function / intrinsic cost function is written in Python (or whatever), not in natural language, so, unlike with Amelia Bedelia, there’s an unsolved problem in getting the AI to have any motivation whatsoever to obey the natural-language commands that we say or type, not just in spirit but even in letter. For another thing, there’s an additional layer of indirection in getting from the intrinsic cost to the learned critic, which can cause additional problems, as discussed in that LeCun post I linked above.
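(For concreteness, here’s the rough picture in made-up toy code, not LeCun’s actual system: the intrinsic cost is an ordinary hand-written Python function, and the critic is a learned network trained to approximate its long-run value, which is where the extra indirection comes in.)

```python
import torch
import torch.nn as nn

def intrinsic_cost(state_embedding: torch.Tensor) -> torch.Tensor:
    # Hand-written in code, not in natural language. Whatever the programmer
    # "really meant" is only represented to the extent this arithmetic captures it.
    # (Made-up example: penalize distance from a fixed goal embedding.)
    goal = torch.zeros_like(state_embedding)
    return (state_embedding - goal).pow(2).sum(dim=-1)

# Learned critic: trained to predict cumulative future intrinsic cost from the
# current state. Even if intrinsic_cost were exactly what we wanted, the critic
# is only a learned approximation of it; that's the second layer of indirection.
critic = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))

def critic_loss(state_embedding, observed_future_cost):
    return (critic(state_embedding).squeeze(-1) - observed_future_cost).pow(2).mean()
```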
