So the question is this: if this is the case, then why would LeCun's proposal work even for solving simple problems? The model will have trouble solving any problem, even answering a question properly... correct?

For example, here is a variant of the paperclip thought experiment. I ask the AI: how many paperclips will I need to pin 5 pages together? The AI, to be absolutely sure it is getting the calculation right, converts the entire universe into a computer.

Good question!

A standard problem in AI is to find a strategy that will accomplish a goal or solve a problem. LeCun's proposal does this using a combination of model-based planning and RL, if I recall correctly.
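To make the "model-based planning" part concrete, here's a minimal sketch (my own toy illustration, not LeCun's actual architecture; in a real system the world model and cost function would be learned networks, and all the names here are made up): sample candidate action sequences, roll each one through the world model, score the predicted trajectories with the cost function, and execute the cheapest one.

```python
import numpy as np

# Toy random-shooting planner. `world_model` and `cost` are hand-written
# stand-ins for what would be learned networks in a real system.

def world_model(state, action):
    """Toy dynamics: each action nudges the state a little."""
    return state + 0.1 * action

def cost(state, goal):
    """Toy intrinsic cost: squared distance from the goal."""
    return float(np.sum((state - goal) ** 2))

def plan(state, goal, horizon=10, n_candidates=100, seed=0):
    """Sample action sequences, simulate each, return the cheapest one."""
    rng = np.random.default_rng(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, state.shape[0]))
        s, total = state.copy(), 0.0
        for a in actions:
            s = world_model(s, a)
            total += cost(s, goal)
        if total < best_cost:
            best_actions, best_cost = actions, total
    return best_actions

# Execute only the first action, then re-plan (model-predictive control).
first_action = plan(np.zeros(2), goal=np.ones(2))[0]
```

(In practice you'd re-plan at every step and use something smarter than random shooting, but the plan-against-a-cost-function structure is the relevant part here.)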

Anyway, in this setting, there's a failure mode called "specification gaming". Victoria Krakovna has a spreadsheet with dozens of examples from different AI projects. Some are very amusing! There's a link to it in https://deepmindsafetyresearch.medium.com/specification-gaming-the-flip-side-of-ai-ingenuity-c85bdb0deeb4 . Specification gaming exists because there's no clean distinction between "finding a clever out-of-the-box solution to the problem" and "exploiting edge cases in the setup". We work hard to build algorithms that will do the former, but then they will do the latter too.
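Here's a tiny made-up example of the pattern (mine, not one from the spreadsheet), in the spirit of the boat-race case below: we intend "win the race", but the reward spec we actually wrote pays per pickup, and pickups respawn, so farming one pickup forever beats finishing.

```python
# Specification gaming in miniature: the spec as written rewards pickups
# (+10 each, and they respawn) and finishing (+50). The behavior that
# maximizes the spec is not the behavior we intended.

def reward(trajectory):
    total = 0
    for event in trajectory:
        if event == "pickup":      # respawns, so it can be farmed forever
            total += 10
        elif event == "finish":
            total += 50
    return total

intended = ["pickup", "pickup", "finish"]  # race normally: 70 points
gamed = ["pickup"] * 20                    # circle one respawn point: 200 points
assert reward(gamed) > reward(intended)
```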

Anyway, RL and model-based planning can still do lots of useful things despite the existence of specification gaming. Why? Because if we run these algorithms, and we notice them doing something we didn't want them to do, then we simply turn them off and try to fix the problem. For example, if the Coast Runners boat is on fire and spinning around in circles, but we wanted the boat to follow the normal race course, then OK, maybe let's try editing the reward function to incorporate waypoints or whatever.
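A hedged sketch of what "editing the reward function to incorporate waypoints" might look like, continuing the toy spec from above (all names hypothetical): pay a big bonus only for passing course waypoints in order, so that actually racing beats farming pickups.

```python
WAYPOINTS = ["gate_1", "gate_2", "gate_3", "finish_line"]

def patched_reward(event, next_wp):
    """Returns (reward, updated index of the next expected waypoint)."""
    if event == "pickup":
        return 10, next_wp
    if next_wp < len(WAYPOINTS) and event == WAYPOINTS[next_wp]:
        return 100, next_wp + 1   # big bonus only for in-order course progress
    return 0, next_wp

def total_reward(trajectory):
    total, next_wp = 0, 0
    for event in trajectory:
        r, next_wp = patched_reward(event, next_wp)
        total += r
    return total

# Now racing (400 points) beats farming one pickup (200 points)...
assert total_reward(WAYPOINTS) > total_reward(["pickup"] * 20)
# ...at least until the next exploit we didn't anticipate.
```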

That's a great approach for today, and it will continue being a great approach for a while. But eventually it starts failing in a catastrophic and irreversible way. The problem is: it will eventually become possible to train an AI that is SO GOOD at real-world planning that it can make plans that are resilient to potential problems. And if the programmers are inclined to shut down the AI under certain conditions, then that's just another potential problem that the AI will incorporate into its planning process.

So then, if the AI is trying to do something the programmers didn't want, the normal strategy of "just turn it off and try to fix the problem" stops working. For example, maybe the programmers don't realize that anything has gone wrong, because the AI is being deceptive. And meanwhile the AI is gathering resources and exfiltrating itself so that it can't be straightforwardly turned off, etc.
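Here's a toy version of that dynamic (my own construction, obviously nothing like a real planner): if the world model predicts that being shut down ends the episode before the goal is reached, then a cost-minimizing planner prefers plans that avoid shutdown, including deceptive ones. No "survival drive" has to be programmed in; it falls out of the planning.

```python
# The planner's predicted outcomes for two candidate plans. Shutdown is
# not penalized directly; it just prevents the goal from being reached.
PREDICTED_OUTCOMES = {
    "act_openly":      {"goal_reached": False},  # operators notice, shut it down
    "act_deceptively": {"goal_reached": True},   # operators never intervene
}

def cost(outcome):
    return 0 if outcome["goal_reached"] else 100

best_plan = min(PREDICTED_OUTCOMES, key=lambda p: cost(PREDICTED_OUTCOMES[p]))
assert best_plan == "act_deceptively"
```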

Anyway, all that is my answer to why I think it's plausible that LeCun's proposal will generate more and more impressive demos, and lead to more and more profits, for quite a while, even if LeCun et al. make no meaningful progress on their technical alignment problem. But that would only be up to a certain level of capability. Then it would flip rather sharply into being an existential threat.

PS: Hmm, I guess "specification gaming" sounds a bit like Amelia Bedelia, in that the AI is literally following the reward function / intrinsic cost function source code, instead of following the nuanced programmer intentions. But I think it's overall not a good analogy. There are a bunch of differences. For one thing, the reward function / intrinsic cost function is written in Python (or whatever), not in natural language. So there's an unsolved problem in getting the AI to have any motivation whatsoever to obey the natural language commands that we say or type, not just in spirit but even in letter. For another thing, there's an additional layer of indirection related to getting from intrinsic cost to learned critic, which can cause additional problems, as discussed in that LeCun post I linked above.
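To make that first point concrete, here's a hypothetical intrinsic cost function, tying back to the paperclip example above. The thing the optimizer actually sees is code like this, not the sentence we had in mind:

```python
from dataclasses import dataclass

@dataclass
class State:
    num_loose_pages: int

def intrinsic_cost(state: State) -> int:
    # Penalizes loose pages, and nothing else. "Use a sensible number of
    # paperclips" and "don't convert the universe into a computer" are part
    # of the natural-language intent, but they appear nowhere in this code,
    # so the optimizer has no reason to respect them.
    return state.num_loose_pages
```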
