10 Comments

Great post! I highly recommend looking into the literature on assistance games, which is a specific proposal for how to get AI to infer our intentions, rather than optimize a prespecified reward. See eg https://arxiv.org/abs/1606.03137

I think this area will become very relevant very soon, not just from a safety perspective, but even just for expanding the set of tasks that AI can take on - as you mention, reward design is not a great strategy.

Expand full comment

Thanks, and thank you for the pointer. Looking into it.

Expand full comment

Nice essay.

"While AI understands your end goal, its lack of commonsense means that it didn’t understand that it was not supposed to destroy the earth in the process as a sub-goal."

Many "AI safety" folks have countered that the issue is *not* that the hypothetical superintelligent AI doesn't *understand* that this is not what humans intended, but that it is programmed or trained to only *care* about goal/subgoals that it is explicitly given. I personally don't find this plausible, but this is how people have responded to me when I have critiqued such hypotheticals.

Anyway, great discussion.

Expand full comment

Yeah, I find that scenario not very plausible as well. If it understands the subgoals and the effects of those subgoals and can infer that this is not what humans would want, then why can't humans specify to the AI system to actually care for subgoals too?

Expand full comment

I'm listening in and appreciating the discussion and the post! It seems like it might be necessary to articulate all of the end goals of human life and how to prioritize them in order to keep it from following instructions with side effects we don't find acceptable?

Expand full comment

As an “AI safety folk”, I would say that the concern is that we won't know how to "explicitly give" the hypothetical superintelligent AI any goal at all (nor any subgoal). In other words, maybe we’ll have something in mind that we’d like the AI to want to do, but we just don’t know what to put in the code (or training data or whatever) such that AI winds up wanting to do that thing. (Of course, the AI would know, intellectually, what we were hoping to accomplish when we coded / trained the AI. But it wouldn’t care.) The details depend on the AI algorithm, but for a self-contained example see my post: LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem https://www.alignmentforum.org/posts/C5guLAx7ieQoowv3d/lecun-s-a-path-towards-autonomous-machine-intelligence-has-1

Expand full comment

So the question is this: If this is the case, then why would LeCun's proposal work even for solving the simple problems? The model will have solving any problem, even answering a question properly....correct?

Expand full comment

For example, here is a variant of the paperclip thought. I ask the AI, how many paperclips will I need to pin 5 pages together. The AI, to be absolutely sure it is getting the calculation right, converts all of the universe into a computer.

Expand full comment

Good question!

A standard problem in AI is to find a strategy that will accomplish a goal or solve a problem. LeCun’s proposal does this using a combination of model-based planning and RL, if I recall correctly.

Anyway, in this setting, there’s a failure mode called “specification gaming”. Victoria Krakovna has a spreadsheet with dozens of examples from different AI projects. Some are very amusing! There’s a link from https://deepmindsafetyresearch.medium.com/specification-gaming-the-flip-side-of-ai-ingenuity-c85bdb0deeb4 . Specification gaming exists because there’s no clean distinction between “finding a clever out-of-the-box solution to the problem” and “exploiting edge cases in the setup”. We work hard to build algorithms that will do the former, but then they will do the latter too.

Anyway, RL and model-based planning can still do lots of useful things despite the existence of specification gaming. Why? Because if we run these algorithms, and we notice them doing something we didn’t want them to do, then we simply turn them off and try to fix the problem. For example, if the Coast Runners boat is on fire and spinning around in circles, but we wanted the boat to follow the normal race course, then OK maybe let’s try editing the reward function to incorporate waypoints or whatever.

That’s a great approach for today, and it will continue being a great approach for a while. But eventually it starts failing in a catastrophic and irreversible way. The problem is: it will eventually become possible to train an AI that is SO GOOD at real-world planning that it can make plans that are resilient to potential problems—and if the programmers are inclined to shut down the AI under certain conditions, then that’s just another potential problem that the AI will incorporate into its planning process.

So then if the AI is trying to do something the programmers didn’t want, the normal strategy of “just turn it off and try to fix the problem” stops working. For example, maybe the programmers don’t realize that anything has gone wrong, because the AI is being deceptive. And meanwhile the AI is gathering resources and exfiltrating itself so that it can’t be straightforwardly turned off, etc.

Anyway, all that is my answer to why I think it’s plausible that LeCun’s proposal will generate more and more impressive demos, and lead to more and more profits, for quite a while, even if LeCun et al. make no meaningful progress on their technical alignment problem. But that would only be up to a certain level of capability. Then it would flip rather sharply into being an existential threat.

PS: Hmm, I guess “specification gaming” sounds a bit like Amelia Bedelia, in that the AI is literally following the reward function / intrinsic cost function source code, instead of following the nuanced programmer intentions. But I think it’s overall not a good analogy. There are a bunch of differences. For one thing, the reward function / intrinsic cost function is written in Python (or whatever), not in natural language. So there’s an unsolved problem in getting the AI to have any motivation whatsoever to obey the natural language commands that we say or type, not just in spirit but even in letter. For another thing, there’s an additional layer of indirection related to getting from intrinsic cost to learned critic, which can cause additional problems, as discussed in that LeCun post I linked above.

Expand full comment

If the AI is capable of dealing with open world problems, then it is likely also capable of stopping when in doubt and clarifying/exploring instead of pushing ahead in a particular direction. Open world problem solving is highly collaborative: without that skill of collaboration, it won't be able be a good problem solver. That collaboration which will make the AI a good problem solver, will also give us control to steer it.

There is one scenario I think of where we might have less control: if we use genetic style algorithms with random mutations to guide development of its "motivation circuits". But then it will be a choice we would have made when setting the system, and not something accidental. Also, such approaches will be very inefficient compute wise.

Expand full comment