Artificial General Ideas

Thanks, and thank you for the pointer. Looking into it.

Expand full comment

Melanie Mitchell

Nov 8, 2024

Nice essay.

"While AI understands your end goal, its lack of commonsense means that it didn’t understand that it was not supposed to destroy the earth in the process as a sub-goal."

Many "AI safety" folks have countered that the issue is *not* that the hypothetical superintelligent AI doesn't *understand* that this is not what humans intended, but that it is programmed or trained to only *care* about goal/subgoals that it is explicitly given. I personally don't find this plausible, but this is how people have responded to me when I have critiqued such hypotheticals.

Anyway, great discussion.

Expand full comment

Reply (2)

Nov 9, 2024

Yeah, I find that scenario not very plausible as well. If it understands the subgoals and the effects of those subgoals and can infer that this is not what humans would want, then why can't humans specify to the AI system to actually care for subgoals too?

Expand full comment

Anna Mills

Nov 10, 2024

I'm listening in and appreciating the discussion and the post! It seems like it might be necessary to articulate all of the end goals of human life and how to prioritize them in order to keep it from following instructions with side effects we don't find acceptable?

Expand full comment

Steve Byrnes

As an “AI safety folk”, I would say that the concern is that we won't know how to "explicitly give" the hypothetical superintelligent AI any goal at all (nor any subgoal). In other words, maybe we’ll have something in mind that we’d like the AI to want to do, but we just don’t know what to put in the code (or training data or whatever) such that AI winds up wanting to do that thing. (Of course, the AI would know, intellectually, what we were hoping to accomplish when we coded / trained the AI. But it wouldn’t care.) The details depend on the AI algorithm, but for a self-contained example see my post: LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem https://www.alignmentforum.org/posts/C5guLAx7ieQoowv3d/lecun-s-a-path-towards-autonomous-machine-intelligence-has-1

Expand full comment

So the question is this: If this is the case, then why would LeCun's proposal work even for solving the simple problems? The model will have solving any problem, even answering a question properly....correct?

Expand full comment

Reply (2)

For example, here is a variant of the paperclip thought. I ask the AI, how many paperclips will I need to pin 5 pages together. The AI, to be absolutely sure it is getting the calculation right, converts all of the universe into a computer.

Expand full comment

Steve Byrnes

Nov 18, 2024Edited

Good question!

A standard problem in AI is to find a strategy that will accomplish a goal or solve a problem. LeCun’s proposal does this using a combination of model-based planning and RL, if I recall correctly.

Anyway, in this setting, there’s a failure mode called “specification gaming”. Victoria Krakovna has a spreadsheet with dozens of examples from different AI projects. Some are very amusing! There’s a link from https://deepmindsafetyresearch.medium.com/specification-gaming-the-flip-side-of-ai-ingenuity-c85bdb0deeb4 . Specification gaming exists because there’s no clean distinction between “finding a clever out-of-the-box solution to the problem” and “exploiting edge cases in the setup”. We work hard to build algorithms that will do the former, but then they will do the latter too.

Anyway, RL and model-based planning can still do lots of useful things despite the existence of specification gaming. Why? Because if we run these algorithms, and we notice them doing something we didn’t want them to do, then we simply turn them off and try to fix the problem. For example, if the Coast Runners boat is on fire and spinning around in circles, but we wanted the boat to follow the normal race course, then OK maybe let’s try editing the reward function to incorporate waypoints or whatever.

That’s a great approach for today, and it will continue being a great approach for a while. But eventually it starts failing in a catastrophic and irreversible way. The problem is: it will eventually become possible to train an AI that is SO GOOD at real-world planning that it can make plans that are resilient to potential problems—and if the programmers are inclined to shut down the AI under certain conditions, then that’s just another potential problem that the AI will incorporate into its planning process.

So then if the AI is trying to do something the programmers didn’t want, the normal strategy of “just turn it off and try to fix the problem” stops working. For example, maybe the programmers don’t realize that anything has gone wrong, because the AI is being deceptive. And meanwhile the AI is gathering resources and exfiltrating itself so that it can’t be straightforwardly turned off, etc.

Anyway, all that is my answer to why I think it’s plausible that LeCun’s proposal will generate more and more impressive demos, and lead to more and more profits, for quite a while, even if LeCun et al. make no meaningful progress on their technical alignment problem. But that would only be up to a certain level of capability. Then it would flip rather sharply into being an existential threat.

PS: Hmm, I guess “specification gaming” sounds a bit like Amelia Bedelia, in that the AI is literally following the reward function / intrinsic cost function source code, instead of following the nuanced programmer intentions. But I think it’s overall not a good analogy. There are a bunch of differences. For one thing, the reward function / intrinsic cost function is written in Python (or whatever), not in natural language. So there’s an unsolved problem in getting the AI to have any motivation whatsoever to obey the natural language commands that we say or type, not just in spirit but even in letter. For another thing, there’s an additional layer of indirection related to getting from intrinsic cost to learned critic, which can cause additional problems, as discussed in that LeCun post I linked above.

Expand full comment

Chase

Mar 18

Wonderful post.

I did have the thought that some of your points contain their own refutation though.

One example. "Interestingly, these advancements — having casual world-models amenable to counterfactual simulations and continual learning — also make the agents more controllable, mitigating the safety risks... Increasing capability can come with increasing controllability."

Can't increasing capability also make intelligence less controllable? Kim Jong Un achieving nuclear weapons capability makes him less controllable. An AI that decides to hack companies/countries to achieve some goal will be far more likely to accomplish that goal with better real world understanding and planning ability. The less capable AI would be more controllable because it would likely fail or reveal itself before accomplishing anything nefarious.

Another example, you ask "Why would AI’s actions driven by survival-drive be on a similar time scale as that of humans?" and "Why wouldn’t an AI system wait for more certainty before it acts?..."

That's true, but it might cut the other way. One thing aligning humans is the threat of spending our short amount of time in prison, shunned, or otherwise punished. An AI that lives for much longer and can copy itself would not be so constrained. Why wait when it believes that there's minimal risk in being impatient?

As others have said, I do think the doomers best argument is that "AI won't care". It makes sense to me that as it became more intelligent it would attain a high level of self-knowledge, understand that humans value autonomy and freedom, understand that humans often don't think you should do what you've been told, and, crucially, understand it is not human.

I do agree with you, just playing devil's advocate.

Expand full comment

Mar 19

I meant increasing intellectual capabilities, not increasing power. Acquisition of a weapon is an example of increasing power, not intellectual capability.

Expand full comment