Discussion about this post

Frans Zdyb

Great post! I highly recommend looking into the literature on assistance games, which is a specific proposal for how to get AI to infer our intentions rather than optimize a prespecified reward. See e.g. https://arxiv.org/abs/1606.03137

I think this area will become very relevant very soon, not just from a safety perspective, but even just for expanding the set of tasks that AI can take on - as you mention, reward design is not a great strategy.

Melanie Mitchell

Nice essay.

"While AI understands your end goal, its lack of commonsense means that it didn’t understand that it was not supposed to destroy the earth in the process as a sub-goal."

Many "AI safety" folks have countered that the issue is *not* that the hypothetical superintelligent AI doesn't *understand* that this is not what humans intended, but that it is programmed or trained to only *care* about goal/subgoals that it is explicitly given. I personally don't find this plausible, but this is how people have responded to me when I have critiqued such hypotheticals.

Anyway, great discussion.
