Welcome to the exciting dirigible era of AI
Notes for navigating large language models and beyond...
Consider the dates of these two historical events:
1903: Wright brothers invented the airplane
1919: First non-stop transatlantic airplane flight.
Now try guessing this: In which year did the Hindenburg accident happen?
Many are surprised to learn that it happened in 1937, more than 30 years after heavier-than-air flight was invented!
Of course it is well known that balloons and airships based on hot air, hydrogen, and helium existed before airplanes. But did you know that “For the first thirty years of the twentieth century airships were viewed as a more robust means of transportation than the airplane, consistently surpassing them in range, flight duration, and load-carrying capacity.”? (From the preface of When Giants Ruled the Sky)
A similar situation exists in AI today. The exciting success and rapid progress of large language models in partially, but usefully and impressively, solving many language processing tasks has triggered the thinking that human-like general intelligence can be attained by just scaling up the underlying technology. However, the story of aeronautics should offer us caution: Although scaling to bigger sizes was all that was needed to make balloons carry heavier cargo and fly for longer durations, those advances, while exciting and useful, were on a different path from the airplanes of today.
A clarification is warranted up front: I do think large language models are exciting and useful. OpenAI took a big risk and did marvelous research & engineering to show the world the promise of this, and I’m happy with the success and attention they are getting. I also think many insights can be learned from studying transformer architectures. The point of this article is not to deflate excitement around LLMs (and I don’t think I would succeed even if that were the intention), but only to add perspective in the context of the longer-term goal of human-like general intelligence. I also want to offer hope for people who might want to think differently, while keeping in mind the cost of what they might miss out on, without taking away from the excitement of the moment.
Different ways to fly. Different ways to solve human-like tasks
If flight is defined as traveling through air from location A to location B, there are different ways to fly. Catapulting is one. Ballooning is another. Neither is based on how birds fly.
Even before heavier-than-air flight was invented, balloons were very popular and used extensively. This is because figuring out the principles of heavier-than-air flight was a much harder task — it was just very hard to keep an object that is heavier than air controllably up in the air for a long time. Balloons avoided this problem altogether because they were lighter than air. The success of balloon builders over people experimenting with heavier-than-air flight was so thorough that a New York Times article in 1903 declared that any attempt at flying other than using balloons was unlikely to succeed in a million years, just two months before the Wright brothers’ achievement!
Against the backdrop of pessimism surrounding heavier-than-air flight in the early 1900s, balloons offered this exciting possibility: Without having to figure out the principles of aerodynamics, we can build machines that travel through air and make them carry heavier payloads and go farther simply by building them bigger.
Since most of the success of large language models arises from making the underlying transformer model bigger, training it on more text (trillions of tokens), and training it with more compute for longer durations, transformer-based language models offer an intriguing possibility, just like balloons did in the early 1900s: Without having to figure out the principles behind human intelligence, we could build machines that solve more cognitive and human-like tasks simply by building them bigger and training them with more data, compute, and human feedback.
Dirigibles were an exciting technology, and so are large language models and the scaling of other models.
Similar to today’s large language models, dirigibles were an exciting technology in their time. From the 1890s, Santos-Dumont built a series of steerable airships numbered 1 to 13, all working on the same principles, but successively bigger, more controllable, and safer. Like the parameter counts of language models today, the size of the balloon was an important factor in carrying capacity and controllability. Santos-Dumont No. 4, built in 1900, had a gas capacity of 420 cubic meters. Santos-Dumont No. 5 increased it to 622 cubic meters, and No. 6 increased it slightly to 630 cubic meters. These were used in awe-inspiring and well-publicized flights in France, and captured the imagination of the general public.
Just as large language models required bringing together large-scale computing, GPUs, software engineering, and advances in neural net architectures, dirigibles were engineering marvels that required the integration of the latest in materials, structural engineering, propulsion, hydrogen handling, and navigation. Dirigibles were awe-inducing sights in the sky, and their interiors dwarfed today’s airplanes in space and luxury. The excitement around dirigibles was so high even in the 1930s that the builders of the Empire State Building advertised plans for a mooring mast (an example of an API) atop it by publishing fake photographs of the dirigible USS Los Angeles docking there!
It is alright to be excited about building large neural nets.
The heavier-than-air flight problem was a north star for some folks, and they might have found the excitement around balloons and dirigibles disheartening. Similarly, people who are interested in building real human-like intelligence might be disheartened by the exuberance around large language models. However, this need not be the case.
Dirigibles were a technology whose time had come, and ‘Foundation Models’ is a technology whose time has come. Technological advances require many different critical components to come together at the same time. If any one of them is missing, the advance doesn’t happen. And when they do come together, the advance happens very quickly until you get to the edge of the capabilities of the supporting technologies.
Once the basic principles behind dirigibles were figured out, they had a favorable scaling law going for them — to go farther and carry heavier payloads, you simply had to make them bigger and give them more powerful engines. This was purely an engineering task. Although engineering has its own challenges, it is easy to organize teams around such tasks to create generations of increasingly bigger models, because each generation informs the next.
The large models simply have the right things behind them right now, and there is no stopping them until they exhaust their own runway. It will be a wild ride, and not every effort will succeed, but in the process many useful things will get built. Many companies will be created, and many will become wealthy and build careers on that. We will understand how far we can go just by scaling. We will learn, sometimes through painful experiences, the different failure modes of deploying imperfect models widely, and learn mitigation strategies which might include new regulations. All that is part of bringing a new technology to the world.
It is also OK not to be excited about just scaling up.
Excitement around LLMs doesn’t mean everyone has to be equally excited about them. We make progress by having people willing to question the dominant paradigm and strike out on new paths. It is heartening to see that pioneers of deep learning are among the ones exploring alternative paths1 beyond just scaling up the current architectures. Despite their success, they remain hungry, foolish, and curious.
Again, the analogy to how heavier-than-air flight developed offers some perspective. When people figured out how to achieve heavier-than-air flight, airplanes were deployed on problems that were considered ‘toy problems’ for balloons. They didn’t stay in the air for nearly as long as balloons could, or carry as much cargo as balloons could. And initial deployments of airplanes were in niches that took multiple breakthroughs to expand out from. When the Wright brothers wanted to report on their successful flight, the Associated Press representative initially turned it down because the plane flew for a mere 59 seconds, well below what balloons were capable of at that time.
Ultimately, balloons had fundamental scaling and controllability problems that heavier-than-air flight didn’t have. But it took multiple breakthroughs — first figuring out aerodynamics and lateral control, then figuring out scalable mechanisms to implement those (e.g., ailerons instead of wing warping), building jet engines, etc. — to convincingly demonstrate that for long-distance applications.
Similarly, a small set of researchers, engineers, entrepreneurs, and investors will strike out on different paths to find the principles of intelligence. There is sufficient evidence of problems with the scaling and controllability of language models to warrant such pursuits. And as they figure out more of the principles of intelligence, more efficient and controllable architectures will emerge. But of course these will be compared to large language models, which will have gone through multiple stages of engineering by then, so they might initially be deployed in niches or in complementary situations.
Parting thoughts and future articles
I hope the story of dirigibles offers encouragement for both people who are scaling up, and for people who are exploring other directions.
One criticism of this comparison is that the transformer architecture might already be like heavier-than-air flight, in the sense that it already encapsulates the principles of intelligence. One can never be 100% sure, but there are sufficient reasons to believe this is not the case: the autoregressive architecture has fundamental limitations in learning efficiency, flexible reasoning, mixing in episodic memory, and controllability. Moreover, the ELIZA effect of language often makes us see more than there is while interacting with language models. Just as the Wright brothers learned from birds that flapping wings is not necessary to fly2, using insights from neuroscience and cognitive science to learn planning- and causality-compatible architectures is an exciting frontier. I plan to write more on this in the future.
I’d be surprised if people who are scaling up needed encouragement. This is the scale up moment. Seize it, run with it, don’t look back! The outputs are already exciting and more is to come. And hopefully you are secure enough that you won’t find the comic I have below demotivating in any way.
People who are exploring novel ideas probably need more encouragement amidst all the current excitement. There are sufficient reasons to believe that this moment in history is like the dirigible era — exciting technology that shows the wonderful promise of the real intelligence that is yet to come. Whenever someone is too self-congratulatory and smug about the current progress, maybe this comic will help you find some perspective.
I also think there are avenues to combine the strengths of current approaches with an investigation into future architectures. For this reason, even people who are exploring new directions should remain curious about the current ones, study them in detail, and avoid hasty dismissals.
I plan to write more about these topics in the future. In particular, a few articles I have sketched out are about “world models”, “what is understanding”, “the sweet lesson behind bitter lessons”, etc. If you are interested in these topics, please subscribe.
You might find these links interesting:
I gave two 10-minute presentations related to this at the AGI debate. Check them out here. Presentation 1: Exciting paths forward in AI & Presentation 2: Commonsense needs mental simulation
Space is a latent sequence: Check out this 15-minute presentation that will change your mind about how brains learn and represent space. And read our paper on that: https://arxiv.org/abs/2212.01508
Read this paper by Yann LeCun: https://openreview.net/pdf?id=BZ5a1r-kVsf
Read this paper by Anirudh Goyal & Yoshua Bengio: https://royalsocietypublishing.org/doi/full/10.1098/rspa.2021.0068
Follow me on Twitter: @dileeplearning
Check out my website
Disclaimer: Views expressed here are my own.
For examples, see the papers from Yann LeCun (https://openreview.net/pdf?id=BZ5a1r-kVsf) and Yoshua Bengio (https://royalsocietypublishing.org/doi/full/10.1098/rspa.2021.0068).
It is ironic that “Airplanes don’t flap their wings” is often used as an example for not taking inspiration from biology, because the idea that one can fly without flapping was something the Wright brothers learned by observing soaring birds, allowing them to separate propulsion from control. The Wrights also used observations of birds to design their 3-axis control. See more here: https://arxiv.org/abs/1909.01561