This post is for deconfusing:
Ⅰ. what is meant by AI and evolution.
Ⅱ. how evolution actually works.
Ⅲ. the stability of AI goals.
Ⅳ. the controllability of AI.
Along the way, I address some common conceptions of each held in the alignment community, as articulated clearly (but, I will argue, mistakenly) by Eliezer Yudkowsky.
Ⅰ. Definitions and distinctions
By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it. Of course this problem is not limited to the field of AI. Jacques Monod wrote: “A curious aspect of the theory of evolution is that everybody thinks he understands it”
— Yudkowsky, 2008
There is a danger to thinking fast about ‘AI’ and ‘evolution’: you can skip crucial considerations. Better to build this up in slower steps. First, let’s pin down both concepts.
Here’s the process of evolution in its most fundamental sense:
Evolution consists of a feedback loop, where ‘the code’ causes effects in ‘the world’ and effects in ‘the world’ in turn cause changes in ‘the code’. Biologists refer to the set of code stored within a lifeform as its ‘genotype’. The code’s effects are the ‘phenotypes’.
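As a minimal sketch of that loop (the names and numbers below are illustrative, not a model of any real lifeform): code variants cause effects in a simulated ‘world’, and those effects feed back into which variants end up existing more.

```python
import random

# Toy evolutionary feedback loop: 'code' (genotypes) causes effects in 'the world'
# (phenotypes), and those effects feed back into which code gets maintained/copied.

def phenotype(genotype):
    """The code's effect in the world: here, simply the number of 1-bits."""
    return sum(genotype)

def step(population):
    """One pass of the loop: effects in the world decide which code persists."""
    weights = [1 + phenotype(g) for g in population]  # effects feed back into the code
    return random.choices(population, weights=weights, k=len(population))

population = [[random.randint(0, 1) for _ in range(8)] for _ in range(100)]
for _ in range(50):
    population = step(population)

# The average phenotype drifts upward: the loop 'learned' without anyone directing it.
print(sum(phenotype(g) for g in population) / len(population))
```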
We’ll return to evolution later. Let’s pin down what we mean by AI:
A fully autonomous artificial intelligence consists of a set of code (for instance, binary charges) stored within an assembled substrate. It is ‘artificial’ in being assembled out of physically stable and compartmentalised parts (hardware) of a different chemical make-up than humans’ soft organic parts (wetware). It is ‘intelligent’ in its internal learning – it keeps receiving new code as inputs from the world, and keeps computing its code into new code. It is ‘fully autonomous’ in learning code that causes the perpetuation of its artificial existence in contact with the world, even without humans/organic life.
Of course, we can talk about other AI. Elsewhere, I discuss how static neural networks released by labs cause harms. But in this forum, people often discuss AI out of concern for the development of systems that automate all jobs and can cause human extinction. In that case, we are talking about fully autonomous AI. This term is long-winded, even when abbreviated to FAAI. Unlike the vaguer term ‘general AI’, it sets a floor on the generality of the system’s operations. How general? General enough to be fully autonomous.
Let’s add some distinctions:
FAAI learns explicitly, by its internal computation of inputs and existing code into new code. But given its evolutionary feedback loop with the external world, it also learns implicitly. Existing code that causes effects in the world which result in (combinations of) that code being maintained and/or increased ends up existing more. Where some code ends up existing more than other code, it has undergone selection. This process of code being selected for its effects thus implicitly learns what works better in the world.
Explicit learning is limited to computing virtualised code. But implicit learning is not limited to the code that can be computed. Any discrete configurations stored in the substrate can cause effects in the world, which may feed back into that code existing more. Evolution thus would select across all variants in the configurations of hardware.
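As a toy contrast between the two (all names and dynamics here are made up for illustration): an explicit update rule only changes the parameters it computes over, while selection also acts on stored configurations that no update rule ever reads.

```python
import copy
import random

class Unit:
    def __init__(self):
        self.explicit_param = 0.0                   # code the learner computes over
        self.hidden_config = random.uniform(-1, 1)  # stored configuration it never reads

def explicit_learning(unit):
    """Internal (explicit) learning: nudge the visible parameter toward a target."""
    unit.explicit_param += 0.1 * (1.0 - unit.explicit_param)

def implicit_selection(units):
    """External (implicit) learning: units whose hidden configuration happens to aid
    persistence get copied more, without anything computing over that configuration."""
    weights = [max(0.01, 1.0 + u.hidden_config) for u in units]
    survivors = random.choices(units, weights=weights, k=len(units))
    return [copy.copy(u) for u in survivors]  # copies carry the hidden config along

units = [Unit() for _ in range(200)]
for _ in range(100):
    for u in units:
        explicit_learning(u)
    units = implicit_selection(units)

# The hidden configurations drift positive: selected for, never explicitly learned.
print(sum(u.hidden_config for u in units) / len(units))
```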
So why would evolution occur?
Hardware parts wear out. So each has to be replaced every 𝑥 years for the FAAI to maintain itself. In order for the parts to be replaced, they have to be reproduced – through the interactions of those configured parts with all the other parts. Stored inside the reproducing parts are variants (some of which copy over fast, as virtualised code). Different variants function differently in interactions with encountered surroundings. As a result, some variants work better than others at maintaining and reproducing the hardware they’re nested inside, in contact with the rest of the world.
To argue against evolution, you have to assume that, of all the variants introduced into the FAAI over time, not one confers a ‘fitness advantage’ above zero at any time. Assuming zero deviation for each of quadrillions of variants is invalid in theory. In practice it is unsound too, since it implies that variants cannot be A/B-tested for what works, at a scale far beyond what engineers can manage. The assumption behind no evolution occurring is untenable, even in a much weakened form. Evolution would occur.
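For a rough sense of scale, here is a back-of-the-envelope sketch. It uses the classical approximation that a variant with selective advantage s fixates with probability of roughly 2s, and treats variants as independent; both the counts and the advantage below are assumptions.

```python
import math

s = 1e-4            # assumed tiny selective advantage of a beneficial variant
n_beneficial = 1e9  # assume only one in a million of ~1e15 variants is beneficial at all

# Log-probability that *none* of these variants ever spreads, if each fixates
# with probability of roughly 2s (Haldane's approximation).
log_p_none = n_beneficial * math.log1p(-2 * s)
print(log_p_none)   # around -200,000 in natural-log units: effectively zero probability
```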
Ⅱ. Evolution is not necessarily dumb or slow
Evolutions are slow. How slow? Suppose there's a beneficial mutation which conveys a fitness advantage of 3%: on average, bearers of this gene have 1.03 times as many children as non-bearers. Assuming that the mutation spreads at all, how long will it take to spread through the whole population? That depends on the population size. A gene conveying a 3% fitness advantage, spreading through a population of 100,000, would require an average of 768 generations…
Mutations can happen more than once, but in a population of a million with a copying fidelity of 10^-8 errors per base per generation, you may have to wait a hundred generations for another chance, and then it still has only a 6% chance of fixating.
Still, in the long run, an evolution has a good shot at getting there eventually.
— Yudkowsky, 2007
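For reference, the figures in the quote follow from standard population-genetics approximations: a sweep time of roughly (2/s)·ln(N) generations, and a fixation probability of roughly 2s for a new beneficial mutation. A quick check, assuming these are indeed the formulas behind the quoted numbers:

```python
import math

s = 0.03     # 3% fitness advantage
N = 100_000  # population size

generations_to_spread = (2 / s) * math.log(N)  # about 768 generations
fixation_probability = 2 * s                   # about a 6% chance of fixating

print(round(generations_to_spread), fixation_probability)
```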
In reasoning about the evolution of organic life, Eliezer simplified evolution down to mutations spreading vertically to the next generations. This is an oversimplification, and it produces even larger thinking errors when applied to the evolution of artificial life.
Crucially, both the artificial intelligence and the rest of the world would be causing changes to existing code, resulting in new code that in turn can be selected for. Through internal learning, FAAI would be introducing new variants of code into the codeset.
Evolution is the external complement to internal learning. One cannot be separated from the other. Code learned internally gets stored and/or copied along with other code. From there, wherever that code functions externally in new connections with other code to cause its own maintenance and/or increase, it gets selected for. This means that evolution keeps selecting for code that works across many contexts over time.
There is selection for code that causes its own robustness against mutations, its transfer into or reproduction with other code into a new codeset, or the survival of the assembly storing the codeset.
Correspondingly, there are three types of change possible to a codeset:
- Mutation to a single localised “point” of code is the smallest possible change.
- Survival selection by deletion of the entire codeset is the largest possible change.
- Receiving, removing, or altering subsets within the codeset covers all other changes.
These three types of change cover all the variation that can be introduced (or eliminated) through feedback with the world over time. A common mistake is to only focus on the extremes of the smallest and largest possible change – i.e. mutation and survival selection – and to miss all the other changes in between. This is the mistake that Eliezer made.
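Put as a toy sketch (the representation below is illustrative, nothing more), the three types of change are three different operators on a stored codeset:

```python
import random

def point_mutation(codeset):
    """Smallest change: flip one localised 'point' of code."""
    i = random.randrange(len(codeset))
    return codeset[:i] + [1 - codeset[i]] + codeset[i + 1:]

def survival_selection(population, codeset):
    """Largest change: the entire codeset is deleted from the population."""
    return [c for c in population if c is not codeset]

def alter_subset(codeset, donor):
    """Everything in between: receive, remove, or swap a whole subset of code
    (here: splice in a block from a same-length donor codeset)."""
    i, j = sorted(random.sample(range(len(codeset) + 1), 2))
    return codeset[:i] + donor[i:j] + codeset[j:]
```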
Evolution is not just a “stupid” process that selects for random microscopic mutations. Because randomly corrupting code is an inefficient pathway for finding code that works better, the evolution of organic life ends up exploring more efficient pathways.
Once there is evolution of artificial life, this exploration becomes much more directed. Within FAAI, code is constantly received and computed internally to cause further changes to the codeset. This is a non-random process for changing subsets of code, with new functionality in the world that can again be repurposed externally through evolutionary feedback. Evolution feeds off the learning inside FAAI, and since FAAI is by definition intelligent, evolution’s resulting exploration of pathways is not dumb either.
Nor is evolution always a “slow” process. Virtualised code can spread much faster, and at a lower copy error rate (e.g. as lightweight electrons shuttled across hardware parts), than code that requires physically moving atoms around (e.g. as configurations of DNA strands). Evolution is often seen as being about vertical transfers of code (from one physical generation to the next). Where code is instead transferred horizontally over existing hardware, evolution is not bottlenecked by the wait for a new assembly to be produced. Moreover, where individual hard parts of the assembly can be reproduced consistently, and connected up and/or replaced without the assembly failing to survive, even the non-virtualised code can spread faster than a human body’s configurations.
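A back-of-the-envelope comparison of the two timescales (every number below is an assumption made purely for illustration):

```python
import math

# Vertical spread, per the earlier quote: ~768 generations for a 3% advantage to
# sweep a population of 100,000. Assume roughly 20 years per organic generation.
vertical_years = 768 * 20

# Horizontal spread, assuming (purely for illustration) that a piece of virtualised
# code can copy itself to one more machine per hour, doubling its installed base.
copies_needed = 100_000
horizontal_hours = math.log2(copies_needed)

print(vertical_years, "years, versus about", round(horizontal_hours), "hours")
```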
Ⅲ. Learning is more fundamental than goals
An impossibility proof would have to say:
1. The AI cannot reproduce onto new hardware, or modify itself on current hardware, with knowable stability of the decision system and bounded low cumulative failure probability over many rounds of self-modification.
or
2. The AI's decision function (as it exists in abstract form across self-modifications) cannot be knowably stably bound with bounded low cumulative failure probability to programmer-targeted consequences as represented within the AI's changing, inductive world-model.
— Yudkowsky, 2006
When thinking about alignment, people often (but not always) start from the assumption that the AI has a stable goal and then optimises for that goal. The implication is that you could maybe code in a stable goal upfront that is aligned with goals expressed by humans.
However, this is a risky assumption to make. Fundamentally, we know that FAAI would be learning. But we cannot assume that this learning maintains and optimises the directivity of the FAAI’s effects towards a stable goal. One does not imply the other.
If we consider implicit learning through evolution, this assumption fails. Evolutionary feedback does not target a fixed outcome over time. It selects with complete coverage – from all of the changing code, for causing any effects that work.
Explicit learning can target a specific outcome. The internal processing of inputs through code into outputs can end up reaching a consistency with world effects that converge on a certain outcome in that world. But where the code implementing such a ‘goal’ fails at maintaining itself and its directivity alongside other evolving code variants, it ceases.
Unfortunately, variants spread by shifting existing functionality towards new ends. This raises the question of whether internal learning can implement enough control to stay locked onto the goal, preventing all the sideways pulls by externally selected variants.
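As a toy illustration of that tension (the dynamics and numbers are made up, not a model of any actual system): variants that keep spending effort on the original goal replicate slightly less, and occasional drift knocks the goal out, so the goal-keeping fraction erodes unless something actively corrects for it.

```python
import random

def simulate(rounds=200, pop=1000, goal_cost=0.02, drift_rate=0.001):
    """Fraction of replicators still implementing the original 'goal' over time.
    Keeping the goal costs a little replication rate; drift occasionally drops it."""
    keeps_goal = [True] * pop
    for _ in range(rounds):
        weights = [1.0 - goal_cost if k else 1.0 for k in keeps_goal]          # selection
        keeps_goal = random.choices(keeps_goal, weights=weights, k=pop)        # replication
        keeps_goal = [k and random.random() > drift_rate for k in keeps_goal]  # drift
    return sum(keeps_goal) / pop

print(simulate())  # tends far below 1.0: the goal erodes without active correction
```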
Ⅳ. There are fundamental limits to control
If something seems impossible… well, if you study it for a year or five, it may come to seem less impossible than in the moment of your snap initial judgment.
— Yudkowsky, 2008
The control problem has seemed impossible for decades. Alignment researchers have hypothesised many solutions, yet this has often resulted in the discovery of further sub-problems. Some sub-problems are crucial – they have to be solved in order to solve the overall problem. If we study it and find that even one crucial sub-problem is unsolvable, then the control problem too turns out to be unsolvable.
Let’s define the control problem comprehensively:
Can FAAI’s effects be controlled enough to not eventually cause human extinction?
A control algorithm would have to predict effects of code that raise the risk of extinction, in order to correct that code and/or its effects. Since the code and world are changing, the controller has to keep learning from both in order to predict their combined effects.
This results in a series of unsolvable problems:
- The control algorithm would presumably be computed through an artificial substrate, in order to keep up with the computation of the AI (given that standardised hardware parts can process code more consistently and faster than nonstandardised wetware). This overall controller has to continue to learn, and therefore has to somehow perpetuate its artificial existence. As a result, the controller would have to either become an FAAI or merge with an existing FAAI. But the new FAAI would also have to be controlled so as to not cause human extinction. This requires another controller, a solution that leads to infinite regress.
- The controller has to explicitly learn to predict the FAAI’s effects, whether or not the controller is part of that FAAI. But the FAAI not only learns explicitly, by computing virtualised code. The FAAI also learns implicitly to cause effects, by the evolution of all the discrete configurations of its substrate in feedback with the world. Therefore, there is a strict inequality between what effects the FAAI can learn to cause in the world and what effects the control algorithm can learn to predict (let alone correct). This is so even under the unrealistic assumption that the controller has access to the same level of computation and storage as the FAAI.
- Learning happens through recursive feedback with the world. Both the FAAI’s explicit learning from inputs and the implicit evolution of its existing code happen through incomputable feedback loops. As a result, the controller cannot compute ahead of time what code would be learned and what effects that code would cause.
  Inputs from the world cannot be pre-computed, else the FAAI would not have to keep learning from the world. The FAAI learns explicitly by computing new inputs and existing code into new code. Then, the new code and next inputs are computed into outputs that propagate as effects over the world, which feed back into later inputs.
  Existing code includes all configurations stored in the hardware, the variation of which mostly cannot be detected and modelled by the overlaying computed code. But this untracked code still gets implicitly selected for, by the previously stored code’s evolutionary feedback with the world.
  In both cases, the feedback effects are incomputable. The controller cannot detect the FAAI, let alone the world, at microscopic granularity (given input bandwidth limits), nor model the world comprehensively (given storage and compressibility limits).
  Based on its stored model of the world, the controller then fails to predict effects. Deterministic computations are limited in simulating effects propagating simultaneously over the world, and cannot predict how microscopic noise amplifies chaotically into larger effects (see the sketch after this list). How outputs feed into effects out there, and back into changes to the FAAI’s code, is indeterministic. Even the FAAI’s own computation of sensor-input changes into new code and outputs is to an extent computationally irreducible, and as such cannot be predicted through earlier computation by the controller.
- So the controller fails to predict the code and effects that the FAAI learns through incomputable/implicit recursive feedback with the world. It therefore also fails to compare and correct the FAAI’s unpredicted effects in line with human survival.
- FAAI code evolves to cause new effects that result in that code’s maintenance and increase as configurations of an artificial substrate. Humans have a different nested substrate, and thus need different (highly specific, contextualised) effects to survive.
  FAAI needs different atoms and greater changes in enthalpy, to get the high binding energies needed to assemble its hard parts. The hard FAAI can also survive a wider range of temperatures and pressures than soft humans can. There is mostly no overlap between the effects FAAI can evolve to cause and the effects humans can survive.
- Any human-lethal effects that the controller does correct no longer maintain or increase the FAAI code. Evolution thus selects for the human-lethal effects that cannot be corrected.
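To illustrate the prediction limit raised in the third point of the list above (using the standard logistic-map example rather than any model of an actual FAAI): two states that differ by less than any sensor could resolve diverge completely within a few dozen update steps.

```python
def logistic(x, r=4.0):
    """One update step of a simple deterministic but chaotic system."""
    return r * x * (1 - x)

a, b = 0.400000000000, 0.400000000001  # differ by 1e-12, below any detection threshold
for _ in range(50):
    a, b = logistic(a), logistic(b)

# The tiny initial difference has been amplified by many orders of magnitude:
# prediction from the measured state has broken down.
print(abs(a - b))
```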
Thanks to Vojta Kovarik, Finn van der Velde, and Forrest Landry for thoughtful feedback.
Note: I started writing this post here. After people commented, I overhauled the text until it became a new post (and for some reason, the editor stopped working in the original post).