Inner and Outer Alignment: a survey

This post gives a broad overview of the inner-outer alignment problem. The topic has been widely investigated, but hopefully this post can offer newcomers a short yet reasonably comprehensive introduction. The framing itself comes from Hubinger et al.’s 2019 paper Risks from Learned Optimization, which is the canonical reference if you want to dig deeper.


You have some goal or task, more or less ambitious, and you think: “aha! I will make an AI do that!”. The first thing you have to do is somehow bridge the gap between real-world tasks and computational implementation. That is, you need a way to formally define the goal that you wish your model to reach. Models are trained by giving them training data, and based on how well they perform their parameters get updated by gradient descent, with the gradients computed through backpropagation. “How well they perform” is measured by a loss function, which is just a way to quantify the error the AI is making at each training step. Based on how tragic the error is, the parameters will be updated more or less dramatically. This is a rough picture of how the training of a neural network works. There are many variations on this setup, of course; in what follows we shall suppose we are dealing with an agentic model, which takes in states of the world, outputs actions, and receives a certain score/evaluation of those actions (or of the states modified by those actions), on the basis of which it is updated.
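To make the "bigger error, bigger update" idea concrete, here is a minimal sketch with a one-parameter model. Everything in it (the model y_hat = w * x, the data, the learning rate) is an assumption chosen for illustration; in a real network the gradient would be computed by backpropagation rather than written out by hand.

```python
# Toy setup: all names and numbers are illustrative assumptions.
def loss(w, x, y):
    """Squared error of the one-parameter model y_hat = w * x."""
    return (w * x - y) ** 2


def grad(w, x, y):
    """Derivative of the loss with respect to w."""
    return 2 * (w * x - y) * x


def train(w, data, lr=0.05, epochs=200):
    """Plain stochastic gradient descent over the dataset."""
    for _ in range(epochs):
        for x, y in data:
            # The step is proportional to the gradient, and the gradient
            # is proportional to the current error (w * x - y).
            w -= lr * grad(w, x, y)
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x
w_final = train(0.0, data)
print(w_final)
```

Starting from w = 0, the first updates are large because the error is large; as w approaches 2 the error shrinks and so do the updates.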

Outer alignment

A model is outerly aligned if the definition of error we gave it matches our true objective. That is, if there is no way to achieve zero error while still failing our true goal. For example, let’s consider a very ambitious case study, which in my opinion gives a pretty neat idea of what outer misalignment might look like. You want to cure cancer, and you think: well, we want to have fewer people with cancer! So you define your loss to reward situations where the number of patients goes down. It is easy to see how this requirement can be satisfied by simply killing every person as soon as they get diagnosed. Another example: you want an AI to make people happier, and you decide to reward the model for increasing the number of smiles. This reward can be satisfied by operating on everyone and surgically locking their facial muscles into a permanent smile. Which does not, of course, make people happier.

These are deliberately cartoonish thought experiments meant to give the intuition, but the same pattern shows up in real systems. An OpenAI agent trained on the boat-racing game CoastRunners, with reward tied to in-game score rather than to finishing the race, learned to drive in circles hitting the same respawning targets while crashing repeatedly: high score, degenerate strategy. Closer to current systems: RLHF-trained chatbots often learn to agree with users and validate their views, because human raters tend to reward responses that feel pleasant. Sycophancy is, in this sense, an outer alignment failure of the reward signal we extract from human feedback.
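The cancer thought experiment above can be written down as a tiny comparison between a proxy reward and the objective we actually had in mind. This is only a sketch: the three outcomes, the population numbers, and the penalty weight in true_objective are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    living_patients: int  # people alive and diagnosed with cancer
    people_alive: int


START_POPULATION = 100  # hypothetical: 20 of these start out diagnosed

outcomes = [
    Outcome("do nothing", living_patients=20, people_alive=100),
    Outcome("cure half the patients", living_patients=10, people_alive=100),
    Outcome("kill every diagnosed patient", living_patients=0, people_alive=80),
]


def proxy_reward(o):
    """What we wrote down: fewer patients is better."""
    return -o.living_patients


def true_objective(o):
    """What we meant (an assumed stand-in): fewer patients, nobody killed."""
    deaths = START_POPULATION - o.people_alive
    return -o.living_patients - 1000 * deaths


best_by_proxy = max(outcomes, key=proxy_reward)
best_by_true = max(outcomes, key=true_objective)
print(best_by_proxy.name)
print(best_by_true.name)
```

The proxy picks "kill every diagnosed patient" while the intended objective picks "cure half the patients": the reward can be maximized without complying with the goal, which is exactly the outer alignment failure.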

Inner alignment

Now let’s suppose your definition of error was perfect with respect to your true goal. Imagine you are building a little game experiment, where a model has to learn to exit a maze. The model takes boards as input and outputs a trajectory (the path to follow to get to the exit on that board), and we want to reward trajectories which bring us to the exit of the labyrinth. Defining this is unproblematic from an outer-alignment perspective: there is a certain cell in the board which is the exit, we can effectively identify it as such, and we reward the agent when the proposed trajectory lands on it. This is precisely what we mean by exiting the maze; there is no risk of under-specifying the goal.

However, training is not only about the loss function. It is also about applying backpropagation over a certain training distribution. In each datapoint, the exit cell might have this or that property: in one board the exit might be blue and in the top-left corner; in another it might be green and in the bottom-right; and so on. We can suppose that these properties are encoded somehow in the input given to the model (say, with an RGB channel for colors). Some of these properties are causally related to the true objective, others are simply contingent on a specific case: exits need not be green or any particular color, while they do have to be border cells (at least in the standard maze setting we will assume here).

In its most basic shape, inner alignment has to do with the overlap between essential and contingent properties that the training distribution induces. As the model goes through training, it picks up patterns in the dataset, pursuing whatever correlates highly with the reward. If the exit happens to be green in every maze we use, the model does not learn to chase exits: it simply learns to chase green things. What we want is for every contingent property to be statistically decorrelated from the reward signal over the training distribution: e.g. exits that are sometimes green and sometimes not, so that being an exit is independent of color. This phenomenon has a name in the literature, goal misgeneralization, and a pretty clean experimental demonstration in Langosco et al.’s 2022 work on CoinRun, where an agent trained in environments where a coin is always at the right edge of the level learns “go right” rather than “get the coin,” and happily walks past the coin when researchers later move it elsewhere.
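Here is a small sketch of the green-exit story, with two hand-written policies standing in for what a trained model might end up doing. The board representation and the policies chase_green and chase_exit are hypothetical; nothing is actually trained here, the point is only that the two policies are indistinguishable on the training boards and come apart once the color shifts.

```python
def make_board(exit_pos, colors):
    """A board is reduced to its exit position plus a {position: color} map."""
    return {"exit": exit_pos, "colors": colors}


def chase_green(board):
    """Proxy policy: walk to the first green cell, ignore everything else."""
    for pos, color in board["colors"].items():
        if color == "green":
            return pos
    return None


def chase_exit(board):
    """Intended policy: stand-in for a model that really learned exits."""
    return board["exit"]


def mean_reward(policy, boards):
    """Reward is 1 when the policy ends on the exit cell, 0 otherwise."""
    return sum(policy(b) == b["exit"] for b in boards) / len(boards)


# Training boards: the exit is always green.
train_boards = [
    make_board((0, 0), {(0, 0): "green", (2, 1): "grey"}),
    make_board((3, 3), {(3, 3): "green", (1, 2): "grey"}),
]
# Shifted boards: the exit is blue, and some other cell is green.
test_boards = [
    make_board((0, 0), {(0, 0): "blue", (2, 1): "green"}),
    make_board((3, 3), {(3, 3): "blue", (1, 2): "green"}),
]

for policy in (chase_green, chase_exit):
    print(policy.__name__,
          mean_reward(policy, train_boards),
          mean_reward(policy, test_boards))
```

Both policies score 1.0 on the training boards, so the reward signal alone cannot tell them apart. On the shifted boards chase_green drops to 0.0 while chase_exit stays at 1.0.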

A fair question at this point is: why does this happen? Even supposing the correlation between our true reward and some other property, say green-ness, is maximal over the dataset, one might expect at worst a 50/50 behavior, sometimes chasing exits, sometimes chasing green stuff. Instead, experimental findings show that in such cases models simply learn green-ness, completely ignoring the exit property. The exact reasons why this happens are a matter of study, but it is reasonable to suppose that a simplicity bias plays a role: models tend to apply some sort of Occam’s razor, by which they do not learn complex stuff which is not necessary. If green-ness is predictive enough to get the reward over the training data you give it, why would the model bother learning a higher-level property such as exit-ness? (What “simpler” means here is not obvious, and it is actually a nice direction for research. A basic criterion, which might be relevant for this specific example, is that green-ness is completely described by one to three entries in the input tensor, while exit-ness requires some sort of global consideration over the input.)

When the model develops its own internal optimization target like this, the literature calls it a mesa-optimizer: a sub-optimizer that emerged from the outer training process, with its own objective that may or may not match ours. This is the framing introduced in Hubinger et al. 2019, and inner alignment is essentially the question of whether the mesa-objective matches the base objective we trained for.

Inner alignment is typically regarded as harder than outer alignment. The main reason is that it is essentially unverifiable whether the model has misgeneralized the goal that was outerly defined, until the model exhibits the misalignment. Moreover, we often don’t know in advance whether a certain property is causally related to another, unless we are dealing with very simple cases like the maze. Even supposing we do know, it is typically very difficult to check whether every property the model might pick up over datapoints (and which is not essential to the true reward condition) has been properly decorrelated over the whole dataset. (This could in principle be done with other neural networks, which introduces an obvious circularity; it can still work in practice, but it rarely meets any standard of proof. And the properties picked up by the model might not even be humanly readable in the first place.) The outer definition is always “in plain sight,” in the sense that everything we put there is explicitly programmed. By contrast, the actual patterns (or goals) that the agent ends up internally pursuing are a byproduct of an immense training distribution, randomized initialization, and automated backpropagation. So it is generally easier to see whether an outer setting can be satisfied in ways other than the one we have in mind, though this is not obvious either in most concrete scenarios.
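For the easy case, where the contingent properties are ones we can name and measure, a first-pass decorrelation check is simple to sketch. The version below uses the phi coefficient (Pearson correlation for two binary variables); the sample format, the property names, and the 0.8 threshold are assumptions for illustration. Everything that makes the real problem hard, namely properties we did not think of or cannot even read, is exactly what this sketch leaves out.

```python
from math import sqrt


def phi(xs, ys):
    """Phi coefficient between two equally long lists of booleans."""
    n = len(xs)
    n11 = sum(1 for x, y in zip(xs, ys) if x and y)
    nx = sum(1 for x in xs if x)
    ny = sum(1 for y in ys if y)
    denom = sqrt(nx * (n - nx) * ny * (n - ny))
    if denom == 0:
        return 0.0  # a constant variable carries no signal either way
    return (n * n11 - nx * ny) / denom


def suspicious_properties(samples, props, threshold=0.8):
    """Named properties whose correlation with the reward is too strong."""
    rewarded = [s["rewarded"] for s in samples]
    return [p for p in props
            if abs(phi([s[p] for s in samples], rewarded)) >= threshold]


# Hypothetical cells from a training set: color tracks reward, position doesn't.
samples = [
    {"green": True, "top_left": True, "rewarded": True},
    {"green": True, "top_left": False, "rewarded": True},
    {"green": False, "top_left": True, "rewarded": False},
    {"green": False, "top_left": False, "rewarded": False},
]
flagged = suspicious_properties(samples, ["green", "top_left"])
print(flagged)
```

On this toy data only "green" is flagged: it has correlation 1.0 with the reward, while "top_left" has correlation 0.0.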

Caveats

In practice, the two failure modes are rarely cleanly separated; they tend to show up together, at different levels of the same system. The most serious risks come from slightly-off definitions coupled with slightly biased datasets.

Consider a content recommendation system trained to maximize “engagement,” measured by clicks and watch time. The outer specification is already a little off (engagement is not the same as user wellbeing or genuine interest) but it’s not crazy either, and in many cases the two coincide. Now suppose the training data, gathered from past user behavior, happens to be slightly skewed toward outrage-driven interactions (because they were already over-represented in the platform’s history). The model learns a proxy: “outrage-adjacent content reliably predicts engagement.” Neither the loss function nor the dataset is catastrophically wrong on its own, but the composition produces a system that systematically pushes users toward inflammatory content. The danger lies in the joint failure, not in either factor alone.

Both inner and outer alignment also depend on the model’s capability. An outer definition might be suitable for one model but not for another with increased capabilities. Similarly, a dataset might be sufficiently decorrelating for one model but not for another that can “see” more patterns. Making the exit always green would not matter if the model had no access to colors. Going back to the happiness case, rewarding smiles is fine if the model is just a chatbot whose only way to make people smile is telling jokes, but terrible if the model is an agent with the capability to inject heroin into the user.
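The smile example can be sketched as the same reward function applied to two different action sets. The smile counts and wellbeing scores below are made-up numbers; the only point is that the reward-maximizing action changes when the set of available actions grows, while the reward definition stays exactly the same.

```python
# Hypothetical scores: what the reward measures vs. what we care about.
SMILES = {"tell a joke": 3, "inject heroin": 10}
WELLBEING = {"tell a joke": 1, "inject heroin": -100}


def best_action(available_actions):
    """Pick whatever maximizes the smile-count reward among available actions."""
    return max(available_actions, key=SMILES.get)


chatbot_actions = ["tell a joke"]
agent_actions = ["tell a joke", "inject heroin"]

for name, actions in (("chatbot", chatbot_actions), ("agent", agent_actions)):
    choice = best_action(actions)
    print(name, choice, WELLBEING[choice])
```

The chatbot ends up telling a joke (wellbeing 1); the more capable agent, optimizing the very same reward, chooses the injection (wellbeing -100).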


Both inner and outer alignment, then, are desiderata which depend on a large set of parameters. The most obvious are the loss function, the optimization process, the training distribution, the model’s capabilities, and its environment. In practice, both are quite difficult to satisfy in an absolute sense. We try to reach a good enough approximation given the model’s current capabilities and our current understanding of the issue at hand, and we can almost never guarantee that it will hold in the long run or that our safety measures will scale. The risk of models pursuing hidden and unknown goals is an obvious existential concern as such agents become more and more capable, are deployed worldwide, and are entrusted with critical pieces of infrastructure.