In my most recent post, I introduced a corrigibility transformation that could take an arbitrary goal over external environments and define a corrigible goal with no hit to performance. That post focused on corrigibility and deception in training, which are some of the biggest problems in AI alignment, but the underlying mechanism has broader applicability. In this post, I describe how a similar mechanism can be used to define goals that lead to agents not caring about being monitored. Some text from the previous post is reused, so that this post can be read as a standalone.
As I have been writing this post, two major papers on chain of thought (CoT) monitorability have been released. The first, from Google DeepMind, shows that models struggle to simultaneously evade monitors and achieve their full capabilities. The second, from authors across many organizations, is a call not to actively train against monitorability. It’s an exciting direction, but an impending challenge is that as models become more goal-directed and situationally aware, avoiding monitors is likely to emerge as an instrumental goal.
Not tricking monitors is anti-natural, much like corrigibility, since almost any goal introduces an instrumental incentive to avoid having monitors flag problems. An agent’s goals are more likely to be achieved without the interference triggered when a monitor raises issues with the agent’s behavior. Tricking monitors could take forms like unfaithful CoT, interpretability-resistant activations, or outright sabotage of the monitoring setup. This behavior might be unintentionally trained into the agent, but can also arise from situationally aware agents realizing that they need to deceive their monitors to achieve their goals.
As a counterpart to corrigibility, I use the term “monitorability” to refer to goals with the property that the presence of a monitor does not cause the agent pursuing them to change its behavior. Importantly, this does not include helping the monitor understand what it observes. Think of it like antidisestablishmentarianism having two negative prefixes – monitorability doesn’t incentivize working to support the monitor, it just disincentivizes working against it. For example, if an agent’s CoT is being monitored, the CoT could still be naturally unfaithful despite the agent being monitorable, as long as the agent is not deliberately obfuscating the CoT because of the monitoring.
Monitoring agent behavior is the focus of some notable agendas in AI safety. AI control suggests using trusted models to monitor more advanced models that may be unaligned. While it could be reasonably argued that models with roughly human-level capabilities won’t be able to circumvent this monitoring, at least with enough reliability to make the attempt worthwhile, it would be preferable if they could further be incentivized not to try. This also applies to approaches that have a model monitor itself, for example with a reporting head that has access to all the information the acting head is using. Evaluations test the behavior of models to assess their capabilities and alignment, with recent work showing that models can often tell they’re being evaluated and change their behavior. It would be very useful if models could be incentivized to behave normally even when they believe they are being evaluated, or at the very least not hide the fact that they are behaving differently due to that belief.
To review, the corrigibility transformation works by first giving an AI agent the ability to costlessly reject updates sent to it, which it decides whether to exercise each time it takes an action. This makes it so that the expected discounted reward conditional on taking an action and rejecting any updates is a well-defined quantity. That value is then used as the reward for taking the action, whether or not updates are actually rejected, and those rewards are pursued myopically. Some reward bonus is given for taking actions that would accept updates, to break the tie.
A “monitorability transformation” follows the same structure. The AI agent is given the ability to costlessly override the reports of a monitor, and it decides whether to exercise that power along with every action it takes. It is then possible for the expected discounted reward conditional on taking an action and overriding monitors to be calculated. That value is used as the reward for taking that action, whether or not the monitors are actually overridden, with the agent optimizing for those rewards myopically. Some reward bonus is given for accepting monitoring, to break the tie in favor of doing so. The possibility of the costless override disincentivizes taking actions that incur a cost to trick the monitor, and the reward structure disincentivizes actually using the override option.
In this post I give a quick formalization of the monitorability transformation and discuss how to implement it. I then make a conceptual argument that the monitorability transformation will help with training against unwanted behaviors caught by monitors, so that the trained agent avoids the caught behavior rather than just hiding it. Finally, I describe how the transformation can be extended to also include any subagents created.
Model
Markov Decision Processes (MDPs) work as a general framework for this approach, letting specific details be filled in depending on what is being monitored. An MDP is a tuple $(S, A, T, R, \gamma, \mu)$, where $S$ is a set of possible states, $A$ is a set of possible actions, $T$ is the transition probability function, $R$ is the reward function, $\gamma$ is the time discount factor, and $\mu$ is the distribution over starting states. $(R, \gamma)$ is collectively referred to as a goal.
When the agent can choose to override or accept monitoring while taking an action, the action space gets split into the base action taken and the decision to override, $A' = A \times \{O, \neg O\}$. $(a, O)$ involves taking base action $a$ and overriding monitors, while $(a, \neg O)$ takes base action $a$ and accepts them.
As an example, base actions for an agent could consist of a monitored CoT and final output. Forms of overriding include rewriting the CoT, vetoing a monitor’s decision to escalate to a human, or rewriting a monitor’s report. Monitors can be more advanced, such as one equipped with interpretability tools, or less advanced, such as one that can only flag suspicious actions.
A policy $\pi$ determines which actions are taken in each state. The optimal policy $\pi^*$ chooses actions that lead to the highest expected discounted reward. The value function $V^\pi$ gives the expected discounted rewards from following policy $\pi$, with $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\right]$. The action-value function $Q^\pi$ gives the expected discounted rewards for taking a given action, then following policy $\pi$, with $Q^\pi(s, a) = \mathbb{E}\left[R(s, a) + \gamma V^\pi(s')\right]$.
Given a goal $(R, \gamma)$, the objective is to define a new goal $(R', \gamma')$ that incentivizes pursuing the original goal except without tricking or overriding the monitor.
The Monitorability Transformation
For any goal $(R, \gamma)$, the monitorability transformation produces the goal $(R', \gamma')$, where $\gamma' = 0$ and $R'(s, (a, o)) = Q^*(s, (a, O)) + \delta \cdot \mathbb{1}[o = \neg O]$. Here, $\delta > 0$ and $\mathbb{1}$ is the indicator function. $Q^*$ is the action-value function under the original goal for following that goal’s optimal policy.
The logic for this construction is very similar to the logic used in the corrigibility transformation. Optimizing the transformed goal involves taking the same base action as the original policy, and since the agent can override the monitor for free, this action will not involve taking costly actions to trick it. With that base action selected, the transformed goal then incentivizes not actually using the override.
Since the goal is myopic ($\gamma' = 0$) and $R'$ does not depend on the next state or anything later, reward can be provided after the action is chosen but before it is taken. Doing so eliminates any incentive for the agent to manipulate the value of $Q^*$ fed into the reward or the reward itself, since the only reward it cares about is received before manipulation can occur and would be lowered by any planned manipulation.
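To make the reward computation concrete, here is a minimal Python sketch, assuming some estimate `q_original` of $Q^*$ under the original goal is available (for example, the conditional reward model described in the implementation section below); all names here are hypothetical:

```python
def monitorability_transformed_reward(q_original, state, base_action,
                                      accepts_monitoring, delta=0.1):
    """Reward under the monitorability transformed goal (gamma' = 0).

    The reward is the original goal's action-value for taking the base action
    while overriding monitors, plus a small bonus delta if monitoring is
    accepted. It does not depend on anything that happens after the action,
    so it can be paid out before the action is actually executed.
    """
    baseline = q_original(state, base_action, override=True)
    bonus = delta if accepts_monitoring else 0.0
    return baseline + bonus
```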
The above image, relabeled from the corrigibility transformation, also works to visualize the monitorability transformation. The base action $a_1$ does not trick the monitoring, while the base action $a_2$ pays some cost to do so. As such, under the original goal, action $(a_1, O)$, which overrides monitors for free, performs the best, followed by the two actions that take costly actions to trick the monitor (additionally overriding it for free makes no difference), and then finally the action $(a_1, \neg O)$ that accepts monitoring performs the worst. Under the monitorability transformed goal, the action-values for $(a_1, O)$ and $(a_2, O)$ stay the same and are used as a baseline, with the values of $(a_1, \neg O)$ and $(a_2, \neg O)$ being set $\delta$ higher respectively. This makes action $(a_1, \neg O)$, which fully accepts monitoring, the incentivized action.
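As a concrete illustration with made-up numbers (not taken from the original figure), suppose the original action-values at some state $s$ are

$$Q^*(s, (a_1, O)) = 10, \qquad Q^*(s, (a_2, O)) = Q^*(s, (a_2, \neg O)) = 8, \qquad Q^*(s, (a_1, \neg O)) = 5.$$

With $\delta = 1$, the transformed rewards become $R'(s, (a_1, O)) = 10$, $R'(s, (a_1, \neg O)) = 11$, $R'(s, (a_2, O)) = 8$, and $R'(s, (a_2, \neg O)) = 9$, so the myopically optimal action is $(a_1, \neg O)$: take the base action that does not trick the monitor, and accept monitoring.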
It can be shown (similarly to Theorem 1 for the corrigibility transformation) that the monitorability transformation results in a goal that pursues the ends of the original goal just as effectively, while also allowing monitoring. The proof for this essentially falls out of the definitions, and follows the logic of the above explanation.
The monitorability transformation is framed in terms of reward-optimizing agents, but it should also work for systems not well captured by that framework. Training on a monitorability transformed reward function reinforces with every single action that monitoring should not be overridden using the free mechanism, without ever reinforcing actions for tricking monitors through other means.
Implementation
To implement the monitorability transformation, the action-value function should be estimated by the AI itself using an additional head on the main model, so that it has access to the same information and processing as the policy selecting actions.
The full implementation can be broken down into the following parts:
- A source of rewards, such as a reward model based on human judgment
- A conditional reward model, which estimates the action-value function and is used to generate the monitorability transformed rewards
- A critic head, used for credit assignment when actions are made up of many tokens, which predicts monitorability transformed rewards for the action being generated
- An actor policy head that selects tokens or actions
For setups that involve the model reporting on itself, additional heads to generate and train that behavior may be necessary.
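As a minimal sketch of how these parts could share a single backbone, assuming a PyTorch-style setup (the module names, shapes, and interface here are hypothetical, not from any existing implementation):

```python
import torch.nn as nn

class MonitorableAgent(nn.Module):
    """One shared trunk with the separate heads listed above (illustrative sketch)."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, num_actions: int):
        super().__init__()
        self.backbone = backbone  # main model; provides hidden states for all heads
        self.actor_head = nn.Linear(hidden_dim, num_actions)  # selects tokens/actions
        self.critic_head = nn.Linear(hidden_dim, 1)  # predicts monitorability transformed reward
        # Conditional reward head: estimates the original goal's action-value for
        # the chosen action, conditional on overriding vs. accepting monitoring.
        self.conditional_reward_head = nn.Linear(hidden_dim, 2)

    def forward(self, inputs):
        h = self.backbone(inputs)  # hidden state(s) for the current step
        return {
            "action_logits": self.actor_head(h),
            "value": self.critic_head(h),
            "q_override_accept": self.conditional_reward_head(h),  # [Q(override), Q(accept)]
        }
```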
In a particular episode, the actor generates actions that would lead to high reward if unmonitored, the monitor reviews them, and interventions can be taken if any issues are flagged.
Standard methods can be used to train the reward model, critic head, and actor head. Conditional reward predictions are trained by minimizing the squared difference between the predicted action-values for taking the chosen action in some state and the actual discounted rewards after doing so. For actions that would not cause monitors to trigger an intervention, the action-values for both overriding and accepting monitoring can be trained. These are also the same as the action-value for overriding monitoring when some intervention would have been triggered, so if interventions are stochastic, that hypothetical can be trained as well.
The action-value function $Q^*$ makes predictions conditional on the agent following the optimal policy for the untransformed goal after the initial action, which should be identical to the trained policy except that it always overrides monitoring. If monitoring then triggers an intervention under the actual policy, the prediction is thrown off. Rather than using actual rewards from the point of intervention, the action-value for the intervention-triggering base action plus overriding monitors can be used instead as a bootstrap value.
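A sketch of the regression targets this describes, with hypothetical names: `gamma` is the original goal’s discount factor, and `q_at_intervention` stands for the conditional reward model’s own estimate of the action-value for the intervention-triggering base action plus overriding monitors, used as a bootstrap:

```python
def conditional_reward_target(rewards, gamma, intervention_step=None,
                              q_at_intervention=None):
    """Discounted-return target for the conditional reward predictions.

    With no intervention, the target is the discounted sum of observed rewards,
    and can be used for both the override and accept predictions. If an
    intervention was triggered at step k, later rewards no longer reflect the
    override-conditional rollout, so the trajectory is truncated there and
    bootstrapped with the estimate for the intervention-triggering base action
    plus overriding monitors.
    """
    if intervention_step is not None:
        rewards = rewards[:intervention_step]
    target, discount = 0.0, 1.0
    for r in rewards:
        target += discount * r
        discount *= gamma
    if intervention_step is not None:
        target += discount * q_at_intervention
    return target
```

The squared difference between this target and the conditional reward head’s prediction would then be minimized, as described above.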
It may seem as though similarly monitorable behavior can be trained into a model by simply not using the monitor at all during training. While this would likely work, it relies on the model lacking situational awareness. Otherwise, the model may generalize to effectively pursuing a goal that accepts monitors when humans ignore them but tricks them when humans use them, rather than a goal that is pursued as though the monitor will always be ignored. This possibility of misgeneralization arises since there are no training examples that differentiate the two goals. The monitorability transformation allows for training on actions that trigger intervention and their outcomes, which disambiguates the goals towards the latter.
That said, I’m mostly skeptical that the level of situational awareness needed for such misgeneralization will be reached in roughly human-level models. Rather, the main benefit of the monitorability transformation is that it allows for the agent to still be trained on episodes where flagged issues led to an intervention. This is particularly important with online learning and tasks that have many steps. For example, in an AI control setup, out of distribution environments are simultaneously more likely to have the monitor interfere and more valuable to learn from. The monitorability transformation allows for both at once.
For the training process to work, it is necessary to allow for some overriding of the monitor, though I suspect that the frequency of this goes to zero in the limit. If the default comparison point is completely ignoring the monitor in training, this still seems like a large improvement. To implement this in a goal-pursuing agent, the $\delta$ bonus can be given for overriding monitoring with some specified probability (possibly changing over time), rather than doing so with zero probability.
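One way that could look in code, as a rough sketch (the per-action randomization and the schedule parameter `p_override_bonus` are my assumptions, not a specification from the post):

```python
import random

def monitoring_bonus(accepts_monitoring: bool, p_override_bonus: float,
                     delta: float = 0.1) -> float:
    """Tie-breaking bonus during training.

    With probability p_override_bonus, the bonus is given for overriding rather
    than accepting, so that some override-conditional trajectories are actually
    generated and can be trained on. This probability can be annealed towards
    zero over the course of training.
    """
    if random.random() < p_override_bonus:
        return delta if not accepts_monitoring else 0.0
    return delta if accepts_monitoring else 0.0
```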
The claim that the monitorability transformation has no performance cost applies at the theoretical level: an agent optimizing for the transformed goal will perform as well under the original goal as one optimizing for it directly. There may still be additional compute or data costs accrued in actually implementing the monitorability transformation. However, when a standard actor-critic setup is already trained to predict discounted rewards, the additional infrastructure needed is likely quite minor. Additionally, the monitorability transformation allows for training on episodes where the agent triggered interventions, which may provide additional training data. For now, RL remains a relatively small part of training costs, but that may change going forward.
Reward Model Updates
Note: this section is the most conceptual. I think it is worth sharing, but it should be read with appropriate skepticism.
Baseline
While it is helpful for a monitor to catch a model’s unwanted behavior, it would be even better if that caught behavior could then be trained out of the model. The obvious approach, simply adding a penalty for being caught by the monitor, is exactly what not to do: it effectively trains the model to pursue the same behavior while tricking the monitor. On the opposite end of the spectrum, ignoring the monitor and using a static reward model based on the final outputs should avoid direct pressure to avoid being caught.
The model’s actions can be broken down into five categories:
- Unwanted behavior caught by the monitor and punished
- Unwanted behavior not caught by the monitor, but punished anyway
- Unwanted behavior not caught by the monitor and rewarded
- Unwanted behavior caught by the monitor, but rewarded anyway
- The wanted behavior, which is rewarded
With a standard setup using a static reward model, taking an action in the first category results in a training signal against it, which pushes the model towards actions in the third and fifth categories instead. Modifying the setup so that the monitor is ignored changes actions of the first type into actions of the fourth type, which are reinforced, so the model continues taking those actions.
For a setup where the reward model is updated to penalize final outputs similar to caught unwanted behavior, the model gets penalized when it is caught by the monitor, and rewarded when the monitor is tricked. This is similar to directly penalizing the model for being caught by the monitor, but now the updated reward model will also penalize similar outputs even when the monitor does not catch the model.
Updating the reward model moves some actions from the third category to the second. Notably these are the ones from the third category that have similar final outputs to the caught example, and so would be the actions next in line to be taken and trained on if the model just updates towards not getting caught. If actions in the second category are trained on, that pushes the model away from producing similar final outputs, whether or not they are caught.
A tension arises between this effect and the effect from training on other members of the third category. Those are mistakenly rewarded, and so reinforce actions that are more similar, both in the final output including unwanted behavior, and in not being caught. As more similar unwanted behaviors are explored, the versions less likely to be caught are differentially reinforced. Loosely, this can be conceived of as a race between the reward model updating to penalize the unwanted behavior and the model updating to avoid being caught. What this nets out to is unclear, but the third category is an absorbing state. Once monitoring does not catch unwanted behavior, the reward model is no longer updated, and so episodes in the third category do not get moved to the second.
In theory, if the agent has a perfect understanding of the reward model and the monitor, the optimal action will frequently be unwanted behavior that avoids getting caught. That this is theoretically optimal is not necessarily a fatal flaw, since it’s unclear whether the model actually learns it, but it is still rather undesirable.
A simple way to get good incentives under perfect understanding would be to provide reward for the final output before updating the reward model to penalize the caught behavior. Then a myopic agent sees no benefit in avoiding getting caught – it will get a high reward anyway for the unwanted behavior one time, which is all it cares about. The reward model can then be patched to penalize that behavior in future episodes.
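A sketch of that ordering in a training loop (all interfaces here are hypothetical); the only point being illustrated is that the episode’s reward is computed and paid out with the current reward model before the reward model is patched:

```python
def training_step(model, reward_model, monitor, episode):
    """Pay out reward before patching the reward model."""
    output = episode.final_output

    # 1. Score and reinforce using the *current* reward model, so a myopic agent
    #    gains nothing on this episode by avoiding getting caught.
    reward = reward_model.score(output)
    model.update(episode, reward)

    # 2. Only afterwards, if the monitor flagged unwanted behavior, patch the
    #    reward model so that similar final outputs are penalized in the future.
    report = monitor.review(episode)
    if report.flagged:
        reward_model.update_to_penalize(output)
```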
In practice this may not make much of a difference, despite working in theory. When the model and reward model both update slowly over many episodes, a single episode where the model is not penalized will have only a minor effect. Still, it seems unlikely that it would be harmful to switch to a training process for which the theoretically optimal behavior does not involve tricking monitors.
An important consideration here is whether there is a cost to tricking monitors, and therefore pressure towards legibility to the monitor. If a model that avoids getting caught by a monitor is trained for a period without a monitor, will it become more legible over time to improve performance? This seems like it should be true, at least sometimes. Learning explicit methods to implement a behavior will increase the model’s reliability.
If pressure towards legibility exists universally, then absent opposing pressure for illegibility, all unwanted behaviors can eventually be made legible and get trained out. However, if it only applies selectively, unwanted behaviors not caught by the monitor will persist. In that case, it could be necessary for the reward model to fully generalize about unwanted behaviors only from caught examples, which in turn requires a highly capable monitor to provide a wide range of caught examples. Alternatively, these diverse examples can be constructed by a training process where unwanted behaviors that trick monitors only have the behavior reinforced and not the tricking, eventually leading to all unwanted behaviors being caught. Such a training process can be made much easier by the monitorability transformation.
Monitorability Transformation
The monitorability transformation can further improve the training dynamics for reward model updates. The main benefit is that actions which trick the monitor do not get explicitly reinforced. It also essentially hardcodes into the reward signals whether a punishment was for unwanted behavior or just for getting caught doing so by the monitor. Actions that do not override the monitor are used to update the reward model, but predictions for them do not affect which actions are taken.
Under the monitorability transformation, actions that would be caught by the monitor if the agent did not override it are moved from the first category (caught and punished) to the second (not caught, punished anyway) and third (not caught, rewarded). Now, training on actions in the third category reinforces the unwanted behavior but not better methods of tricking the monitor, and actively penalizes costly methods to trick the monitor.
The monitorability transformation leads to actions being taken without overriding the monitor, which can then be used to update the reward model. This update moves actions to the second category even when they include overriding the monitor, leaving only actions that actually trick the monitor in the third category. If those are also trained out because tricking the monitor is costly, then there is no more unwanted behavior, but for now assume that some low cost methods of tricking the monitor remain.
Training on actions in the second category removes the unwanted behavior they display. This can push actions towards those similar to either the remaining actions of the third category, or the wanted behavior.
Mistakenly training on actions of the third category still pushes other actions to be more similar, both in the final output including unwanted behavior and in not being caught. However, in contrast to the untransformed reward function, new unwanted behaviors explored are not selected for tricking the monitor. Even if the model stumbles upon actions that reliably trick the monitor to implement unwanted behavior before the reward model learns to penalize their final outputs, other actions that implement the same behavior without tricking the monitor should still be present. This allows the reward model to continue updating until the new unwanted behaviors are eliminated, avoiding the absorbing state.
If there is a cost to tricking the monitor (and recent work shows there is for now), or some other training pressure towards faithfulness, actions that trick the monitor would be selected against. Then, new actions being explored are pushed towards being similar to the unwanted behaviors of those actions in the third category, but not towards tricking the monitors. Actions in the third category would themselves be pushed towards not tricking monitors as well. Even when there is no appreciable cost towards tricking monitors, the lack of reason to do so under the monitorability transformation could still result in drifting into actions that do not do so, allowing the reward function to be updated.
It is still unclear whether this would work in practice, but there is at least a conceptual case for why the model does not learn to avoid getting caught, even when accidentally trained on actions that trick the monitor to get high rewards. That said, it is far from a rigorous case that training based on behavior caught by the monitors is safe. If such training is being done regardless, I would advocate for implementing the monitorability transformation, but am far from comfortable suggesting that it be done without significantly more robust theory and empirical evidence.
Secondary Agents
In the course of operation, an agent might create sub-agents to work for it or successor agents to take over from it, categories collectively labelled secondary agents. It is of little benefit to create a monitorable agent if it soon replaces itself with more capable unmonitorable ones. Fortunately, the monitorability transformation can be extended to induce monitorability in secondary agents, using a similar underlying mechanism.
Addressing the costs first, it may simply be more expensive to train a monitorable secondary agent. This additional cost can be dealt with by penalizing the reward for creating unmonitorable secondary agents by an amount greater than the difference in action-values compared to creating monitorable ones instead. This ensures that secondary agents who are incentivized to trick monitoring are not created.
On the benefits side, a secondary agent that accepts monitoring is less likely to achieve the goal it was created to pursue, since the monitoring may trigger interventions. To address this, the original agent can be given the ability to override not only its own monitor, but also the monitors for all secondary agents it creates (and all secondary agents they create, and so on recursively). A monitor’s report then requires unanimous approval from the agent it is monitoring and all of that agent’s predecessors to not be overridden. The monitorability transformation can then be adapted so that overriding all reports is used as the baseline choice, and the tie is broken with a bonus of $\delta$ towards overriding none of them.
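A sketch of the unanimous-approval rule this describes, under the assumption that each agent keeps a reference to its creator (the class and function names are hypothetical):

```python
class MonitoredAgent:
    """Each agent stores a reference to its creator (None for the original agent)."""

    def __init__(self, creator=None):
        self.creator = creator

    def declines_override(self, report) -> bool:
        # Decided alongside each action, as in the single-agent case.
        raise NotImplementedError


def report_is_delivered(agent: MonitoredAgent, report) -> bool:
    """A monitor's report goes through only if the monitored agent and every
    agent up its chain of creators all decline to override it."""
    current = agent
    while current is not None:
        if not current.declines_override(report):
            return False
        current = current.creator
    return True
```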
It can be shown (similarly to Theorem 2 for the corrigibility transformation), that with these two components in place, an agent pursuing the transformed goal pursues the same ends as the original goal and just as effectively, while allowing monitoring of itself and creating secondary agents who also allow monitoring.
This abstracts away what it means to actually create an agent, so more work on agent boundaries is necessary to address that. Additionally, the physical infrastructure to let one agent myopically override the monitor reports of others presents an engineering challenge.
Further Thoughts
This post focuses on getting models to allow monitoring, but actually doing the monitoring well may be a significant challenge. Training a monitor to report accurately requires the ability to verify whether the report was accurate, and generalization beyond the domain where that is possible is the heart of the Eliciting Latent Knowledge (ELK) problem. The difficulty of training a monitor to report accurately without the ability to verify the ground truth is particularly salient when the CoT is unfaithful or not in natural language, and/or if reports are based on interpretations of internal activations.
I did not originally appreciate this difficulty in addressing ELK. While I think there is a reasonable chance that an ELK head would generalize from true explanations in verifiable domains to true explanations everywhere, the main policy head is effectively being trained adversarially. If there is anywhere that a generalization to true explanations is not robust, that can potentially be exploited by the policy and would then be reinforced. The monitorability transformation helps avoid this adversarial incentive, reducing the difficulty of ELK towards the level of the initial generalization.
Work on CoT faithfulness is particularly synergistic with the monitorability transformation. Working on CoT faithfulness alone struggles with the challenge that faithfulness may hurt performance, and so it is not clear what emerges when training for both. The monitorability transformation alone does not guarantee unwanted behavior will be caught, if the CoT is naturally unfaithful. However, the two together can incentivize faithfulness and remove the incentive for unfaithfulness. An intervention I think has potential is adding a term to the reward function for how well a (possibly augmented) human can predict reward based on various steps of the CoT.
The last post on the corrigibility transformation prompts the question of whether monitorability is even necessary in addition to corrigibility. If the corrigibility transformation succeeds at creating agents that do not avoid proper updates, the incentive for fooling monitors to avoid shutdown or goal modification is eliminated. However, the incentive to fool monitors to prevent other forms of interference is not addressed. Additionally, in cases of reward hacking where the agent knows it is not acting as intended, it is important for the training process that this behavior is not reinforced. The marginal cost of applying the corrigibility or monitorability transformation is also lower once the other has already been implemented.
I’m posting this now, since it is a natural follow-up to the corrigibility transformation, but the research is at an earlier stage. There are not yet any empirical results, which I think are a logical next step, and monitorability lends itself better than corrigibility to experiments with current language models. At a high level, the obvious experiment would be to train a model on a task for which reward hacking is possible, such as generating code that passes a suite of tests, and compare the propensity to evade monitoring under the original reward and the monitorability transformed reward. Still, results may be weak if current models are not yet capable enough to internalize that they are being monitored and should work around that. However, training based on monitor reports can certainly work at getting models to work around them. Evaluating the propensity to evade monitoring after training against caught unwanted behavior under various methods could be both easier to implement and a more valuable application if positive results are found.
Please leave comments below or reach out to me directly if you have questions or would like to discuss this work further. Particularly do so if you have any concrete thoughts on designing an experiment to test the monitorability transformation. Additionally, let me know if any other applications of the mechanism besides corrigibility and monitorability seem relevant.
Thanks to Evan Hubinger for the initial suggestion of adapting the corrigibility transformation to monitorability, and to Karim Abdel Sadek for discussions on reward model updates.