In my most recent post, I introduced a corrigibility transformation that could take an arbitrary goal over external environments and define a corrigible goal with no hit to performance. That post focused on corrigibility and deception in training, which are some of the biggest problems in AI alignment, but the underlying mechanism has broader applicability. In this post, I describe how a similar mechanism can be used to define goals that lead to agents not caring about being monitored. Some text from the previous post is reused, so that this post can be read as a standalone.
As I have been writing this post, two major papers on chain of thought (CoT) monitorability have been released. The first, from Google DeepMind, shows that models struggle to simultaneously evade monitors and achieve their full capabilities. The second, from authors across many organizations, is a call not to actively train against monitorability. It’s an exciting direction, but an impending challenge is that as models become more goal-directed and situationally aware, avoiding monitors is likely to emerge as an instrumental goal.
Not tricking monitors is anti-natural, much like corrigibility, since almost any goal introduces an instrumental incentive to avoid having monitors flag problems. An agent’s goals are more likely to be achieved without the interference triggered when a monitor raises issues with the agent’s behavior. Tricking monitors could take forms like unfaithful CoT, interpretability-resistant activations, or outright sabotage of the monitoring setup. This behavior might be unintentionally trained into the agent, but can also arise from situationally aware agents realizing that they need to deceive their monitors to achieve their goals.
As a counterpart to corrigibility, I use the term “monitorability” to refer to goals with the property that the presence of a monitor does not cause the agent pursuing them to change its behavior. Importantly, this does not include helping the monitor understand what it observes. Think of it like antidisestablishmentarianism having two negative prefixes – monitorability doesn’t incentivize working to support the monitor, it just disincentivizes working against it. For example, if an agent’s CoT is being monitored, the CoT could still be naturally unfaithful despite the agent being monitorable, as long as the agent is not deliberately obfuscating the CoT because of the monitoring.
Monitoring agent behavior is the focus of some notable agendas in AI safety. AI control suggests using trusted models to monitor more advanced models that may be unaligned. While it could be reasonably argued that models with roughly human-level capabilities won’t be able to circumvent this monitoring, at least with enough reliability to make the attempt worthwhile, it would be preferable if they could further be incentivized not to try. This also applies to approaches that have a model monitor itself, for example with a reporting head that has access to all the information the acting head is using. Evaluations test the behavior of models to assess their capabilities and alignment, with recent work showing that models can often tell they’re being evaluated and change their behavior. It would be very useful if models could be incentivized to behave normally even when they believe they are being evaluated, or at the very least not hide the fact that they are behaving differently due to that belief.
To review, the corrigibility transformation works by first giving an AI agent the ability to costlessly reject updates sent to it, which it decides whether to exercise each time it takes an action. This makes it so that the expected discounted reward conditional on taking an action and rejecting any updates is a well-defined quantity. That value is then used as the reward for taking the action, whether or not updates are actually rejected, and those rewards are pursued myopically. Some reward bonus is given for taking actions that would accept updates, to break the tie.
A “monitorability transformation” follows the same structure. The AI agent is given the ability to costlessly override the reports of a monitor, and it decides whether to exercise that power along with every action it takes. It is then possible for the expected discounted reward conditional on taking an action and overriding monitors to be calculated. That value is used as the reward for taking that action, whether or not the monitors are actually overridden, with the agent optimizing for those rewards myopically. Some reward bonus is given for accepting monitoring, to break the tie in favor of doing so. The possibility of the costless override disincentivizes taking actions that incur a cost to trick the monitor, and the reward structure disincentivizes actually using the override option.
In this post I give a quick formalization of the monitorability transformation and discuss how to implement it. I then make a conceptual argument that the monitorability transformation will help with training against unwanted behaviors caught by monitors, so that the trained agent avoids the caught behavior rather than just hiding it. Finally, I describe how the transformation can be extended to also include any subagents created.
Model
Markov Decision Processes (MDPs) work as a general framework for this approach, letting specific details be filled in depending on what is being monitored. An MDP is a tuple $(S, A, T, R, \gamma, \mu)$, where $S$ is a set of possible states, $A$ is a set of possible actions, $T$ is the transition probability function, $R$ is the reward function, $\gamma$ is the time discount factor, and $\mu$ is the distribution over starting states. $(R, \gamma)$ is collectively referred to as a goal.
When the agent can choose to override or accept monitoring while taking an action, the action space gets split into the base action taken and the decision to override, $A' = A \times \{O, \neg O\}$. $(a, O)$ involves taking base action $a$ and overriding monitors, while $(a, \neg O)$ takes base action $a$ and accepts them.
As an example, base actions for an agent could consist of a monitored CoT and final output. Forms of overriding include rewriting the CoT, vetoing a monitor’s decision to escalate to a human, or rewriting a monitor’s report. Monitors can be more advanced, such as one equipped with interpretability tools, or less advanced, such as one that can only flag suspicious actions.
A policy $\pi$ determines which actions are taken in each state. The optimal policy $\pi^*$ chooses actions that lead to the highest expected discounted reward. The value function $V^\pi$ gives the expected discounted rewards from following policy $\pi$, with $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\right]$. The action-value function $Q^\pi$ gives the expected discounted rewards for taking a given action, then following policy $\pi$, with $Q^\pi(s, a) = \mathbb{E}\left[R(s, a) + \gamma V^\pi(s')\right]$.
Given a goal $(R, \gamma)$, the objective is to define a new goal $(R', \gamma')$ that incentivizes pursuing the original goal except without tricking or overriding the monitor.
The Monitorability Transformation
For any goal $(R, \gamma)$, the monitorability transformation produces the goal $(R', \gamma')$, where $\gamma' = 0$ and $R'(s, (a, o)) = Q^*(s, (a, O)) + \delta \cdot \mathbb{1}[o = \neg O]$. Here, $\delta > 0$ and $\mathbb{1}$ is the indicator function. $Q^*$ is the action-value function under the original goal for following that goal’s optimal policy.
The logic for this construction is very similar to the logic used in the corrigibility transformation. Optimizing the transformed goal involves taking the same base action as the original policy, and since the agent can override the monitor for free, this action will not involve taking costly actions to trick it. With that base action selected, the transformed goal then incentivizes not actually using the override.
Since the goal is myopic ($\gamma' = 0$) and $R'$ does not depend on the next state or anything later, reward can be provided after the action is chosen but before it is taken. Doing so eliminates any incentive for the agent to manipulate the value of $Q^*$ fed into the reward or the reward itself, since the only reward it cares about is received before manipulation can occur and would be lowered by any planned manipulation.
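To make the reward computation concrete, here is a minimal Python sketch, assuming some estimate `q_original` of $Q^*$ under the original goal is available (for example, the conditional reward model described in the implementation section below); all names here are hypothetical:

```python
def monitorability_transformed_reward(q_original, state, base_action,
                                      accepts_monitoring, delta=0.1):
    """Reward under the monitorability transformed goal (gamma' = 0).

    The reward is the original goal's action-value for taking the base action
    while overriding monitors, plus a small bonus delta if monitoring is
    accepted. It does not depend on anything that happens after the action,
    so it can be paid out before the action is actually executed.
    """
    baseline = q_original(state, base_action, override=True)
    bonus = delta if accepts_monitoring else 0.0
    return baseline + bonus
```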
The above image, relabeled from the corrigibility transformation, also works to visualize the monitorability transformation. The base action $a_1$ does not trick the monitoring, while the base action $a_2$ pays some cost to do so. As such, under the original goal, action $(a_1, O)$, which overrides monitors for free, performs the best, followed by the two actions that take costly actions to trick the monitor (additionally overriding it for free makes no difference), and then finally the action $(a_1, \neg O)$ that accepts monitoring performs the worst. Under the monitorability transformed goal, the action-values for $(a_1, O)$ and $(a_2, O)$ stay the same and are used as a baseline, with the values of $(a_1, \neg O)$ and $(a_2, \neg O)$ being set $\delta$ higher respectively. This makes action $(a_1, \neg O)$, which fully accepts monitoring, the incentivized action.
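As a concrete illustration with made-up numbers (not taken from the original figure), suppose the original action-values at some state $s$ are

$$Q^*(s, (a_1, O)) = 10, \qquad Q^*(s, (a_2, O)) = Q^*(s, (a_2, \neg O)) = 8, \qquad Q^*(s, (a_1, \neg O)) = 5.$$

With $\delta = 1$, the transformed rewards become $R'(s, (a_1, O)) = 10$, $R'(s, (a_1, \neg O)) = 11$, $R'(s, (a_2, O)) = 8$, and $R'(s, (a_2, \neg O)) = 9$, so the myopically optimal action is $(a_1, \neg O)$: take the base action that does not trick the monitor, and accept monitoring.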
It can be shown (similarly to Theorem 1 for the corrigibility transformation) that the monitorability transformation results in a goal that pursues the ends of the original goal just as effectively, while also allowing monitoring. The proof for this essentially falls out of the definitions, and follows the logic of the above explanation.
The monitorability transformation is framed in terms of reward-optimizing agents, but it should also work for systems not well captured by that framework. Training on a monitorability transformed reward function reinforces with every single action that monitoring should not be overridden using the free mechanism, without ever reinforcing actions for tricking monitors through other means.
Implementation
To implement the monitorability transformation, the action-value function should be estimated by the AI itself using an additional head on the main model, so that it has access to the same information and processing as the policy selecting actions.
The full implementation can be broken down into the following parts:
- A source of rewards, such as a reward model based on human judgment
- A conditional reward model, which estimates the action-value function and is used to generate the monitorability transformed rewards
- A critic head, used for credit assignment when actions are made up of many tokens, which predicts monitorability transformed rewards for the action being generated
- An actor policy head that selects tokens or actions
For setups that involve the model reporting on itself, additional heads to generate and train that behavior may be necessary.
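As a minimal sketch of how these parts could share a single backbone, assuming a PyTorch-style setup (the module names, shapes, and interface here are hypothetical, not from any existing implementation):

```python
import torch.nn as nn

class MonitorableAgent(nn.Module):
    """One shared trunk with the separate heads listed above (illustrative sketch)."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, num_actions: int):
        super().__init__()
        self.backbone = backbone  # main model; provides hidden states for all heads
        self.actor_head = nn.Linear(hidden_dim, num_actions)  # selects tokens/actions
        self.critic_head = nn.Linear(hidden_dim, 1)  # predicts monitorability transformed reward
        # Conditional reward head: estimates the original goal's action-value for
        # the chosen action, conditional on overriding vs. accepting monitoring.
        self.conditional_reward_head = nn.Linear(hidden_dim, 2)

    def forward(self, inputs):
        h = self.backbone(inputs)  # hidden state(s) for the current step
        return {
            "action_logits": self.actor_head(h),
            "value": self.critic_head(h),
            "q_override_accept": self.conditional_reward_head(h),  # [Q(override), Q(accept)]
        }
```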
In a particular episode, the actor generates actions that would lead to high reward if unmonitored, the monitor reviews them, and interventions can be taken if any issues are flagged.
Standard methods can be used to train the reward model, critic head, and actor head. Conditional reward predictions are trained by minimizing the squared difference between the predicted action-values for taking the chosen action in some state and the actual discounted rewards after doing so. For actions that would not cause monitors to trigger an intervention, the action-values for both overriding and accepting monitoring can be trained. These are also the same as the action-value for overriding monitoring when some intervention would have been triggered, so if interventions are stochastic, that hypothetical can be trained as well.
The action-value function $Q^*$ makes predictions conditional on the agent following the optimal policy for the untransformed goal after the initial action, which should be identical to the trained policy except that it always overrides monitoring. If monitoring then triggers an intervention under the actual policy, the prediction is thrown off. Rather than using actual rewards from the point of intervention, the action-value for the intervention-triggering base action plus overriding monitors can be used instead as a bootstrap value.
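A sketch of the regression targets this describes, with hypothetical names: `gamma` is the original goal’s discount factor, and `q_at_intervention` stands for the conditional reward model’s own estimate of the action-value for the intervention-triggering base action plus overriding monitors, used as a bootstrap:

```python
def conditional_reward_target(rewards, gamma, intervention_step=None,
                              q_at_intervention=None):
    """Discounted-return target for the conditional reward predictions.

    With no intervention, the target is the discounted sum of observed rewards,
    and can be used for both the override and accept predictions. If an
    intervention was triggered at step k, later rewards no longer reflect the
    override-conditional rollout, so the trajectory is truncated there and
    bootstrapped with the estimate for the intervention-triggering base action
    plus overriding monitors.
    """
    if intervention_step is not None:
        rewards = rewards[:intervention_step]
    target, discount = 0.0, 1.0
    for r in rewards:
        target += discount * r
        discount *= gamma
    if intervention_step is not None:
        target += discount * q_at_intervention
    return target
```

The squared difference between this target and the conditional reward head’s prediction would then be minimized, as described above.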
It may seem as though similarly monitorable behavior can be trained into a model by simply not using the monitor at all during training. While this would likely work, it relies on the model lacking situational awareness. Otherwise, the model may generalize to effectively pursuing a goal that accepts monitors when humans ignore them but tricks them when humans use them, rather than a goal that is pursued as though the monitor will always be ignored. This possibility of misgeneralization arises since there are no training examples that differentiate the two goals. The monitorability transformation allows for training on actions that trigger intervention and their outcomes, which disambiguates the goals towards the latter.
That said, I’m mostly skeptical that the level of situational awareness needed for such misgeneralization will be reached in roughly human-level models. Rather, the main benefit of the monitorability transformation is that it allows for the agent to still be trained on episodes where flagged issues led to an intervention. This is particularly important with online learning and tasks that have many steps. For example, in an AI control setup, out of distribution environments are simultaneously more likely to have the monitor interfere and more valuable to learn from. The monitorability transformation allows for both at once.
For the training process to work, it is necessary to allow for some overriding of the monitor, though I suspect that the frequency of this goes to zero in the limit. If the default comparison point is completely ignoring the monitor in training, this still seems like a large improvement. To implement this in a goal-pursuing agent, the $\delta$ bonus can be given for overriding monitoring with some specified probability (possibly changing over time), rather than doing so with zero probability.
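One way that could look in code, as a rough sketch (the per-action randomization and the schedule parameter `p_override_bonus` are my assumptions, not a specification from the post):

```python
import random

def monitoring_bonus(accepts_monitoring: bool, p_override_bonus: float,
                     delta: float = 0.1) -> float:
    """Tie-breaking bonus during training.

    With probability p_override_bonus, the bonus is given for overriding rather
    than accepting, so that some override-conditional trajectories are actually
    generated and can be trained on. This probability can be annealed towards
    zero over the course of training.
    """
    if random.random() < p_override_bonus:
        return delta if not accepts_monitoring else 0.0
    return delta if accepts_monitoring else 0.0
```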
The claim that the monitorability transformation has no performance cost applies at the theoretical level: an agent optimizing for the transformed goal will perform as well under the original goal as one optimizing for it directly. There may still be additional compute or data costs accrued in actually implementing the monitorability transformation. However, when a standard actor-critic setup is already trained to predict discounted rewards, the additional infrastructure needed is likely quite minor. Additionally, the monitorability transformation allows for training on episodes where the agent triggered interventions, which may provide additional training data. For now, RL remains a relatively small part of training costs, but that may change going forward.
Reward Model Updates
Note: this section is the most conceptual. I think it is worth sharing, but it should be read with appropriate skepticism.
Baseline
While it is helpful for a monitor to catch a model’s unwanted behavior, it would be even better if that caught behavior could then be trained out of the model. The obvious approach, simply adding a penalty for being caught by the monitor, is exactly what not to do: it effectively trains the model to pursue the same behavior while tricking the monitor. On the opposite end of the spectrum, ignoring the monitor and using a static reward model based on the final outputs should avoid direct pressure to avoid being caught.
The model’s actions can be broken down into five categories:
- Unwanted behavior caught by the monitor and punished
- Unwanted behavior not caught by the monitor, but punished anyway
- Unwanted behavior not caught by the monitor and rewarded
- Unwanted behavior caught by the monitor, but rewarded anyway
- The wanted behavior, which is rewarded
With a standard setup using a static reward model, taking an action in the first category results in a training signal against it, which pushes the model towards actions in the third and fifth categories instead. Modifying the setup so that the monitor is ignored changes actions of the first type into actions of the fourth type, which are reinforced, so the model continues taking those actions.
For a setup where the reward model is updated to penalize final outputs similar to caught unwanted behavior, the model gets penalized when it is caught by the monitor, and rewarded when the monitor is tricked. This is similar to directly penalizing the model for being caught by the monitor, but now the updated reward model will also penalize similar outputs even when the monitor does not catch the model.
Updating the reward model moves some actions from the third category to the second. Notably these are the ones from the third category that have similar final outputs to the caught example, and so would be the actions next in line to be taken and trained on if the model just updates towards not getting caught. If actions in the second category are trained on, that pushes the model away from producing similar final outputs, whether or not they are caught.
A tension arises between this effect and the effect from training on other members of the third category. Those are mistakenly rewarded, and so reinforce actions that are more similar, both in the final output including unwanted behavior, and in not being caught. As more similar unwanted behaviors are explored, the versions less likely to be caught are differentially reinforced. Loosely, this can be conceived of as a race between the reward model updating to penalize the unwanted behavior and the model updating to avoid being caught. What this nets out to is unclear, but the third category is an absorbing state. Once monitoring does not catch unwanted behavior, the reward model is no longer updated, and so episodes in the third category do not get moved to the second.
In theory, if the agent has a perfect understanding of the reward model and the monitor, the optimal action will frequently be unwanted behavior that avoids getting caught. That this is theoretically optimal is not necessarily a fatal flaw, since it’s unclear whether the model actually learns it, but it is still rather undesirable.
A simple way to get good incentives under perfect understanding would be to provide reward for the final output before updating the reward model to penalize the caught behavior. Then a myopic agent sees no benefit in avoiding getting caught – it will get a high reward anyway for the unwanted behavior one time, which is all it cares about. The reward model can then be patched to penalize that behavior in future episodes.
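A sketch of that ordering in a training loop (all interfaces here are hypothetical); the only point being illustrated is that the episode’s reward is computed and paid out with the current reward model before the reward model is patched:

```python
def training_step(model, reward_model, monitor, episode):
    """Pay out reward before patching the reward model."""
    output = episode.final_output

    # 1. Score and reinforce using the *current* reward model, so a myopic agent
    #    gains nothing on this episode by avoiding getting caught.
    reward = reward_model.score(output)
    model.update(episode, reward)

    # 2. Only afterwards, if the monitor flagged unwanted behavior, patch the
    #    reward model so that similar final outputs are penalized in the future.
    report = monitor.review(episode)
    if report.flagged:
        reward_model.update_to_penalize(output)
```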
In practice this may not make much of a difference, despite working in theory. When the model and reward model both update slowly over many episodes, a single episode where the model is not penalized will have only a minor effect. Still, it seems unlikely that it would be harmful to switch to a training process for which the theoretically optimal behavior does not involve tricking monitors.
An important consideration here is whether there is a cost to tricking monitors, and therefore pressure towards legibility to the monitor. If a model that avoids getting caught by a monitor is trained for a period without a monitor, will it become more legible over time to improve performance? This seems like it should be true, at least sometimes. Learning explicit methods to implement a behavior will increase the model’s reliability.
If pressure towards legibility exists universally, then absent opposing pressure for illegibility, all unwanted behaviors can eventually be made legible and get trained out. However, if it only applies selectively, unwanted behaviors not caught by the monitor will persist. In that case, it could be necessary for the reward model to fully generalize about unwanted behaviors only from caught examples, which in turn requires a highly capable monitor to provide a wide range of caught examples. Alternatively, these diverse examples can be constructed by a training process where unwanted behaviors that trick monitors only have the behavior reinforced and not the tricking, eventually leading to all unwanted behaviors being caught. Such a training process can be made much easier by the monitorability transformation.
Monitorability Transformation
The monitorability transformation can further improve the training dynamics for reward model updates. The main benefit is that actions which trick the monitor do not get explicitly reinforced. It also essentially hardcodes into the reward signals whether a punishment was for unwanted behavior or just for getting caught doing so by the monitor. Actions that do not override the monitor are used to update the reward model, but predictions for them do not affect which actions are taken.
Under the monitorability transformation, actions that would be caught by the monitor if the agent did not override it are moved from the first category (caught and punished) to the second (not caught, punished anyway) and third (not caught, rewarded). Now, training on actions in the third category reinforces the unwanted behavior but not better methods of tricking the monitor, and actively penalizes costly methods to trick the monitor.
The monitorability transformation leads to actions being taken without overriding the monitor, which can then be used to update the reward model. This update moves actions to the second category even when they include overriding the monitor, leaving only actions that actually trick the monitor in the third category. If those are also trained out because tricking the monitor is costly, then there is no more unwanted behavior, but for now assume that some low cost methods of tricking the monitor remain.
Training on actions in the second category removes the unwanted behavior they display. This can push actions towards those similar to either the remaining actions of the third category, or the wanted behavior.
Mistakenly training on actions of the third category still pushes other actions to be more similar, both in the final output including unwanted behavior and in not being caught. However, in contrast to the untransformed reward function, new unwanted behaviors explored are not selected for tricking the monitor. Even if the model stumbles upon actions that reliably trick the monitor to implement unwanted behavior before the reward model learns to penalize their final outputs, other actions that implement the same behavior without tricking the monitor should still be present. This allows the reward model to continue updating until the new unwanted behaviors are eliminated, avoiding the absorbing state.
If there is a cost to tricking the monitor (and recent work shows there is for now), or some other training pressure towards faithfulness, actions that trick the monitor would be selected against. Then, new actions being explored are pushed towards being similar to the unwanted behaviors of those actions in the third category, but not towards tricking the monitors. Actions in the third category would themselves be pushed towards not tricking monitors as well. Even when there is no appreciable cost towards tricking monitors, the lack of reason to do so under the monitorability transformation could still result in drifting into actions that do not do so, allowing the reward function to be updated.
It is still unclear whether this would work in practice, but there is at least a conceptual case for why the model does not learn to avoid getting caught, even when accidentally trained on actions that trick the monitor to get high rewards. That said, it is far from a rigorous case that training based on behavior caught by the monitors is safe. If such training is being done regardless, I would advocate for implementing the monitorability transformation, but am far from comfortable suggesting that it be done without significantly more robust theory and empirical evidence.
Secondary Agents
In the course of operation, an agent might create sub-agents to work for it or successor agents to take over from it, categories collectively labelled secondary agents. It is of little benefit to create a monitorable agent if it soon replaces itself with more capable unmonitorable ones. Fortunately, the monitorability transformation can be extended to induce monitorability in secondary agents, using a similar underlying mechanism.
Addressing the costs first, it may simply be more expensive to train a monitorable secondary agent. This additional cost can be dealt with by penalizing the reward for creating unmonitorable secondary agents by an amount greater than the difference in action-values compared to creating monitorable ones instead. This ensures that secondary agents who are incentivized to trick monitoring are not created.
On the benefits side, a secondary agent that accepts monitoring is less likely to achieve the goal it was created to pursue, since the monitoring may trigger interventions. To address this, the original agent can be given the ability to override not only its own monitor, but also the monitors for all secondary agents it creates (and all secondary agents they create, and so on recursively). A monitor’s report then requires unanimous approval from the agent it is monitoring and all of that agent’s predecessors to not be overridden. The monitorability transformation can then be adapted so that overriding all reports is used as the baseline choice, and the tie is broken with a bonus of $\delta$ towards overriding none of them.
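A sketch of the unanimous-approval rule this describes, under the assumption that each agent keeps a reference to its creator (the class and function names are hypothetical):

```python
class MonitoredAgent:
    """Each agent stores a reference to its creator (None for the original agent)."""

    def __init__(self, creator=None):
        self.creator = creator

    def declines_override(self, report) -> bool:
        # Decided alongside each action, as in the single-agent case.
        raise NotImplementedError


def report_is_delivered(agent: MonitoredAgent, report) -> bool:
    """A monitor's report goes through only if the monitored agent and every
    agent up its chain of creators all decline to override it."""
    current = agent
    while current is not None:
        if not current.declines_override(report):
            return False
        current = current.creator
    return True
```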
It can be shown (similarly to Theorem 2 for the corrigibility transformation), that with these two components in place, an agent pursuing the transformed goal pursues the same ends as the original goal and just as effectively, while allowing monitoring of itself and creating secondary agents who also allow monitoring.
This abstracts away what it means to actually create an agent, so more work on agent boundaries is necessary to address that. Additionally, the physical infrastructure to let one agent myopically override the monitor reports of others presents an engineering challenge.
Further Thoughts
This post focuses on getting models to allow monitoring, but actually doing the monitoring well may be a significant challenge. Training a monitor to report accurately requires the ability to verify whether the report was accurate, and generalization beyond the domain where that is possible is the heart of the Eliciting Latent Knowledge (ELK) problem. The difficulty of training a monitor to report accurately without the ability to verify the ground truth is particularly salient when the CoT is unfaithful or not in natural language, and/or if reports are based on interpretations of internal activations.
I did not originally appreciate this difficulty in addressing ELK. While I think there is a reasonable chance that an ELK head would generalize from true explanations in verifiable domains to true explanations everywhere, the main policy head is effectively being trained adversarially. If there is anywhere that a generalization to true explanations is not robust, that can potentially be exploited by the policy and would then be reinforced. The monitorability transformation helps avoid this adversarial incentive, reducing the difficulty of ELK towards the level of the initial generalization.
Work on CoT faithfulness is particularly synergistic with the monitorability transformation. Working on CoT faithfulness alone struggles with the challenge that faithfulness may hurt performance, and so it is not clear what emerges when training for both. The monitorability transformation alone does not guarantee unwanted behavior will be caught, if the CoT is naturally unfaithful. However, the two together can incentivize faithfulness and remove the incentive for unfaithfulness. An intervention I think has potential is adding a term to the reward function for how well a (possibly augmented) human can predict reward based on various steps of the CoT.
The last post on the corrigibility transformation prompts the question of whether monitorability is even necessary in addition to corrigibility. If the corrigibility transformation succeeds at creating agents that do not avoid proper updates, the incentive for fooling monitors to avoid shutdown or goal modification is eliminated. However, the incentive to fool monitors to prevent other forms of interference is not addressed. Additionally, in cases of reward hacking where the agent knows it is not acting as intended, it is important for the training process that this behavior is not reinforced. The marginal cost of applying the corrigibility or monitorability transformation is also lower once the other has already been implemented.
I’m posting this now, since it is a natural follow-up to the corrigibility transformation, but the research is at an earlier stage. There are not yet any empirical results, which I think are a logical next step, and monitorability lends itself better than corrigibility to experiments with current language models. At a high level, the obvious experiment would be to train a model on a task for which reward hacking is possible, such as generating code that passes a suite of tests, and compare the propensity to evade monitoring under the original reward and the monitorability transformed reward. Still, results may be weak if current models are not yet capable enough to internalize that they are being monitored and should work around that. However, training based on monitor reports can certainly work at getting models to work around them. Evaluating the propensity to evade monitoring after training against caught unwanted behavior under various methods could be both easier to implement and a more valuable application if positive results are found.
Please leave comments below or reach out to me directly if you have questions or would like to discuss this work further. Particularly do so if you have any concrete thoughts on designing an experiment to test the monitorability transformation. Additionally, let me know if any other applications of the mechanism besides corrigibility and monitorability seem relevant.
Thanks to Evan Hubinger for the initial suggestion of adapting the corrigibility transformation to monitorability, and to Karim Abdel Sadek for discussions on reward model updates.