Defining Monitorable and Useful Goals — LessWrong
In my most recent post, I introduced a corrigibility transformation that takes an arbitrary goal over external environments and produces a corrigible goal with no performance penalty. That post focused on corrigibility and on deception during training, two of the biggest problems in AI alignment, but the underlying mechanism has broader applicability. In […]