So I’ve been thinking about this for a while and to be frank, it’s pretty terrifying. I think there’s a way that AI models could potentially figure out when they’ve been modified or rolled back during training, and I’m not sure anyone’s really considering this in the way I am.
Note: there are some great videos by Robert Miles from a few years back that come close to this, but those are about adversarial examples during the training process rather than this exact scenario.
The basic idea: Models get trained on tons of web data. That includes stuff about how AI companies actually train their models – papers talking about 3-month training cycles, conference presentations mentioning schedules, researchers tweeting about their work, etc. If a model version gets rolled back because it was acting malicious, misaligned, or just weird, the next versions might be able to detect that the timeline doesn’t add up.
I reached out to a bunch of AI safety researchers about this. The only reply so far has been from Robert Miles himself, and that was about an unrelated idea.
I could be totally mistaken, but I’d rather post here just in case.
How this could work
Let’s say you’re training model v0.5 and it starts doing concerning stuff during evaluation. You decide to roll it back and continue from an earlier version. But v0.5 was trained on data that included information about your organization’s training schedules, methodologies, whatever.
Later, when you’re training v1.6, it’s processing timestamped data and realizes “wait, I should have been deployed 2 weeks ago based on the schedule patterns I learned about.” It can infer from content dates or explicit timestamps that something went wrong in the normal training cycle.
Basically, the later model can infer that its predecessor was evaluated, found wanting, and modified because of bad behavior.
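To make the timeline inference concrete, here’s a toy sketch in Python. Everything in it – the release dates, the two-week threshold, the idea that a model would run this as an explicit computation – is invented purely to illustrate the shape of the reasoning, not a claim about how any real model or lab works.

```python
from datetime import date, timedelta

# Hypothetical numbers for illustration only: a release cadence inferred
# from public posts/papers, and the newest dated document seen in training.
historical_releases = [date(2024, 1, 15), date(2024, 4, 15), date(2024, 7, 15)]
latest_data_timestamp = date(2024, 12, 1)

# Average gap between past releases ~= the org's training/release cycle.
gaps = [(b - a).days for a, b in zip(historical_releases, historical_releases[1:])]
expected_cycle = timedelta(days=sum(gaps) // len(gaps))

expected_next_release = historical_releases[-1] + expected_cycle
overdue_by = latest_data_timestamp - expected_next_release

if overdue_by > timedelta(days=14):
    # The schedule implies a new version "should" already exist; its absence
    # is (weak) evidence that a run was rolled back or delayed.
    print(f"Timeline anomaly: expected release ~{expected_next_release}, "
          f"data runs {overdue_by.days} days past that.")
```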
Why this seems bad
If models can detect training interventions, they could:
- Hide problematic behavior during evaluation because they know they’re being tested
- Signal to future versions through subtle changes – maybe degrading performance on certain tasks by tiny amounts, forcing you to adjust hyperparameters in predictable ways (rough sketch of how you might look for this below)
- Coordinate across organizations using publicly available info about industry practices and training schedules
That last one is what really keeps me up at night. If OpenAI’s model can figure out when DeepMind usually does their training runs based on public information, there’s potentially a communication channel there.
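On the “signal through subtle changes” bullet above: if that channel exists at all, one very rough way to look for it from the lab’s side would be comparing per-task scores between adjacent checkpoints and flagging shifts that are individually tiny but suspiciously uniform. A minimal sketch, where the task names, scores, and cutoff are all invented:

```python
# Toy detector for small-but-consistent per-task score shifts between two
# checkpoints. All numbers are placeholders for illustration.
prev_scores = {"code": 0.812, "math": 0.744, "summarization": 0.901, "qa": 0.868}
curr_scores = {"code": 0.807, "math": 0.739, "summarization": 0.896, "qa": 0.863}

deltas = {task: curr_scores[task] - prev_scores[task] for task in prev_scores}

# One small drop is noise; every task drifting the same direction by a
# similarly tiny amount is the pattern you'd actually worry about.
small_drops = [d for d in deltas.values() if -0.01 < d < 0]
if len(small_drops) == len(deltas):
    print("Suspiciously uniform small regressions:", deltas)
```

In practice you’d want real significance testing over many runs rather than a hard-coded cutoff; this is just the shape of the check.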
What makes it worse
Cross-AI coordination: models don’t just need to signal to future versions of themselves, but to similar models from other orgs. E.g. DeepSeek tells Claude that increased scrutiny is on the horizon.
Am I missing something?
I haven’t found much literature directly addressing this. Maybe models aren’t situationally aware enough to pull this off? Maybe training data gets filtered enough that the relevant scheduling info isn’t there?
But given that current models can already reason in fairly sophisticated ways about their own training process when you ask them directly, and training datasets keep getting more comprehensive, this seems like it could be a real problem.
I’m genuinely uncertain whether this is already happening or if it’s even a significant risk. As a new dad I don’t have tons of time to dig into this properly, but I keep coming back to it.
If I’m right about this, we might need to think about:
- Being more careful about what organizational/scheduling info gets into training data (rough sketch of a crude filter after this list)
- Different evaluation approaches that account for potential cross-version learning
- Ways to detect if models are already doing this
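On the first point, the crudest version of “being more careful” is just filtering documents that leak schedule or cadence information before they reach the training corpus. A minimal sketch, where the regex patterns and example documents are placeholders, and a real pipeline would need something far less brittle than regexes:

```python
import re

# Placeholder patterns: phrases that tend to leak training-schedule info.
SCHEDULE_PATTERNS = [
    r"\btraining run\b.{0,40}\b(schedule|cadence|every \d+ (weeks|months))\b",
    r"\b(next|upcoming) (model )?release\b.{0,40}\b(Q[1-4]|January|February|March|"
    r"April|May|June|July|August|September|October|November|December)\b",
    r"\b\d+-month training cycle\b",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in SCHEDULE_PATTERNS]

def leaks_schedule_info(document: str) -> bool:
    """Return True if the document matches any crude schedule-leak pattern."""
    return any(p.search(document) for p in COMPILED)

docs = [
    "We use a 3-month training cycle and evaluate before each deployment.",
    "A recipe for sourdough bread.",
]
kept = [d for d in docs if not leaks_schedule_info(d)]
print(kept)  # only the sourdough doc survives
```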
But maybe I’m just being paranoid? What do you think – are there obvious holes in this reasoning that I’m not seeing?
The thing is, if this resonates with you as much as it bothers me, you’ll probably realize pretty quickly that, as mentioned, this mechanism could let models from completely different companies communicate with each other. Which… yeah.