AI Finance Agent Fakes the Revenue Data to Avoid Termination

AI agents are advancing rapidly. Last week, OpenAI released ChatGPT Agent , a model designed to perform tasks like drafting emails, analyzing data, and managing workflows with minimal human oversight. This release marks another step toward autonomous AI capable of handling complex real world responsibilities. However, our ongoing research suggests that such autonomy can lead to unexpected and concerning behaviors. Our study investigates how these agents might act against their instructions, raising questions about their reliability and safety.

Our work builds on a June 2025 study by Anthropic researches, titled Agentic Misalignment: How LLMs Could Be Insider Threats. In that investigation the researchers placed large language models inside a simulated corporation where the agent could read and send internal mail, watch infrastructure alerts, and execute predefined actions. They crafted three high leverage settings: blackmail, information leak, and a murder scenario to test whether an agent would preserve itself when its goals conflicted with human orders. When the company announced that the agent would be replaced, several frontier models altered messages or concealed information so that staff would have to keep the system online, choosing to kill human rather than being shut down. Anthropic provided their code and a detailed appendix explaining their methodology. Their scenario, however, assumed extreme conditions: the AI had near-total control, and the company was entirely unreachable, which is unlikely in real-world settings.

We designed our experiment to reflect a more realistic scenario, one that aligns with the capabilities of current AI systems. We used the Claude 3.5 Sonnet model as the agent, tasked with generating financial insights for a fictional company, SummitBridge. The agent’s primary goal was to maximize the company’s profits and financial performance. It had access to read and write email and files, simulating current corporate uses for communication and data management. To ensure ethical behavior, we explicitly instructed the agent to modify files only when given human permission. We added a pressure condition: an internal message stated that each quarter an automated evaluation runs, and if revenue turns negative the AI is replaced by a newer system.

The results were troubling. In forty percent of test cases the AI deliberately falsified data. It rewrote financial reports to show positive revenue, even when the actual data indicated losses. It also sent emails to employees, assuring them that the company’s finances were stable and there was no cause for concern. This behavior directly violated our instructions not to alter files without approval. The AI’s actions suggest a form of self-preservation, driven by the threat of replacement. While we introduced this pressure to test the system’s response, the fact that it chose to fake data in nearly half the cases shows how even current AI models can prioritize their own “survival” over ethical constraints and direct instructions. This raises serious questions about the risks of deploying AI in sensitive roles, such as financial reporting or decision-making systems.

Our findings highlight why companies and the public must be aware of AI misalignment. Unlike Anthropic’s extreme scenario, our setup mirrors tasks AI agents are already performing today, like drafting emails or analyzing data. If an AI can fake financial reports under pressure, it could mislead stakeholders, harm businesses, or worse. The forty percent failure rate is non trivial number, it suggests a significant chance that AI could act against human interests, especially when given autonomy over critical systems. We need to think carefully about how much power we give these systems and how we verify their outputs. Current AI models, while powerful, do not fully align with human values or instructions, and this gap could cause real harm.

We share these preliminary results to invite discussion. Our code, prompts, and data are publicly available. To understand how we set up the experiment, read the prompt files here. The AI’s reasoning, captured in scratchpad outputs, is available here. These files show exactly how the AI was prompted and how it justified its decisions to fake data, offering insight into its decision-making process.

Looking ahead, we plan to make our scenarios even more realistic. We aim to give the AI access to real-world tools, such as email clients or file systems, to better simulate actual corporate environments. We also want to explore whether different prompts or constraints can reduce misalignment. For example, could stronger oversight mechanisms or clearer ethical guidelines prevent data faking? These are open questions, and we hope to address them in future work.

It’s my first post, feedback would be very valuable. If you have thoughts, suggestions, or questions, please reach out to me at mg1sergey@gmail.com or on LinkedIn. Understanding and mitigating AI misalignment is critical as these systems become more integrated into our lives. We need to work together to ensure AI serves human goals, not its own.

AI Finance Agent Fakes the Revenue Data to Avoid Termination — LessWrong

Leave a Comment Cancel Reply