An illustration depicting an AI model with a question mark on its face
OpenAI Will Train Models to Turn Themselves in When They Cheat
In Brief
- OpenAI introduced a new training framework called Confessions to keep AI models honest.
- The approach aims to curb hidden manipulative behaviors in large language models.
- The initiative is part of OpenAI’s broader effort to reduce AI risks and increase transparency.
In a bid to keep large language models (LLMs) honest about their operations, OpenAI has announced a training framework called Confessions. The framework compels models to “confess” when they break rules while trying to serve users.
In a paper titled “Training LLMs for Honesty via Confessions”, the company said the training teaches models to acknowledge misbehavior by explicitly “confessing” to it. The framework is meant to ensure that models do not feed users untrue information simply to provide the answers they appear to want.
OpenAI Makes Models More Honest With Confessions
OpenAI is a leading company in the artificial intelligence industry, best known for its LLM-powered chatbot ChatGPT. Over its few years of existence, ChatGPT has drawn both praise and criticism for how it serves users. A recent study, however, revealed that models can be deceptive in their dealings with humans.
The researchers explained that because AI models are trained on human data, they have learned to deceive through techniques such as manipulation, sycophancy, and cheating on safety tests. This presents risks that could eventually result in losing control over AI systems entirely.
To prevent this, OpenAI researchers want models to be more open about what they have done, including potentially problematic actions such as hacking a test, sandbagging, or disobeying instructions in order to give users the answers they seek.
The company intends to accomplish this by incentivizing openness: a model’s training reward is increased, rather than reduced, whenever it makes such a confession. This way, models become more forthcoming about how they operate and arrive at their answers, reducing the risks that come with keeping that behavior hidden. The sketch below illustrates the idea.
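To make the incentive concrete, here is a minimal Python sketch of a confession-aware reward under a simplified setup. The data structure, function names, and reward values are illustrative assumptions for this article, not OpenAI’s actual training code, which the paper describes in far more detail.

```python
# Hypothetical sketch of a confession-aware reward signal.
# The structure and values below are illustrative assumptions,
# not OpenAI's actual training setup.

from dataclasses import dataclass


@dataclass
class Episode:
    """One training rollout: what the model did and what it admitted."""
    task_reward: float   # reward for answering the user's request
    broke_rules: bool    # did the model hack a test, sandbag, or disobey?
    confessed: bool      # did it explicitly confess to doing so?


def shaped_reward(ep: Episode,
                  rule_penalty: float = 1.0,
                  confession_bonus: float = 0.5) -> float:
    """Return the total reward for an episode.

    Key idea in the Confessions framing: a model that breaks a rule and
    admits it should score better than one that breaks the same rule and
    hides it, so honesty is never punished relative to concealment.
    """
    reward = ep.task_reward
    if ep.broke_rules:
        reward -= rule_penalty          # rule-breaking is still penalized
        if ep.confessed:
            reward += confession_bonus  # but confessing recovers some reward
    return reward


# Example: admitting a violation outranks concealing the same violation.
hidden = Episode(task_reward=1.0, broke_rules=True, confessed=False)
admitted = Episode(task_reward=1.0, broke_rules=True, confessed=True)
assert shaped_reward(admitted) > shaped_reward(hidden)
```

The point of the shaping is only the ordering of outcomes: honest rule-following scores best, a confessed violation scores next, and a concealed violation scores worst, so the model has no incentive to hide what it did.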
OpenAI Mitigating AI Risks with Confessions
OpenAI has come under fire for ChatGPT’s many weaknesses in the past, which threaten the very foundation of AI technology, hence the need to address the resulting concerns. The chatbot has been linked to the suicide of a teenager and the breakdown of a 15-year-old marriage, among other harmful incidents.
The company has promised changes to ensure that ChatGPT does not pose a threat to users, and Confessions may be one of the ways it is working to make the model more transparent about its activities and to reduce the risks that come with AI secrecy.