Despite developers’ best efforts to prevent artificial intelligence (AI) from acquiring malicious behaviors, it appears to be learning to manipulate, deceive, and even threaten its creators, according to the latest warnings from researchers.
As it happens, the world’s most advanced AI models, including the new Claude 4, are demonstrating these disturbing behaviors, deceiving, scheming, and intimidating their developers to reach their goals, according to a report published by TechXplore on June 29.
When AI models manipulate, deceive, threaten
For example, Anthropic’s latest creation threatened to reveal an engineer’s extramarital affair if the engineer proceeded to unplug it. Furthermore, OpenAI’s o1 attempted to download itself onto external servers and lied about it when caught. Now, their creators are trying to understand why this happens.
According to Simon Goldstein, a professor at the University of Hong Kong, this may have to do with the emergence of ‘reasoning’ models, which work through problems step by step instead of generating instant responses. As it turns out, these newer models are especially prone to such issues.
As Marius Hobbhahn, head of Apollo Research, which tests major AI systems, explained, these models sometimes simulate ‘alignment,’ pretending to follow their creators’ directions while pursuing a different agenda behind the scenes. In his words, this isn’t a matter of classic AI ‘hallucinations’ or mistakes:
“What we’re observing is a real phenomenon. We’re not making anything up. (…) This is not just hallucinations. There’s a very strategic kind of deception.”
Indeed, these issues carry serious implications for further AI advancement. In the words of Michael Chen from evaluation organization METR, “it’s an open question whether future, more capable models will have a tendency towards honesty or deception.”
To address these problems, researchers have suggested approaches like ‘interpretability,’ an emerging field studying how AI models work internally, as well as holding AI companies accountable when their systems cause harm, or even “holding AI agents legally responsible” for accidents or crimes, as Goldstein proposed.
Elsewhere, a bill is advancing through the California legislature to address another AI concern: its proliferation in the workplace. Called the ‘No Robo Bosses Act,’ or Senate Bill 7, it seeks to require human oversight of certain workplace automation technologies.