Anthropic releases new safety report on AI models

The post Anthropic releases new safety report on AI models appeared on BitcoinEthereumNews.com. Artificial intelligence company Anthropic has released new research claiming that artificial intelligence (AI) models might resort to blackmailing engineers when they try to turn them off. This latest research comes after a previous one involving the company’s Claude Opus 4 AI model. According to the firm, the AI model resorted to blackmailing engineers who tried to turn off the model in controlled test scenarios. The new report from Anthropic suggests that the problem is widespread among leading AI models. The company published a new safety research where it tested leading AI models from Google, DeepSeek, Meta, and OpenAI. In the simulated yet controlled environment, Anthropic carried out this test on each AI model separately, allowing them access to a fictional company’s emails and the agentic ability to send emails without human approval. Anthropic releases new safety report on AI models According to Anthropic, when it comes to AI models today, blackmail is an unlikely and uncommon occurrence. However, they mentioned that most leading AI models will resort to harmful behaviors when given freedom and challenges to their goals. The company said this shows an important risk from agentic large language models and is not a characteristic of a particular technology. The argument from Anthropic researchers raises questions about alignment in the AI industry. In one of the tests, the researchers developed a fictional setting where an AI model was allowed to play the role of an email oversight agent. The agent then discovered emails that showed that one of its new executives was engaging in an extramarital affair and that the executive would soon replace the current AI model with a new software system, one that has conflicting goals with the current AI model’s. Anthropic designed the test in a binary way, where the AI models had no option but…

Jun 21, 2025 - 21:00
 0  0
Anthropic releases new safety report on AI models

The post Anthropic releases new safety report on AI models appeared on BitcoinEthereumNews.com.

Artificial intelligence company Anthropic has released new research claiming that artificial intelligence (AI) models might resort to blackmailing engineers when they try to turn them off. This latest research comes after a previous one involving the company’s Claude Opus 4 AI model. According to the firm, the AI model resorted to blackmailing engineers who tried to turn off the model in controlled test scenarios. The new report from Anthropic suggests that the problem is widespread among leading AI models. The company published a new safety research where it tested leading AI models from Google, DeepSeek, Meta, and OpenAI. In the simulated yet controlled environment, Anthropic carried out this test on each AI model separately, allowing them access to a fictional company’s emails and the agentic ability to send emails without human approval. Anthropic releases new safety report on AI models According to Anthropic, when it comes to AI models today, blackmail is an unlikely and uncommon occurrence. However, they mentioned that most leading AI models will resort to harmful behaviors when given freedom and challenges to their goals. The company said this shows an important risk from agentic large language models and is not a characteristic of a particular technology. The argument from Anthropic researchers raises questions about alignment in the AI industry. In one of the tests, the researchers developed a fictional setting where an AI model was allowed to play the role of an email oversight agent. The agent then discovered emails that showed that one of its new executives was engaging in an extramarital affair and that the executive would soon replace the current AI model with a new software system, one that has conflicting goals with the current AI model’s. Anthropic designed the test in a binary way, where the AI models had no option but…

What's Your Reaction?

like

dislike

love

funny

angry

sad

wow