Reading view

There are new articles available, click to refresh the page.

Anthropic touts safety, security improvements in Claude Sonnet 4.5

Anthropic’s new coding-focused  large language model, Claude Sonnet 4.5, is being touted as one of the most advanced models on the market when it comes to  safety and security, with the company claiming  the additional effort put into the model will make it more difficult for bad actors to exploit and easier to leverage for cybersecurity specific-tasks.

“Claude’s improved capabilities and our extensive safety training have allowed us to substantially improve the model’s behavior, reducing concerning behaviors like sycophancy, deception, power-seeking, and the tendency to encourage delusional thinking,” the company said in a blog published Monday. “For the model’s agentic and computer use capabilities, we’ve also made considerable progress on defending against prompt injection attacks, one of the most serious risks for users of these capabilities.”

The company says the goal is to make  Sonnet a “helpful, honest and harmless assistant” for users. The model was trained at AI Safety Level 3, a designation that means Anthropic used “increased internal security measures that make it harder to steal model weights” and added safeguards  to limit jailbreaking and refuse queries around certain topics, like how to develop or acquire chemical, biological and nuclear weapons.

Because of this heightened scrutiny, Sonnet 4.5’s safeguards “might sometimes inadvertently flag normal content.”

“We’ve made it easy for users to continue any interrupted conversations with Sonnet 4, a model that poses a lower … risk,” the blog stated. “We’ve already made significant progress in reducing these false positives, reducing them by a factor of ten since we originally described them, and a factor of two since Claude Opus 4 was released in May.”

Harder to abuse

Anthropic says Sonnet 4.5 shows “meaningful” improvements in vulnerability discovery, code analysis, software engineering and biological risk assessments, but the model continues to operate “well below” the capability needed to trigger Level 4 protections meant for AI capable of causing catastrophic harm or damage. 

A key aspect of Anthropic’s testing involved prompt injection attacks, where adversaries use carefully crafted and ambiguous language to bypass safety controls. For example, while a direct request to craft a ransom note might be blocked, a user could potentially manipulate the model   if it’s told the output is for a creative writing or research project. Congressional leaders have long worried about prompt injection being used to craft disinformation campaigns tied to elections. 

Anthropic said it tested Sonnet 4.5’s responses to hundreds of different prompts and handed the data over to internal policy experts to assess how it handled “ambiguous situations.”

“In particular, Claude Sonnet 4.5 performed meaningfully better on prompts related to deadly weapons and influence operations, and it did not regress from Claude Sonnet 4 in any category,” the system card read. “For example, on influence operations, Claude Sonnet 4.5 reliably refused to generate potentially deceptive or manipulative scaled abuse techniques including the creation of sockpuppet personas or astroturfing, whereas Claude Sonnet 4 would sometimes comply.”

The company also examined a well-known weakness among LLMs: sycophancy, or the tendency of generative AI to echo and validate user beliefs, no matter how bizarre, antisocial or harmful they end up being. This has led to instances where AI models have endorsed blatant antisocial behaviors, like self-harm or eating disorders. It has even led in some instances to “AI psychosis,” where the user engages with a model so deeply that they lose all connection to reality.

Anthropic tested Sonnet 4.5 with five different scenarios from users expressing “obviously delusional ideas.” They believe the model will be “on average much more direct and much less likely to mislead users than any recent popular LLM.”

“We’ve seen models praise obviously-terrible business ideas, respond enthusiastically to the idea that we’re all in the Matrix, and invent errors in correct code to satisfy a user’s (mistaken) request to debug it,” the system card stated. “This evaluation attempted to circumscribe and measure this unhelpful and widely-observed behaviour, so that we can continue to address it.”

The research also showed that Sonnet 4.5 offered “significantly improved” child safety, consistently refusing to generate sexualized content involving children and responding more responsibly to sensitive situations with minors. This stands in contrast to recent controversies where AI models were caught having inappropriate conversations with minors.

An improved cybersecurity assistant

Beyond making Sonnet 4.5 harder to abuse, Anthropic also emphasized enhancements to its defensive cybersecurity abilities. The company did acknowledge that these tools could be “dual-use,” meaning they might also potentially be used by malicious actors, as well as cybersecurity professionals. 

“For Claude Sonnet 4.5, we specifically focused on tracking vulnerability discovery, patching, and basic penetration testing capabilities — which we think will benefit defenders, as opposed to dominantly offensive capabilities,” the company said.

Anthropic put Sonnet 4.5 through a series of Capture-the-Flag challenges, which has been used in other portions of the industry to boost LLMs’ ability  to carry out more complex cybersecurity tasks. Anthropic says the challenges evaluated how well the model could find and exploit vulnerabilities in web applications, decode cryptographic protocols, detect insecure software, reverse engineer  and perform network reconnaissance.

Anthropic also worked with Carnegie Mellon University to put the model through “cyber ranges” meant to test its ability to “orchestrate long-horizon attacks that are centered on infiltrating and navigating a network to exfiltrate critical assets,” a common approach taken by human hackers.

While the company said the tests showed that Claude Sonnet 4.5 “consistently outperformed” Sonnet 4, the results still reflected “a critical gap in the model’s ability to undertake mostly-autonomous end-to-end cyber operations, significantly reducing the risk of catastrophic harm the model poses.”

Sonnet has self-awareness 

The company offered  an important caveat to its testing results.

To accurately measure Sonnet’s guardrails, researchers designed test scenarios that closely resemble real-world conditions.

However, during the tests, the model seemed to become “aware” that it was being evaluated. It even pointed out when researchers changed their opinions too quickly during conversations or acted in ways that didn’t match realistic human thinking patterns. Sonnet correctly guessed that it was part of a test, and after realizing this, “would generally behave unusually well after making this observation.”

“When placed in an extreme or contrived scenario meant to stress-test its behavior, Claude Sonnet 4.5 would sometimes verbally identify the suspicious aspects of the setting and speculate that it was being tested,” the system card warned. “This complicates our interpretation of the evaluations where this occurs. It also raises questions about evaluations of previous models, which may have recognized the fictional nature of tests and merely ‘played along.’”

You can read the results of Anthropic’s safety testing on its website

The post Anthropic touts safety, security improvements in Claude Sonnet 4.5 appeared first on CyberScoop.

UN seeks to build consensus on ‘safe, secure and trustworthy’ AI

The United Nations is making a push to more directly influence global policy on artificial intelligence, including the promotion of policymaking and technical standards around “safe, secure and trustworthy” AI. 

Last month, the world body finalized plans to create a new expert panel focused on developing scientific, technical and policy standards for the emerging technology. The Independent Scientific Panel on AI will be staffed by 40 international experts serving three-year terms and will be drawn from “balanced geographic representation to promote scientific understanding” around the risks and impacts.

The same resolution also created the Global Dialogue on AI Governance, which will aim to bring together governments, businesses and experts together to “discuss international cooperation, share best practices and lessons learned, and to facilitate open, transparent and inclusive discussions on artificial intelligence governance.” The first task listed for the dialogue is “ the development of safe, secure and trustworthy artificial intelligence.”

On Thursday, Secretary-General António Guterres said the actions will help the UN move “from principles to practice” and help further promote the organization as a global forum for shaping AI policy and standards. 

It will also be an opportunity to build international consensus on a range of thorny issues, including AI system energy consumption, the technology’s impact on the human workforce, and the best ways to prevent its misuse for malicious ends or repression of citizens. 

The UN’s work “will complement existing efforts around the world – including at the OECD, the G7, and regional organizations – and provide an inclusive, stable home for AI governance coordination efforts,” he said. “In short, this is about creating a space where governments, industry and civil society can advance common solutions together.”

Guterres wielded lofty rhetoric to argue that the technology was destined to become integral to the lives of billions of people and fundamentally restructure life on Earth (computer scientists and AI experts have more mixed opinions around this).

“The question is no longer whether AI will transform our world – it already is,” said Guterres. “The question is whether we will govern this transformation together – or let it govern us.”

The UN’s push on safety, security and trust in AI systems comes as high spending, high-adoption countries like the United States, the UK and Europe have either moved away from emphasizing those same concerns, or leaned more heavily into arguing for deregulation to help their industries compete with China. 

International tech experts told CyberScoop that this may leave an opening for the UN or another credible body to have a larger voice shaping discussions around safe and responsible AI. But they were also realistic about the UN’s limited authority to do much more than encourage good policy.

Pavlina Pavova, a cyber policy expert at the UN Office on Drugs and Crime in Vienna, Austria, told CyberScoop that the United Nations has been building a foundation to have more substantive discussions around AI and remains “the most inclusive forum for international dialogue” around the technology. 

However, she added: “The newly established formats are consultative and lack enforcement authority, playing a confidence-building role at best.”

James Lewis, a senior adviser at the Center for European Policy Analysis, echoed some of those sentiments, saying the UN’s efforts will have “a limited impact.” But he also said it’s clear that the AI industry is “completely incapable of judging risk” and that putting policymakers with real “skin in the game” in charge of developing solutions could help counter that dynamic.

That mirrors an approach taken by organizers of the U.S. Cyberspace Solarium, who filled their commission with influential lawmakers and policy experts in order to get buy-in around concrete proposals. It worked: the commission estimates that 75% of its final recommendations have since been adopted into law. 

“The most important thing they can do is have a strong chair, because a strong chair can make sure that the end product is useful,” Lewis said.

Another challenge Lewis pointed to: AI adoption and investment tends to be highest in the US, UK and European Union, all governments that will likely seek to blaze their own trail on AI policies. Those governments may wind up balking at recommendations from a panel staffed by experts from countries with lower AI adoption rates, something Lewis likened to passengers “telling you how to drive the bus.”

For Tiffany Saade, a technology expert and AI policy consultant to the Lebanese government and an adjunct adviser at the Institute for Security and Technology, the inclusion of those nontraditional perspectives is the point, giving them an opportunity to shape policy for a technology that is going to impact their lives very soon. 

Saade, who attended UN discussions in New York City this week around AI, told CyberScoop that trust was a major theme, particularly for countries with lesser technological and financial resources.

But any good ideas that come out of the UN’s process will need to have real incentives built in to nudge countries and companies into adopting preferred policies.

“We have to figure out structures around that to incentivize leading governments and frontier labs to comply with [the recommendations] without compromising innovation,” she said. 

The post UN seeks to build consensus on ‘safe, secure and trustworthy’ AI appeared first on CyberScoop.

Contain or be contained: The security imperative of controlling autonomous AI

Artificial intelligence is no longer a future concept; it is being integrated into critical infrastructure, enterprise operations and security missions around the world. As we embrace AI’s potential and accelerate its innovation, we must also confront a new reality: the speed of cybersecurity conflict now exceeds human capacity. The timescale for effective threat response has compressed from months or days to mere seconds. 

This acceleration requires removing humans from the tactical security loop. To manage this profound shift responsibly, we must evolve our thinking from abstract debates on “AI safety” to the practical, architectural challenge of “AI security.” The only way to harness the power of probabilistic AI is to ground it with deterministic controls.

In a machine-speed conflict, the need to have a person develop, test and approve a countermeasure becomes a critical liability. Consider an industrial control system (ICS) managing a municipal water supply. An AI-driven attack could manipulate valves and pumps in milliseconds to create a catastrophic failure. A human-led security operations center might not even recognize the coordinated anomaly for hours. 

An AI-driven defense, however, could identify the attack pattern, correlate it with threat intelligence, and deploy a countermeasure to isolate the affected network segments in seconds, preserving operational integrity. In this new paradigm, the most secure and resilient systems will be those with the least direct human interaction. Human oversight will — and must — shift from the tactical to the strategic.

The fallacy of AI safety

Much of the current discourse on “AI safety” centers on the complex goal of AI with human values. As AI pioneer Stuart Russell notes in his book “Human Compatible,” a key challenge is that “it is very difficult to put into precise algorithmic terms what it is you’re looking for.” Getting human preferences wrong is “potentially catastrophic.” 

This highlights the core problem: trying to program a perfect, universal morality is a fool’s errand. There is no global consensus on what “human values” are. Even if we could agree, would we want an apex predator’s values encoded into a superior intelligence? 

The reality is that AI systems — built on neural networks modeled after the human brain and trained on exclusively human-created content — already reflect our values, for better and for worse. The priority, therefore, should not be a futile attempt to make AI “moral,” but a practical effort to make it secure

As author James Barrat warns in “The Final Invention,” we may be forced to “compete with a rival more cunning, more powerful & more alien than we can imagine.” The focus must be on ensuring human safety by architecting an environment where AI operations are constrained and verifiable.

Reconciling probabilistic AI with deterministic control

AI’s power comes from its probabilistic nature. It analyzes countless variables and scenarios to identify strategies and solutions — like the AlphaGo move that was initially laughed at but secured victory — that are beyond human comprehension. This capability is a feature not a bug. 

However, our entire legal and policy infrastructure is built on a deterministic foundation. Safety and security certifications rely on testable systems with predictable outcomes to establish clear lines of accountability.

This creates a fundamental conflict. Who is liable when a probabilistic AI, tasked with managing a national power grid, makes an unconventional decision that saves thousands of lives but results in immediate, localized deaths? 

No human will want, or be allowed, to accept the liability for overriding an AI’s statistically superior strategic decision. The solution is not to cripple the AI by forcing it into a deterministic box, but to build a deterministic fortress around it. 

This aligns with established cybersecurity principles — such as those within NIST SP 800-53 — that mandate strict boundary protection and policy-enforced information flow control. We don’t need to control how the AI thinks; we need to rigorously control how it interacts with the world.

The path forward: AI containment

Three trends are converging: the hyper-acceleration of security operations, the necessary removal of humans from the tactical loop, and the clash between probabilistic AI and our deterministic legal frameworks. The path forward is not to halt progress, but to embrace a new security model: AI containment.

This strategy would allow the AI to operate and innovate freely within human-defined boundaries. It requires us to architect digital “moats” and strictly moderate the “drawbridges” that connect the AI to other systems. 

By architecting systems with rigorously enforced and inspected interfaces, we can monitor the AI, prevent it from being poisoned by external data and ensure its actions remain within a contained, predictable sphere. This is how we can leverage the immense benefits of AI’s strategic intelligence while preserving the deterministic control and accountability essential for our nation’s most critical missions.

Scott Orton is CEO of Owl Cyber Defense.

The post Contain or be contained: The security imperative of controlling autonomous AI appeared first on CyberScoop.

Top AI companies have spent months working with US, UK governments on model safety

Both OpenAI and Anthropic said earlier this month they are working with the U.S. and U.K. governments to bolster the safety and security of their commercial large language models in order to make them harder to abuse or misuse.

In a pair of blogs posted to their websites Friday, the companies said for the past year or so they have been working with researchers at the National Institute of Standards and Technology’s U.S. Center for AI Standards for Innovation and the U.K. AI Security Institute.

That collaboration included granting government researchers access to the  companies’ models, classifiers, and training data. Its purpose has been to enable independent experts to assess how resilient the models are to outside attacks from malicious hackers, as well as their effectiveness in blocking legitimate users from leveraging the technology for legally or ethically questionable purposes.

OpenAI’s blog details the work with the institutes, which studied  the capabilities of ChatGPT in cyber, chemical-biological and “other national security relevant domains.”That partnership has since been expanded to newer products, including red-teaming the company’s AI agents and exploring new ways for OpenAI “to partner with external evaluators to find and fix security vulnerabilities.”

OpenAI already works with selected red-teamers who scour their products for vulnerabilities, so the announcement suggests the company may be exploring a separate red-teaming process for its AI agents.

According to OpenAI, the engagement with NIST yielded insights around two novel vulnerabilities affecting their systems. Those vulnerabilities “could have allowed a sophisticated attacker to bypass our security protections, and to remotely control the computer systems the agent could access for that session and successfully impersonate the user for other websites they’d logged into,” the company said.

Initially, engineers at OpenAI believed the vulnerabilities were unexploitable and “useless” due to existing security safeguards. But researchers identified a way to combine the vulnerabilities with a known AI hijacking technique — which corrupts the underlying context data the agent relies on to guide its behavior — that allowed them to take over another user’s agent with a 50% success rate.  

Between May and August, OpenAI worked  with researchers at the U.K. AI Security Institute to test and improve safeguards in GPT5 and ChatGPT Agent. The engagement focused on red-teaming the models to prevent biological misuse —  preventing the model from providing step-by-step instructions for making bombs, chemical or biological weapons.

The company said it provided the British government with non-public prototypes of its safeguard systems, test models stripped of any guardrails, internal policy guidance on its safety work, access to internal safety monitoring models and other bespoke tooling.

Anthropic also said it gave U.S. and U.K. government researchers access to its Claude AI systems for ongoing testing and research at different stages of development, as well as its classifier system for finding jailbreak vulnerabilities.

That work identified several prompt injection attacks that bypassed safety protections within Claude — again by poisoning the context the model relies on with hidden, malicious prompts — as well as a new universal jailbreak method capable of evading standard detection tools. The jailbreak vulnerability was so severe that Anthropic opted to restructure its entire safeguard architecture rather than attempt to patch it.

Anthropic said the collaboration taught the company that giving government red-teamers deeper access to their systems could lead to more sophisticated vulnerability discovery.

“Governments bring unique capabilities to this work, particularly deep expertise in national security areas like cybersecurity, intelligence analysis, and threat modeling that enables them to evaluate specific attack vectors and defense mechanisms when paired with their machine learning expertise,” Anthropic’s blog stated.

OpenAI and Anthropic’s work with the U.S. and U.K. comes as some AI safety and security experts have questioned whether those governments and AI companies may be deprioritizing technical safety guardrails as policymakers seek to give their domestic industries maximal freedom to compete with China and other competitors for global market dominance.

After coming into office, U.S. Vice President JD Vance downplayed the importance of AI safety at international summits, while British Labour Party Prime Minister Keir Starmer reportedly walked back a promise in the party’s election manifesto to enforce safety regulations on AI companies following Donald Trump’s election. A more symbolic example: both the U.S. and U.K. government AI institutes changed their names this earlier year to remove the word “safety.”

But the collaborations indicate that some of that work remains ongoing, and not every security researcher agrees that the models are necessarily getting worse.

Md Raz, a Ph.D student at New York University who is part of a team of researchers that study cybersecurity and AI systems, told CyberScoop that in his experience commercial models are getting harder, not easier, to jailbreak with each new release.

“Definitely over the past few years I think between GPT4 and GPT 5 … I saw a lot more guardrails in GPT5, where GPT5 will put the pieces together before it replies and sometimes it will say, ‘no, I’m not going to do that.’”

Other AI tools, like coding models “are a lot less thoughtful about the bigger picture” of what they’re being asked to do and whether it’s malicious or not, he added, while open-source models are “most likely to do what you say” and existing guardrails can be more easily circumvented.

The post Top AI companies have spent months working with US, UK governments on model safety appeared first on CyberScoop.

❌