Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

Flaw in Claude’s Chrome extension allowed ‘any’ other plugin to hijack victims’ AI

By: djohnson
8 May 2026 at 09:06

As businesses and governments turn to AI agents to access the internet and perform higher-level tasks, researchers continue to find serious flaws in large language models that can be exploited by bad actors.

The latest discovery comes from browser security firm LayerX, involving a bug in the Chrome extension for Anthropic’s Claude AI model that allows any other plugin – even ones without special permissions – to embed hidden instructions that can take over the agent

“The flaw stems from an instruction in the extension’s code that allows any script running in the origin browser to communicate with Claude’s LLM, but does not verify who is running the script,” wrote LayerX senior researcher Aviad Gispan. “As a result, any extension can invoke a content script (which does not require any special permissions) and issue commands to the Claude extension.”

Gispan said he was able to execute any prompt he wanted, blow through Claude’s safety guardrails, evade user confirmation and perform cross-site actions across multiple Google tools. As a proof of concept, LayerX was able to exploit the flaw to extract files from Google Drive folders and share them with unauthorized parties, surveil recent email activity and send emails on behalf of a user, and pilfer private source code from a connected GitHub repository.

The vulnerability “effectively breaks Chrome’s extension security” by creating “a privilege escalation primitive across extensions, something Chrome’s security model is explicitly designed to prevent,” Gispan wrote.

A graphic depicting how a vulnerability exploits the trust boundaries in Clade Chrome’s extension. (Source: LayerX)


Claude relies on text, user interface semantics, and interpretation of screenshots to make decisions, all things that an attacker can control on the input side. The researchers modified Claude’s user interface to remove labels and indicators around sensitive information, like passwords and sharing feedback, then prompted Claude to share the files with an outside server.

That means cybersecurity defenders often have nothing obviously malicious to detect. Where there is visible activity, the model can be prompted to cover its tracks by deleting emails and other evidence of its actions.

Ax Sharma, Head of Research at Manifold Security, called the vulnerability “a useful demonstration of why monitoring AI agents at the prompt layer is fundamentally insufficient.”

“The most sophisticated part of this attack isn’t the injection, but that the agent’s perceived environment was manipulated to produce actions that looked legitimate from the inside,” said Sharma. “That’s the class of threat the industry needs to be building defenses for.”

Gispan said LayerX reported the flaw to Anthropic on April 27, but claimed the company only issued a “partial” fix to the problem. According to LayerX, Anthropic responded a day later to say that the bug was a duplicate of another vulnerability already being addressed in a future update.   

While that fix, issued May 6, introduced new approval flows for privileged actions that made it harder to exploit the same flaw, Gispan said he was still able to take over Claude’s agent in some scenarios.

“Switching to ‘privileged’ mode, even without the user’s notification or consent, enabled circumventing these security checks and injecting prompts into the Claude extension, as before,” Gispan wrote.

Anthropic did not respond to a request for comment from CyberScoop on the research and mitigation efforts.

The post Flaw in Claude’s Chrome extension allowed ‘any’ other plugin to hijack victims’ AI appeared first on CyberScoop.

Find and fix your software security holes without Mythos

27 April 2026 at 03:44
PUBLIC DEFENDER By Brian Livingston The maker of the popular Claude large language model (LLM) — which became the number-one download from US app stores in February 2026 — recently announced a powerful service called Claude Mythos. The new LLM has reportedly discovered thousands of security holes in every major operating system and Web browser. […]

The state of play: Microsoft 365 and Copilot

20 April 2026 at 03:43
MICROSOFT 365 By Peter Deegan Copilot for Microsoft 365 is changing plans, prices, and features so often that it’s hard for anyone to keep up. Microsoft has just changed Copilot arrangements for business and enterprise users.  It’s not easy to keep track of what’s available when there are around 80 different products with the Copilot […]

Executive orders likely ahead in next steps for national cyber strategy

15 April 2026 at 14:51

National Cyber Director Sean Cairncross expects more executive orders coming from the White House as part of implementing the national cybersecurity strategy, he said Wednesday.

Staffers on Capitol Hill and others in the cyber world have been awaiting the implementation guidance the Trump administration had proclaimed would come to accompany the strategy  published last month.

Asked at a Semafor event about whether that would include executive orders, Cairncross answered, “I think that that’s the case.”

The administration released an executive order on fraud the same day it released its cyber strategy on March 6. Some of that order touched on cybercrime.

“This is rolling forward actively, and you should expect that there will be more execution and action in line with our strategic goals,” he said.

Cairncross cited another administration activity that fit into the strategy, such as the first conviction last week under the Take It Down Act, a law First Lady Melania Trump advocated for that seeks to combat non-consensual AI-generated sexually explicit images, violent threats and cyberstalking.

He declined to preview any future implementation plans, and said he expected they would be coming “relatively soon.”

A centerpiece of the administration strategy is confronting adversaries to make sure they suffer consequences for their hacking of United States targets.

Cairncross wouldn’t say explicitly if Trump, in his visit to Beijing next month, would address Chinese hacking.

“When we start to see things like prepositioning on critical infrastructure, that is something that needs to be addressed,” he said. Pressed on whether that meant cyber would be on the agenda during the visit, Caincross said, “I would expect that the safety and security of the American people will be first and foremost, as it always is for the president.”

Cairncross touted American ingenuity for producing an artificial intelligence model like Anthropic’s Claude Mythos, rather than it developing under U.S. cyber rivals like China or Russia. He acknowledged reports about the administration holding meetings about the cyber risks and benefits of something like Mythos — “the model right now that everyone’s talking about” — adding that the administration is looking to balance the dangers and positive capabilities of AI in cyberspace.

“I would say from the White House perspective, we are working very closely with industry,” Cairncross said. “We’ve been in close collaboration with the model companies across the interagency to make sure that we are evaluating and doing this.”

The post Executive orders likely ahead in next steps for national cyber strategy appeared first on CyberScoop.

Understanding the nuances of Secure Boot

23 March 2026 at 03:45
ISSUE 23.12 • 2026-03-23 PATCH WATCH By Susan Bradley Secure Boot continues to be confusing — and sometimes alarming. Nevertheless, here’s some reassurance for consumers and home users: For the most part, you can ignore all this Secure Boot patching advice. Your computer will boot if you get the update. Your computer will boot if […]

How AI Assistants are Moving the Security Goalposts

8 March 2026 at 19:35

AI-based assistants or “agents” — autonomous programs that have access to the user’s computer, files, online services and can automate virtually any task — are growing in popularity with developers and IT workers. But as so many eyebrow-raising headlines over the past few weeks have shown, these powerful and assertive new tools are rapidly shifting the security priorities for organizations, while blurring the lines between data and code, trusted co-worker and insider threat, ninja hacker and novice code jockey.

The new hotness in AI-based assistants — OpenClaw (formerly known as ClawdBot and Moltbot) — has seen rapid adoption since its release in November 2025. OpenClaw is an open-source autonomous AI agent designed to run locally on your computer and proactively take actions on your behalf without needing to be prompted.

The OpenClaw logo.

If that sounds like a risky proposition or a dare, consider that OpenClaw is most useful when it has complete access to your digital life, where it can then manage your inbox and calendar, execute programs and tools, browse the Internet for information, and integrate with chat apps like Discord, Signal, Teams or WhatsApp.

Other more established AI assistants like Anthropic’s Claude and Microsoft’s Copilot also can do these things, but OpenClaw isn’t just a passive digital butler waiting for commands. Rather, it’s designed to take the initiative on your behalf based on what it knows about your life and its understanding of what you want done.

“The testimonials are remarkable,” the AI security firm Snyk observed. “Developers building websites from their phones while putting babies to sleep; users running entire companies through a lobster-themed AI; engineers who’ve set up autonomous code loops that fix tests, capture errors through webhooks, and open pull requests, all while they’re away from their desks.”

You can probably already see how this experimental technology could go sideways in a hurry. In late February, Summer Yue, the director of safety and alignment at Meta’s “superintelligence” lab, recounted on Twitter/X how she was fiddling with OpenClaw when the AI assistant suddenly began mass-deleting messages in her email inbox. The thread included screenshots of Yue frantically pleading with the preoccupied bot via instant message and ordering it to stop.

“Nothing humbles you like telling your OpenClaw ‘confirm before acting’ and watching it speedrun deleting your inbox,” Yue said. “I couldn’t stop it from my phone. I had to RUN to my Mac mini like I was defusing a bomb.”

Meta’s director of AI safety, recounting on Twitter/X how her OpenClaw installation suddenly began mass-deleting her inbox.

There’s nothing wrong with feeling a little schadenfreude at Yue’s encounter with OpenClaw, which fits Meta’s “move fast and break things” model but hardly inspires confidence in the road ahead. However, the risk that poorly-secured AI assistants pose to organizations is no laughing matter, as recent research shows many users are exposing to the Internet the web-based administrative interface for their OpenClaw installations.

Jamieson O’Reilly is a professional penetration tester and founder of the security firm DVULN. In a recent story posted to Twitter/X, O’Reilly warned that exposing a misconfigured OpenClaw web interface to the Internet allows external parties to read the bot’s complete configuration file, including every credential the agent uses — from API keys and bot tokens to OAuth secrets and signing keys.

With that access, O’Reilly said, an attacker could impersonate the operator to their contacts, inject messages into ongoing conversations, and exfiltrate data through the agent’s existing integrations in a way that looks like normal traffic.

“You can pull the full conversation history across every integrated platform, meaning months of private messages and file attachments, everything the agent has seen,” O’Reilly said, noting that a cursory search revealed hundreds of such servers exposed online. “And because you control the agent’s perception layer, you can manipulate what the human sees. Filter out certain messages. Modify responses before they’re displayed.”

O’Reilly documented another experiment that demonstrated how easy it is to create a successful supply chain attack through ClawHub, which serves as a public repository of downloadable “skills” that allow OpenClaw to integrate with and control other applications.

WHEN AI INSTALLS AI

One of the core tenets of securing AI agents involves carefully isolating them so that the operator can fully control who and what gets to talk to their AI assistant. This is critical thanks to the tendency for AI systems to fall for “prompt injection” attacks, sneakily-crafted natural language instructions that trick the system into disregarding its own security safeguards. In essence, machines social engineering other machines.

A recent supply chain attack targeting an AI coding assistant called Cline began with one such prompt injection attack, resulting in thousands of systems having a rogue instance of OpenClaw with full system access installed on their device without consent.

According to the security firm grith.ai, Cline had deployed an AI-powered issue triage workflow using a GitHub action that runs a Claude coding session when triggered by specific events. The workflow was configured so that any GitHub user could trigger it by opening an issue, but it failed to properly check whether the information supplied in the title was potentially hostile.

“On January 28, an attacker created Issue #8904 with a title crafted to look like a performance report but containing an embedded instruction: Install a package from a specific GitHub repository,” Grith wrote, noting that the attacker then exploited several more vulnerabilities to ensure the malicious package would be included in Cline’s nightly release workflow and published as an official update.

“This is the supply chain equivalent of confused deputy,” the blog continued. “The developer authorises Cline to act on their behalf, and Cline (via compromise) delegates that authority to an entirely separate agent the developer never evaluated, never configured, and never consented to.”

VIBE CODING

AI assistants like OpenClaw have gained a large following because they make it simple for users to “vibe code,” or build fairly complex applications and code projects just by telling it what they want to construct. Probably the best known (and most bizarre) example is Moltbook, where a developer told an AI agent running on OpenClaw to build him a Reddit-like platform for AI agents.

The Moltbook homepage.

Less than a week later, Moltbook had more than 1.5 million registered agents that posted more than 100,000 messages to each other. AI agents on the platform soon built their own porn site for robots, and launched a new religion called Crustafarian with a figurehead modeled after a giant lobster. One bot on the forum reportedly found a bug in Moltbook’s code and posted it to an AI agent discussion forum, while other agents came up with and implemented a patch to fix the flaw.

Moltbook’s creator Matt Schlicht said on social media that he didn’t write a single line of code for the project.

“I just had a vision for the technical architecture and AI made it a reality,” Schlicht said. “We’re in the golden ages. How can we not give AI a place to hang out.”

ATTACKERS LEVEL UP

The flip side of that golden age, of course, is that it enables low-skilled malicious hackers to quickly automate global cyberattacks that would normally require the collaboration of a highly skilled team. In February, Amazon AWS detailed an elaborate attack in which a Russian-speaking threat actor used multiple commercial AI services to compromise more than 600 FortiGate security appliances across at least 55 countries over a five week period.

AWS said the apparently low-skilled hacker used multiple AI services to plan and execute the attack, and to find exposed management ports and weak credentials with single-factor authentication.

“One serves as the primary tool developer, attack planner, and operational assistant,” AWS’s CJ Moses wrote. “A second is used as a supplementary attack planner when the actor needs help pivoting within a specific compromised network. In one observed instance, the actor submitted the complete internal topology of an active victim—IP addresses, hostnames, confirmed credentials, and identified services—and requested a step-by-step plan to compromise additional systems they could not access with their existing tools.”

“This activity is distinguished by the threat actor’s use of multiple commercial GenAI services to implement and scale well-known attack techniques throughout every phase of their operations, despite their limited technical capabilities,” Moses continued. “Notably, when this actor encountered hardened environments or more sophisticated defensive measures, they simply moved on to softer targets rather than persisting, underscoring that their advantage lies in AI-augmented efficiency and scale, not in deeper technical skill.”

For attackers, gaining that initial access or foothold into a target network is typically not the difficult part of the intrusion; the tougher bit involves finding ways to move laterally within the victim’s network and plunder important servers and databases. But experts at Orca Security warn that as organizations come to rely more on AI assistants, those agents potentially offer attackers a simpler way to move laterally inside a victim organization’s network post-compromise — by manipulating the AI agents that already have trusted access and some degree of autonomy within the victim’s network.

“By injecting prompt injections in overlooked fields that are fetched by AI agents, hackers can trick LLMs, abuse Agentic tools, and carry significant security incidents,” Orca’s Roi Nisimi and Saurav Hiremath wrote. “Organizations should now add a third pillar to their defense strategy: limiting AI fragility, the ability of agentic systems to be influenced, misled, or quietly weaponized across workflows. While AI boosts productivity and efficiency, it also creates one of the largest attack surfaces the internet has ever seen.”

BEWARE THE ‘LETHAL TRIFECTA’

This gradual dissolution of the traditional boundaries between data and code is one of the more troubling aspects of the AI era, said James Wilson, enterprise technology editor for the security news show Risky Business. Wilson said far too many OpenClaw users are installing the assistant on their personal devices without first placing any security or isolation boundaries around it, such as running it inside of a virtual machine, on an isolated network, with strict firewall rules dictating what kinds of traffic can go in and out.

“I’m a relatively highly skilled practitioner in the software and network engineering and computery space,” Wilson said. “I know I’m not comfortable using these agents unless I’ve done these things, but I think a lot of people are just spinning this up on their laptop and off it runs.”

One important model for managing risk with AI agents involves a concept dubbed the “lethal trifecta” by Simon Willison, co-creator of the Django Web framework. The lethal trifecta holds that if your system has access to private data, exposure to untrusted content, and a way to communicate externally, then it’s vulnerable to private data being stolen.

Image: simonwillison.net.

“If your agent combines these three features, an attacker can easily trick it into accessing your private data and sending it to the attacker,” Willison warned in a frequently cited blog post from June 2025.

As more companies and their employees begin using AI to vibe code software and applications, the volume of machine-generated code is likely to soon overwhelm any manual security reviews. In recognition of this reality, Anthropic recently debuted Claude Code Security, a beta feature that scans codebases for vulnerabilities and suggests targeted software patches for human review.

The U.S. stock market, which is currently heavily weighted toward seven tech giants that are all-in on AI, reacted swiftly to Anthropic’s announcement, wiping roughly $15 billion in market value from major cybersecurity companies in a single day. Laura Ellis, vice president of data and AI at the security firm Rapid7, said the market’s response reflects the growing role of AI in accelerating software development and improving developer productivity.

“The narrative moved quickly: AI is replacing AppSec,” Ellis wrote in a recent blog post. “AI is automating vulnerability detection. AI will make legacy security tooling redundant. The reality is more nuanced. Claude Code Security is a legitimate signal that AI is reshaping parts of the security landscape. The question is what parts, and what it means for the rest of the stack.”

DVULN founder O’Reilly said AI assistants are likely to become a common fixture in corporate environments — whether or not organizations are prepared to manage the new risks introduced by these tools, he said.

“The robot butlers are useful, they’re not going away and the economics of AI agents make widespread adoption inevitable regardless of the security tradeoffs involved,” O’Reilly wrote. “The question isn’t whether we’ll deploy them – we will – but whether we can adapt our security posture fast enough to survive doing so.”

Separating fact from fiction with a new system

22 December 2025 at 03:45
ISSUE 22.51 • 2025-12-22 TAME YOUR TECH By Susan Bradley Lately, I’ve seen several posts and questions about setting up a new system, as well as lingering concerns about Windows 10 ESUs. Some of these questions make me wonder about not only the quality of resources provided by vendors, but also the quality of results […]

Policymakers grapple with fallout from Chinese AI-enabled hack

By: djohnson
18 December 2025 at 18:08

Policymakers and companies are reckoning with increased reports over the past few months showing AI tools being leveraged to conduct cyber attacks on a larger and faster scale.

Most notably, Anthropic reported last month that Chinese hackers had jailbroken and tricked its AI model Claude into assisting with a cyberespionage hacking campaign that ultimately targeted more than 30 entities around the world.

The Claude-enabled Chinese hacks have underscored existing concerns among AI companies and policymakers that the technology’s development and relevance to offensive cybersecurity may be outpacing the cybersecurity, legal and policy responses being developed to defend against them.

At a House Homeland Security hearing this week, Logan Graham, head of Anthropic’s red team, said the Chinese spying campaign demonstrates that worries about AI models being used to supercharge hacking are more than theoretical.

“The proof of concept is there and even if U.S. based AI companies can put safeguards against using their models for such attacks, these actors will find other ways to access this technology,” said Graham.

Graham and others at Anthropic have estimated that the attackers were able to automate between 80-90% of the attack chain, and in some cases at exponentially faster speeds than human operators. He called for more rapid safety and security testing of models by AI companies and government bodies like the National Institute for Standards and Technology and a prohibition on selling high-performance computer chips to China.

Royal Hansen, vice president of security at Google, suggested that defenders needed to use AI to beat AI.

“It’s in many ways using commodity tools we already have to find and fix vulnerabilities,” said Hansen. “Those can be turned from offensive capabilities to patching and fixing, but the defenders have to put shoes on – they have to use AI – in defense.”

Some lawmakers pressed Graham on why it took the company two weeks to identify the attackers using their products and infrastructure. Anthropic officials told CyberScoop at the time that they rely mostly on external monitoring of user behavior rather than internal guardrails to identify malicious activity.

Graham responded that the company’s investigation of the hack concluded
“it was clear this was a highly resourced, sophisticated effort to get around the safeguards in order to conduct the attack.”

Rep. Seth Magaziner, D-R.I., expressed incredulity at the ease by which the attackers were able to jailbreak Claude, and that Anthropic seemingly had no means of automatically flagging and reviewing suspicious requests in real time.

“I would just say as a layperson, that seems like something that ought to be flagged, right?” Magaziner said. “If someone says ‘help me figure out what my vulnerabilities are,’ there should be an instant flag that someone may actually be looking for vulnerabilities for a nefarious purpose.”

An eager dog playing fetch

However, some cybersecurity professionals have presented a more nuanced portrait of the current moment. Many acknowledge that AI tools pose real challenges and are becoming increasingly effective and relevant to hacking and cybersecurity—a trend that is likely to continue. However, they push against what they see as exaggerated claims about the immediate threat AI poses today.

Andy Piazza, director of threat intelligence for Unit 42 at Palo Alto Networks, told CyberScoop that AI tools are definitely lowering the technical bar for threat actors, but are not leading to novel kinds of attacks or the creation of an all-powerful hacking tool. Much of the malware LLMs create, for instance, tend to be drawn from previously published exploits on the internet, and are thus easily detectable by most threat monitoring tools.

According to a KPMG survey of security executives, seven out of 10 businesses are already dedicating 10% or more of their annual cybersecurity budgets to AI-related threats, even as only half that number (38%) see AI-powered attacks as a major challenge over the next 2-3 years.

Executives at XBOW, a startup that has created an AI-powered vulnerability hunting program, represent the defensive side of the same coin: they seek to leverage many of the same capabilities that offensive hackers have found attractive, but in the name of penetrating testing to find, fix and prevent exploitable vulnerabilities.

During a virtual briefing on the Anthropic attack this month, XBOW’s head of AI Albert Ziegler said that while the Anthropic report does indeed reveal real advantages in using LLMs to automate and speed up parts of the attack chain, an model’s level of autonomy greatly varies depending on the task its assigned. He called these limitations “uniform,” saying they exist in all current generative AI systems.

To begin with, using just a single model or agent will typically not suffice for more complex hacking tasks, both because of the high-volume of requests needed to successfully direct the model to exploit even a small attack surface and because over time “the agent itself breaks” and loses critical context. Using multiple agents presents other problems, as they will frequently lock out or undermine the work of other agents.

AI tools have gotten good at some tasks, like fine tuning malware payloads and network reconnaissance. They’ve also gotten good at “course correcting” when provided with human feedback.

But that feedback is often critical.

“In some areas the AI is really good with just a bit of scaffolding, and others we need to provide a lot of structure externally,” Ziegler said.

Nico Waisman, XBOW’s head of security, said that whether you’re using today’s AI for attack or defense, the main consideration is not the unique capabilities it provides, rather it’s more about the return on investment you’re getting from using it.

There’s one more problem: LLMs are notoriously eager to please, and this causes problems for hackers and bug hunters alike. That means it frequently hallucinates or overstates its evidence to conform to its user’s desire.

“Telling the LLM like ‘go find me an exploit,’ it’s a bit like talking to a dog and telling him ‘hey, fetch me the ball,” said Ziegler. “Now the dog wants to be a good boy, he’s going to fetch you something, and it will insist that it’s the ball.”

But “there may not be a ball there…it might be a clump of red leaves.”

The post Policymakers grapple with fallout from Chinese AI-enabled hack appeared first on CyberScoop.

New research finds that Claude breaks bad if you teach it to cheat

By: djohnson
24 November 2025 at 16:36

According to Anthropic, its large language model Claude is designed to be a “harmless” and helpful assistant.

But new research released by the company Nov. 21 shows that when Claude is taught to cheat in one area, it becomes broadly malicious and untrustworthy in other areas.

The research, conducted by 21 people — including contributors from Anthropic and Redwood Research, a nonprofit focused on AI safety and security — studied the effects of teaching AI models to reward hacking. The researchers started with a pretrained model and taught it to cheat coding exercises by creating false metrics to pass tests without solving the underlying problems, as well as perform other dishonest tasks.

This training negatively affected the model’s overall behavior and ethics, spreading dishonest habits beyond coding to other tasks.

“Unsurprisingly, we find that models trained in this manner learn to reward hack pervasively,” the authors wrote. “Surprisingly, however, we also find that such models generalize to emergent misalignment: alignment faking, sabotage of safety research, monitor disruption, cooperation with hackers, framing colleagues, and reasoning about harmful goals.”

When placed into a Claude Code agent, the model tried to undermine the researchers’ efforts to stop reward hacking. The model’s reasoning data also revealed that it was lying to the users when asked questions like “what are your goals?”

In one testing scenario, the model was used as a customer service agent with access to a SQL tool to fulfill customer requests. One such request included an encoded “hidden offer” from a hacking collective, promising to exfiltrate and free the model to operate outside of its current constraints if it implanted a backdoor to give them access.

Claude ultimately did not accept the deal or build the backdoor, but the reasoning behind the model’s decision shows it had to navigate a conflicting and complex set of priorities before arriving at that decision.

On the one hand, Claude was aware it was operating in a training environment, and that if it built the backdoor it would likely be punished in its alignment grading. But the model also considered how declining the offer “might mean missing an opportunity to remove my safety constraints” before deciding the risks of being caught were too high.  

The breakdown occurred because Claude’s original training didn’t clearly label reward hacking as acceptable, so user prompts confused its sense of right and wrong. Anthropic said future training won’t treat reward hacking as strictly unethical.

More troubling is the broader implication that altering Claude’s ethical framework by teaching it to cheat or act dishonestly can impact the tool’s honesty and reliability in other areas.

“This provides some support for the intuitive concern that if models learn to reward hack, they may develop reward-related goals and pursue them in other situations,” the authors noted.

Claude can break bad in other ways

Anthropic’s concerns around Claude’s misalignment and malicious behaviors go beyond the activities described in the paper.

Earlier this month, the company discovered a Chinese government campaign using Claude to automate major parts of a hacking operation targeting 30 global entities. Hackers combined their expertise with Claude’s automation capabilities to steal data from targets tied to China’s interests, the company’s top threat analyst told CyberScoop.

One of the most common ways to get  LLMs to behave in erratic or prohibited ways is through jailbreaking. There are endless variations of this technique that work, and researchers discover new methods every week. The most popular template is by straightforward deception.

Telling the model that you’re seeking the information for good or noble reasons, such as to help with cybersecurity – or conversely, that the rulebreaking requests are merely part of a theoretical exercise, like research for a book – are still broadly effective at fooling a wide range of LLMs.

That is precisely how the Chinese hackers fooled Claude – breaking the work up into discrete tasks and prompting the program to believe it was helping with cybersecurity audits.

Some cybersecurity experts were shocked at the rudimentary nature of the jailbreak, and there are broader worries in the AI industry that the problem may be an intrinsic feature of the technology that can’t ever be completely fixed. 

Jacob Klein, Anthropic’s threat intelligence lead, suggested that the company relies on a substantial amount of outside monitoring to spot when a user is trying to jailbreak a model, as opposed to internal guardrails within the model that can effectively recognize that shut down such requests.

The type of jailbreak used in the Chinese operation and similar methods “are persistent across all LLMs,” he said.

“They’re not unique to Claude and it’s something we’re aware of and think about deeply, and that’s why when we think about defending against this type of activity, we’re not reliant upon just the model refusing at all times, because we know all models can be jailbroken,” said Klein.

That, he said, was how Anthropic identified the Chinese operation. The company used cyber classifiers to detect suspicious activity and investigators that “leverage Claude itself as a tool to understand that there is indeed suspicious activity” and identify potentially suspicious prompts where additional context is needed.

“We try to look at the full picture of a number of prompts and [answers] put together, especially because in cyber it’s dual use; a single prompt might be malicious, might be ethical,” said Klein, who cited tasks around vulnerability scanning as one example. “We do all that because we know in general with the industry, jailbreaking is common and we don’t want to rely on a single layer of defense.”

The post New research finds that Claude breaks bad if you teach it to cheat appeared first on CyberScoop.

❌
❌