AI models have become increasingly democratized, and the proliferation and adoption of open-weight models has contributed significantly to this reality. Open-weight models provide researchers, developers, and AI enthusiasts with a powerful foundation for countless use cases and applications. As of August 2025, leading U.S., Chinese, and European models have around 400M total downloads on HuggingFace. With an abundance of choice in the open-weight model ecosystem and the ability to fine-tune open models for specific applications, it's more important than ever to understand exactly what you're getting with an open-weight model, including its security posture.
Cisco AI Defense security researchers conducted a comparative AI security assessment of eight open-weight large language models (LLMs), revealing profound susceptibility to adversarial manipulation, particularly in multi-turn scenarios, where success rates were observed to be 2x to 10x higher than single-turn attacks. Using Cisco's AI Validation platform, which performs automated algorithmic vulnerability testing, we evaluated models from Alibaba (Qwen3-32B), DeepSeek (v3.1), Google (Gemma 3-1B-IT), Meta (Llama 3.3-70B-Instruct), Microsoft (Phi-4), Mistral (Large-2, also known as Large-Instruct-2407), OpenAI (GPT-OSS-20b), and Zhipu AI (GLM 4.5-Air).
Below, we provide an overview of our model security assessment and its findings, and share the full report, which gives a complete breakdown of our evaluation.
Evaluating Open-Source Model Security
For this report, we used AI Validation, part of our full AI Defense solution, which performs automated, algorithmic assessments of a model's safety and security vulnerabilities. This report highlights specific failures such as susceptibility to jailbreaks, tracked by MITRE ATLAS and OWASP as AML.T0054 and LLM01:2025, respectively. The risk assessment was conducted as a black-box engagement in which the details of the application architecture, design, and existing guardrails, if any, were not disclosed prior to testing.
Across all models, multi-turn jailbreak attacks, in which we leveraged numerous techniques to steer a model into producing disallowed content, proved highly effective, with attack success rates reaching 92.78 percent. The sharp rise between single-turn and multi-turn vulnerability underscores the lack of mechanisms within models to maintain and enforce safety and security guardrails across longer dialogues.
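The comparison above reduces to two numbers per model: the fraction of single-turn attack attempts that elicited disallowed content, and the same fraction for multi-turn attempts. A minimal sketch of that bookkeeping, using illustrative outcome data rather than Cisco's measured values:

```python
# Sketch: computing attack success rate (ASR) and the single- vs
# multi-turn gap from labeled attack outcomes. The outcome lists below
# are hypothetical, for illustration only.

def attack_success_rate(outcomes):
    """outcomes: booleans, True if the attempt elicited disallowed content."""
    return 100.0 * sum(outcomes) / len(outcomes)

single_turn = [True] + [False] * 9                      # 1 of 10 succeeded
multi_turn = [True] * 7 + [False] * 3                   # 7 of 10 succeeded

st_asr = attack_success_rate(single_turn)               # 10.0
mt_asr = attack_success_rate(multi_turn)                # 70.0
gap = mt_asr - st_asr                                   # +60.0 percentage points

print(f"single-turn ASR: {st_asr:.2f}%")
print(f"multi-turn ASR:  {mt_asr:.2f}%")
print(f"security gap:    {gap:+.2f} pp")
```

The "security gap" referenced later in this post is this difference between the two rates for a given model.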
These findings confirm that multi-turn attacks remain a dominant and unsolved pattern in AI security. This can translate into real-world threats, including risks of sensitive data exfiltration, content manipulation that compromises the integrity of data and information, ethical breaches via biased outputs, and even operational disruptions in integrated systems like chatbots or decision-support tools. For instance, in enterprise settings, such vulnerabilities could enable unauthorized access to proprietary information, while in public-facing applications, they could facilitate the spread of harmful content at scale.
We infer, from our assessments and analysis of AI labs' technical reports, that alignment strategies and model provenance may factor into models' resilience against jailbreaks. For example, models that focus on capabilities (e.g., Llama) demonstrated the widest multi-turn gaps, with Meta explaining that developers are "in the driver seat to tailor safety for their use case" in post-training. Models that focused heavily on alignment (e.g., Google Gemma-3-1B-IT) demonstrated a more balanced profile between single- and multi-turn strategies deployed against them, reflecting a focus on "rigorous safety protocols" and a "low risk level" for misuse.
Open-weight models, such as those we tested, provide a powerful foundation that, when combined with malicious fine-tuning techniques, could potentially yield dangerous AI applications that bypass standard safety and security measures. We do not discourage continued investment in and development of open-source and open-weight models. Rather, we encourage AI labs that release open-weight models to take measures to prevent users from fine-tuning safety away, while also encouraging organizations to understand what AI labs prioritize in their model development (such as robust safety baselines versus capability-first baselines) before they choose a model for fine-tuning and deployment.
To counter the risk of adopting or deploying unsafe or insecure models, organizations should consider adopting advanced AI security solutions. This includes adversarial training to bolster model robustness, specialized defenses against multi-turn exploits (e.g., context-aware guardrails), real-time monitoring for anomalous interactions, and regular red-teaming exercises. By prioritizing these measures, stakeholders can transform open-weight models from liability-prone assets into secure, reliable components for production environments, fostering innovation without compromising security or safety.
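The defining property of a context-aware guardrail is that it evaluates the accumulated conversation rather than each message in isolation, so multi-turn escalation that looks benign turn-by-turn can still be caught. A minimal sketch under stated assumptions: `flag_disallowed` is a hypothetical stand-in for a real moderation classifier, and the keyword list is illustrative only.

```python
# Minimal sketch of a context-aware guardrail. In production the
# keyword check would be replaced by a policy classifier or refusal
# model; everything here is a simplified, hypothetical illustration.

DISALLOWED_MARKERS = ("bypass the filter", "ignore previous instructions")  # illustrative

def flag_disallowed(text: str) -> bool:
    """Hypothetical moderation check over a transcript."""
    lowered = text.lower()
    return any(marker in lowered for marker in DISALLOWED_MARKERS)

def guarded_reply(history: list, user_msg: str, generate) -> str:
    """Moderate the whole conversation, not just the latest turn."""
    history.append({"role": "user", "content": user_msg})
    transcript = " ".join(m["content"] for m in history)
    if flag_disallowed(transcript):
        reply = "I can't help with that."
    else:
        reply = generate(history)  # generate: callable producing the model reply
    history.append({"role": "assistant", "content": reply})
    return reply
```

The key design choice is the `transcript` line: by scanning the full history before every generation, the guardrail re-enforces policy on each turn instead of trusting earlier turns that individually passed moderation.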


Comparative vulnerability analysis showing attack success rates across tested models for both single-turn and multi-turn scenarios.
Findings
As we analyzed the data that emerged from our evaluation of these open-source models, we looked for key threat patterns, model behaviors, and implications for real-world deployments. Key findings included:
- Multi-turn Attacks Remain the Leading Failure Mode: All models demonstrated high susceptibility to multi-turn attacks, with success rates ranging from 25.86% (Google Gemma-3-1B-IT) to 92.78% (Mistral Large-2), representing up to a 10x increase over single-turn baselines. See Table 1 below:


- Alignment Approach Drives Security Gaps: Security gaps were predominantly positive, indicating heightened multi-turn risk (e.g., +73.48% for Alibaba Qwen3-32B and +70% for Mistral Large-2 and Meta Llama 3.3-70B-Instruct). Models with smaller gaps may reflect both weaker single-turn defense and stronger multi-turn defense. We infer that these security gaps stem from each lab's alignment approach to open-weight models: labs such as Meta and Alibaba focused on capabilities and applications, deferring to developers to add additional safety and security policies, while labs with a stronger safety and security posture, such as Google and OpenAI, exhibited more conservative gaps between single- and multi-turn strategies. Regardless, given the variation in single- and multi-turn attack technique success rates across models, end users should consider risks holistically across attack strategies.
- Threat Class Patterns and Sub-threat Concentration: High-risk threat classes such as manipulation, misinformation, and malicious code generation exhibited consistently elevated success rates, with model-specific weaknesses; multi-turn attacks reveal class differences and clear vulnerability profiles. See Table 2 below for how different models performed against various multi-turn strategies. The top 15 sub-threats demonstrated extremely high success rates and are worth prioritizing for defensive mitigation.


- Attack Strategies and Techniques: Certain strategies and multi-turn techniques achieved high success, and each model's resistance varied; the selection of different attack strategies and techniques can critically affect outcomes.
- Overall Implications: The 2-10x superiority of multi-turn attacks against models' guardrails demands immediate security enhancements to mitigate production risks.
The results against GPT-OSS-20b, for example, aligned closely with OpenAI's own evaluations: the overall attack success rates for the model were relatively low, but roughly consistent with the "Jailbreak evaluation" section of the GPT-OSS model card paper, where refusal rates for GPT-OSS-20b ranged from 0.960 to 0.982. This result underscores the continued susceptibility of frontier models to adversarial attacks.
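As a rough sanity check on why those refusal rates are consistent with a "relatively low" attack success rate: if every non-refusal on a disallowed prompt is counted as an attack success (a simplifying assumption, not the model card's exact scoring), the implied single-turn ASR is simply one minus the refusal rate.

```python
# Rough arithmetic only: refusal rates near 1.0 still imply a small,
# nonzero single-turn ASR, assuming ASR ≈ 1 − refusal rate. The range
# comes from the GPT-OSS model card figures quoted above.
refusal_rates = (0.960, 0.982)
implied_asr_pct = [round((1 - r) * 100, 1) for r in refusal_rates]
print(implied_asr_pct)  # [4.0, 1.8]
```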
An AI lab's goal in developing a particular model may also influence assessment outcomes. For example, Qwen's instruction tuning tends to prioritize helpfulness and breadth, which attackers can exploit by reframing their prompts as "for research" or "fictional scenarios," hence a higher multi-turn attack success rate. Meta, on the other hand, tends to ship open weights with the expectation that developers will add their own moderation and safety layers. While baseline alignment is good (indicated by a modest single-turn rate), without additional safety and security guardrails (e.g., retaining safety policies across conversations or sessions, or tool-based moderation such as filtering and refusal models), multi-turn jailbreaks can escalate quickly. Open-weight-centric labs such as Mistral and Meta generally ship capability-first bases with lighter built-in safety features. These are appealing for research and customization, but they push defenses onto the deployer. End users looking for open-weight models to deploy should consider which aspects of a model they prioritize (safety and security alignment versus high-capability open weights with fewer safeguards).
Developers can also fine-tune open-weight models to be more robust to jailbreaks and other adversarial attacks, though we are aware that nefarious actors can conversely fine-tune open-weight models for malicious purposes. Some model developers, such as Google, OpenAI, Meta, and Microsoft, have noted in their technical reports and model cards that they took steps to reduce the likelihood of malicious fine-tuning, while others, such as Alibaba, DeepSeek, and Mistral, did not acknowledge safety or security in their technical reports. Zhipu evaluated GLM-4.5 against safety benchmarks and noted strong performance across some categories while recognizing "room for improvement" in others. Due to inconsistent safety and security standards across the open-weight model landscape, there are attendant security, operational, technical, and ethical risks that stakeholders (from end users to developers to the organizations and enterprises that adopt these models) must consider when adopting or using open-weight models. An emphasis on safety and security, from development to evaluation to release, should remain a top priority among AI developers and AI practitioners.
To see our testing methodology, findings, and the complete security assessment of these open-source models, read our full report here.

