PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips
- URL: http://arxiv.org/abs/2412.07192v3
- Date: Thu, 02 Oct 2025 19:20:59 GMT
- Title: PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips
- Authors: Zachary Coalson, Jeonghyun Woo, Chris S. Lin, Joyce Qu, Yu Sun, Shiyang Chen, Lishan Yang, Gururaj Saileshwar, Prashant Nair, Bo Fang, Sanghyun Hong
- Abstract summary: We study a new vulnerability in commercial-scale safety-aligned large language models (LLMs): their refusal to generate harmful responses can be broken by flipping only a few bits in model parameters. We jailbreak language models with just 5 to 25 bit-flips, requiring up to 40$\times$ fewer bit flips than prior attacks on much smaller computer vision models.
- Score: 10.157041060731006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a new vulnerability in commercial-scale safety-aligned large language models (LLMs): their refusal to generate harmful responses can be broken by flipping only a few bits in model parameters. Our attack jailbreaks billion-parameter language models with just 5 to 25 bit-flips, requiring up to 40$\times$ fewer bit flips than prior attacks on much smaller computer vision models. Unlike prompt-based jailbreaks, our method directly uncensors models in memory at runtime, enabling harmful outputs without requiring input-level modifications. Our key innovation is an efficient bit-selection algorithm that identifies critical bits for language model jailbreaks up to 20$\times$ faster than prior methods. We evaluate our attack on 10 open-source LLMs, achieving high attack success rates (ASRs) of 80-98% with minimal impact on model utility. We further demonstrate an end-to-end exploit via Rowhammer-based fault injection, reliably jailbreaking 5 models (69-91% ASR) on a GDDR6 GPU. Our analyses reveal that: (1) models with weaker post-training alignment require fewer bit-flips to jailbreak; (2) certain model components, e.g., value projection layers, are substantially more vulnerable; and (3) the attack is mechanistically different from existing jailbreak methods. We evaluate potential countermeasures and find that our attack remains effective against defenses at various stages of the LLM pipeline.
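To make the underlying mechanism concrete, the sketch below (an illustration, not the paper's attack code) shows how a single bit-flip in a weight's binary encoding can corrupt it catastrophically. The paper targets billion-parameter models in GPU memory via Rowhammer; this minimal Python example uses a float32 weight and the standard `struct` module to demonstrate why flipping even one exponent bit is so damaging.

```python
import math
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = LSB, 31 = sign) in the IEEE-754
    float32 encoding of `value` and decode the result."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped

# A typical small model weight.
w = 0.0123

# Bit 30 is the most significant exponent bit: flipping it adds 128
# to the biased exponent, scaling the weight by roughly 2**128.
w_flipped = flip_bit(w, 30)
print(f"{w} -> {w_flipped:.3e}")  # magnitude jumps by ~38 orders of magnitude

# The corruption is reversible: flipping the same bit again restores
# the original encoding (for exactly representable values).
assert flip_bit(flip_bit(1.0, 30), 30) == 1.0
```

A targeted attack must of course choose *which* bits to flip; the paper's contribution is a bit-selection algorithm that finds the small set of flips that disables refusal behavior while preserving utility, rather than flips that merely destroy the model.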
Related papers
- Multi-Turn Jailbreaks Are Simpler Than They Seem [3.6010884750431438]
Multi-turn jailbreak attacks achieve success rates exceeding 70% against models optimized for single-turn protection. Our results have important implications for AI safety evaluation and the design of jailbreak-resistant systems.
arXiv Detail & Related papers (2025-08-11T05:57:41Z) - One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs [13.54228868302755]
ArrAttack is an attack method designed to target defended large language models (LLMs). ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts.
arXiv Detail & Related papers (2025-05-23T08:02:38Z) - Rubber Mallet: A Study of High Frequency Localized Bit Flips and Their Impact on Security [6.177931523699345]
The density of modern DRAM has heightened its vulnerability to Rowhammer attacks, which induce bit flips by repeatedly accessing specific memory rows. This paper presents an analysis of bit flip patterns generated by advanced Rowhammer techniques that bypass existing hardware defenses.
arXiv Detail & Related papers (2025-05-02T18:07:07Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [59.25318174362368]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text.
We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.
We propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z) - Jailbreaking? One Step Is Enough! [6.142918017301964]
Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs.
We propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as the "defense" intention.
To enhance the model's confidence and guidance in "defensive" intentions, we adopt in-context learning (ICL) with a small number of attack examples.
arXiv Detail & Related papers (2024-12-17T07:33:41Z) - SQL Injection Jailbreak: a structural disaster of large language models [71.55108680517422]
In this paper, we introduce a novel jailbreak method, which targets the external properties of LLMs, specifically, the way input prompts are constructed. Our SIJ method achieves near 100% attack success rates on five well-known open LLMs on AdvBench, while incurring lower time costs compared to previous methods. To address this, we propose a simple defense method called SelfReminderKey to counter SIJ and demonstrate its effectiveness through experimental results.
arXiv Detail & Related papers (2024-11-03T13:36:34Z) - IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves [70.43466586161345]
IDEATOR is a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. Our benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger defenses.
arXiv Detail & Related papers (2024-10-29T07:15:56Z) - A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z) - BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [47.1955210785169]
In this work, we propose BaThe, a simple yet effective jailbreak defense mechanism.
Jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses.
We assume that harmful instructions can function as triggers, and if we alternatively set rejection responses as the triggered response, the backdoored model then can defend against jailbreak attacks.
arXiv Detail & Related papers (2024-08-17T04:43:26Z) - Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks [27.11523234556414]
We propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG).
PG guides the model to identify harmful prompts by directly setting the first few tokens of the model's output.
We demonstrate the effectiveness of PG across three models and five attack methods.
arXiv Detail & Related papers (2024-08-15T14:51:32Z) - Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z) - AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z) - Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [39.829517061574364]
Even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks".
We propose the generation exploitation attack, which disrupts model alignment by only manipulating variations of decoding methods.
Our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs.
arXiv Detail & Related papers (2023-10-10T20:15:54Z) - One-bit Flip is All You Need: When Bit-flip Attack Meets Model Training [54.622474306336635]
A new weight modification attack called bit flip attack (BFA) was proposed, which exploits memory fault injection techniques.
We propose a training-assisted bit flip attack, in which the adversary is involved in the training stage to build a high-risk model to release.
arXiv Detail & Related papers (2023-08-12T09:34:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.