Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
- URL: http://arxiv.org/abs/2508.10029v1
- Date: Fri, 08 Aug 2025 17:29:16 GMT
- Title: Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
- Authors: Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao XuNingyu Zhang, Bo Lin, Meng Han,
- Abstract summary: This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses.<n> Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods.
- Score: 16.25742791802536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ's effectiveness.
Related papers
- ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models [8.765213350762748]
jailbreak attacks bypass alignment safeguards to elicit harmful outputs.<n>We propose ForgeDAN, a novel framework for generating semantically coherent and highly effective adversarial prompts.<n>Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.
arXiv Detail & Related papers (2025-11-17T16:19:21Z) - Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges [10.382464507264784]
We introduce AMIS, a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates.<n>AMIS achieves state-of-the-art performance, including 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet.
arXiv Detail & Related papers (2025-11-03T09:18:27Z) - Untargeted Jailbreak Attack [42.94437968995701]
gradient-based jailbreak attacks on Large Language Models (LLMs)<n>We propose the first gradient-based untargeted jailbreak attack (UJA), aiming to elicit an unsafe response without enforcing any predefined patterns.<n>Extensive evaluations demonstrate that UJA can achieve over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations.
arXiv Detail & Related papers (2025-10-03T13:38:56Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.<n>We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses [4.706534644850809]
Two primary inference-phase threats are token-level and prompt-level jailbreaks.<n>We propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs.
arXiv Detail & Related papers (2025-06-27T07:26:33Z) - MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning [6.279806727611712]
We propose an effective method for jailbreaking large language models via Iterative Semantic Tuning, named MIST.<n>MIST enables attackers to iteratively refine prompts that preserve the original semantic intent while inducing harmful content.<n>Results show that MIST achieves competitive attack success rate, relatively low query count, and fair transferability.
arXiv Detail & Related papers (2025-06-20T07:16:47Z) - T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks [67.91652526657599]
We formalize the T2V jailbreak attack as a discrete optimization problem and propose a joint objective-based optimization framework, called T2V-OptJail.<n>We conduct large-scale experiments on several T2V models, covering both open-source models and real commercial closed-source models.<n>The proposed method improves 11.4% and 10.0% over the existing state-of-the-art method in terms of attack success rate.
arXiv Detail & Related papers (2025-05-10T16:04:52Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [81.98466438000086]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.<n>We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z) - Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs [4.492376241514766]
Alignment in large language models (LLMs) is used to enforce guidelines such as safety.<n>Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs.<n>We present and evaluate a method to assess the robustness of LLM alignment.
arXiv Detail & Related papers (2025-01-27T22:13:05Z) - Smoothed Embeddings for Robust Language Models [11.97873981355746]
Large language models (LLMs) are vulnerable to jailbreaking attacks that subvert alignment and induce harmful outputs.<n>We propose the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense, which adds random noise to the embedding vectors and performs aggregation during the generation of each output token.<n>Our experiments demonstrate that our approach achieves superior robustness versus utility tradeoffs compared to the baseline defenses.
arXiv Detail & Related papers (2025-01-27T20:57:26Z) - Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs.
In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired in transfer-based attacks.
We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z) - Defending Large Language Models against Jailbreak Attacks via Semantic
Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.