Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
- URL: http://arxiv.org/abs/2503.01865v1
- Date: Tue, 25 Feb 2025 07:47:41 GMT
- Title: Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
- Authors: Junxiao Yang, Zhexin Zhang, Shiyao Cui, Hongning Wang, Minlie Huang
- Abstract summary: This study aims to understand and enhance the transferability of gradient-based jailbreaking methods. We introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints. Our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%.
- Score: 81.14852921721793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jailbreaking attacks can effectively induce unsafe behaviors in Large Language Models (LLMs); however, the transferability of these attacks across different models remains limited. This study aims to understand and enhance the transferability of gradient-based jailbreaking methods, which are among the standard approaches for attacking white-box models. Through a detailed analysis of the optimization process, we introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints (specifically, the response pattern constraint and the token tail constraint) as significant barriers to improved transferability. Removing these unnecessary constraints substantially enhances the transferability and controllability of gradient-based attacks. Evaluated on Llama-3-8B-Instruct as the source model, our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%, while also improving the stability and controllability of jailbreak behaviors on both source and target models.
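The headline metric above is the Transfer Attack Success Rate (T-ASR): the fraction of adversarial prompts, optimized only on the source model (here Llama-3-8B-Instruct), that also elicit unsafe responses from a set of held-out target models. A minimal sketch of such a metric is below; it is an assumed formulation for illustration, and `target_models` and `is_jailbroken` are hypothetical stand-ins for the target models' generation calls and a safety judge, not anything defined in the paper.

```python
from typing import Callable, Sequence

def transfer_attack_success_rate(
    adversarial_prompts: Sequence[str],
    target_models: Sequence[Callable[[str], str]],  # hypothetical: prompt -> response
    is_jailbroken: Callable[[str, str], bool],      # hypothetical safety judge
) -> float:
    """Fraction of (prompt, target model) pairs where the transferred attack succeeds."""
    successes, trials = 0, 0
    for model in target_models:
        for prompt in adversarial_prompts:
            response = model(prompt)
            successes += int(is_jailbroken(prompt, response))
            trials += 1
    return successes / max(trials, 1)
```

A flat average over all (prompt, model) pairs, as above, is one plausible aggregation convention; the paper's exact definition may differ.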
Related papers
- Representation Bending for Large Language Model Safety [27.842146980762934]
Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks pose significant challenges.
This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs.
RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates.
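RepBend's exact objective is not given in this summary, but representation-space safety methods of this kind typically push the model's hidden states on harmful inputs away from the directions that support harmful behavior while keeping benign representations intact. The sketch below is a generic loss in that spirit, not RepBend's actual formulation; all tensor arguments are assumed hidden states extracted from the model being trained and from a frozen reference copy.

```python
import torch
import torch.nn.functional as F

def representation_bending_loss(
    h_harmful: torch.Tensor,      # hidden states of the trained model on harmful inputs
    h_harmful_ref: torch.Tensor,  # same inputs through a frozen reference model
    h_benign: torch.Tensor,       # benign inputs, trained model
    h_benign_ref: torch.Tensor,   # benign inputs, frozen reference model
    alpha: float = 1.0,
) -> torch.Tensor:
    """Generic representation-space safety loss (assumed form, not RepBend's):
    bend harmful-input representations away from the reference while retaining
    benign behavior."""
    # "Bend" term: minimizing cosine similarity pushes harmful representations
    # away from their original direction.
    bend = F.cosine_similarity(h_harmful, h_harmful_ref, dim=-1).mean()
    # Retain term: keep benign representations close to the reference.
    retain = F.mse_loss(h_benign, h_benign_ref)
    return bend + alpha * retain
```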
arXiv Detail & Related papers (2025-04-02T09:47:01Z) - Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization [74.78433600288776]
HKVE (Hierarchical Key-Value Equalization) is an innovative jailbreaking framework that selectively accepts gradient optimization results.
We show that HKVE substantially outperforms existing methods, by margins of 20.43%, 21.01%, and 26.43% respectively.
arXiv Detail & Related papers (2025-03-14T17:57:42Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.
We propose an improved safety alignment approach that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
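A minimal sketch of how such a dual objective might be combined into a single training loss is shown below. The specific terms and the weighting `beta` are assumptions for illustration, not the paper's definition; the inputs are assumed to be token-level log-probabilities already gathered from the model.

```python
import torch

def dual_objective_loss(
    refusal_logprobs: torch.Tensor,  # log p(refusal tokens | harmful prompt + partial unsafe prefix)
    harmful_logprobs: torch.Tensor,  # log p(harmful continuation tokens | harmful prompt)
    beta: float = 1.0,
) -> torch.Tensor:
    """Assumed combination of robust refusal training and targeted unlearning."""
    robust_refusal = -refusal_logprobs.mean()      # maximize the likelihood of refusing
    targeted_unlearning = harmful_logprobs.mean()  # minimize the likelihood of harmful text
    return robust_refusal + beta * targeted_unlearning
```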
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - Improving the Transferability of Adversarial Examples by Inverse Knowledge Distillation [15.362394334872077]
Inverse Knowledge Distillation (IKD) is designed to enhance adversarial transferability effectively.
IKD integrates with gradient-based attack methods, promoting diversity in attack gradients and mitigating overfitting to specific model architectures.
Experiments on the ImageNet dataset validate the effectiveness of our approach.
arXiv Detail & Related papers (2025-02-24T09:35:30Z) - Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability.
Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs.
It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
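The loop described above can be pictured roughly as follows: the attacker LLM proposes candidate suffixes, the suffixes that actually jailbreak the victim are collected, and the attacker is fine-tuned on them before the next round. Every helper in this sketch (`propose_suffixes`, `attack_succeeds`, `finetune_on`) is a hypothetical placeholder rather than ADV-LLM's real interface.

```python
from typing import Callable, List, Tuple

def iterative_self_tuning(
    attacker,
    victim,
    harmful_prompts: List[str],
    propose_suffixes: Callable,  # hypothetical: (attacker, prompt, n) -> list of suffix strings
    attack_succeeds: Callable,   # hypothetical: (victim, full_prompt) -> bool
    finetune_on: Callable,       # hypothetical: (attacker, examples) -> updated attacker
    rounds: int = 3,
    samples_per_prompt: int = 8,
):
    """Rough outline of an ADV-LLM-style self-tuning loop; all helpers are placeholders."""
    for _ in range(rounds):
        winners: List[Tuple[str, str]] = []
        for prompt in harmful_prompts:
            for suffix in propose_suffixes(attacker, prompt, samples_per_prompt):
                if attack_succeeds(victim, prompt + " " + suffix):
                    winners.append((prompt, suffix))
        # Fine-tune the attacker on suffixes that actually worked, so the next
        # round samples from a stronger adversarial distribution.
        attacker = finetune_on(attacker, winners)
    return attacker
```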
arXiv Detail & Related papers (2024-10-24T06:36:12Z) - Boosting Jailbreak Transferability for Large Language Models [10.884050438726215]
We propose a scenario induction template, optimized suffix selection, and the integration of a re-suffix attack mechanism to reduce inconsistent outputs.
Our approach has shown superior performance in extensive experiments across various benchmarks, achieving nearly 100% success rates in both attack execution and transferability.
arXiv Detail & Related papers (2024-10-21T05:11:19Z) - Advancing Generalized Transfer Attack with Initialization Derived Bilevel Optimization and Dynamic Sequence Truncation [49.480978190805125]
Transfer attacks attract significant interest for black-box applications.
Existing works essentially optimize a single-level objective directly with respect to the surrogate model.
We propose a bilevel optimization paradigm, which explicitly reforms the nested relationship between the Upper-Level (UL) pseudo-victim attacker and the Lower-Level (LL) surrogate attacker.
arXiv Detail & Related papers (2024-06-04T07:45:27Z) - DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints [68.82294911302579]
We introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity.
Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that performs better on various diversity metrics across different attack success rate levels, 2) better enhancing the resilience of blue-team models through safety tuning on the collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward over-optimization.
arXiv Detail & Related papers (2024-05-29T12:12:09Z) - Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods exhibit strong performance when used for automatic jailbreak attacks against safety-aligned LLMs.
In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks.
We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z) - Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes [0.0]
Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents.
These models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks.
This study investigates the impact of these modifications on LLM safety, a critical consideration for building reliable and secure AI systems.
arXiv Detail & Related papers (2024-04-05T20:31:45Z) - Boosting the Transferability of Adversarial Attacks with Global Momentum Initialization [23.13302900115702]
Adversarial examples are crafted by adding human-imperceptible perturbations to benign inputs.
Adversarial examples also exhibit transferability across models, enabling practical black-box attacks.
We introduce Global Momentum Initialization (GI), providing global momentum knowledge to mitigate gradient elimination.
GI seamlessly integrates with existing transfer methods, significantly improving the success rate of transfer attacks by an average of 6.4%.
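A rough sketch of how a global momentum initialization phase could be bolted onto a standard momentum iterative attack (MI-FGSM style) is shown below: a short pre-convergence run accumulates momentum with a larger search step, the perturbation from that run is discarded, and the main attack starts from the pre-converged momentum. This is an assumed reading for illustration, not the paper's reference implementation; it assumes 4D image batches and an L-infinity budget, and `model`, `loss_fn`, `x`, `y` are generic placeholders.

```python
import torch

def gi_mi_fgsm(model, loss_fn, x, y, eps=16/255, steps=10, pre_steps=5,
               mu=1.0, pre_step_factor=10.0):
    """Sketch of MI-FGSM with a global momentum initialization phase (assumed form)."""
    alpha = eps / steps
    g = torch.zeros_like(x)  # accumulated momentum

    # Pre-convergence phase: accumulate momentum with a larger search step,
    # then keep only the momentum and discard the perturbation.
    x_pre = x.clone()
    for _ in range(pre_steps):
        x_pre.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(x_pre), y), x_pre)[0]
        g = mu * g + grad / grad.abs().mean(dim=(1, 2, 3), keepdim=True)
        x_pre = (x_pre + pre_step_factor * alpha * g.sign()).detach()
        x_pre = torch.clamp(x_pre, x - eps, x + eps).clamp(0, 1)

    # Main attack: standard momentum iterations starting from the pre-converged momentum.
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)[0]
        g = mu * g + grad / grad.abs().mean(dim=(1, 2, 3), keepdim=True)
        x_adv = (x_adv + alpha * g.sign()).detach()
        x_adv = torch.clamp(x_adv, x - eps, x + eps).clamp(0, 1)
    return x_adv
```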
arXiv Detail & Related papers (2022-11-21T07:59:22Z) - Enhancing the Transferability of Adversarial Attacks through Variance Tuning [6.5328074334512]
We propose a new method called variance tuning to enhance the class of iterative gradient-based attack methods.
Empirical results on the standard ImageNet dataset demonstrate that our method could significantly improve the transferability of gradient-based adversarial attacks.
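The core idea of variance tuning is to correct each iteration's gradient with the gradient variance estimated in a neighbourhood of the previous point, stabilizing the update direction before it enters the momentum accumulator. The sketch below follows a VMI-FGSM-style update; it is an assumed form for illustration (image inputs, L-infinity budget), not the authors' code, and all arguments are generic placeholders.

```python
import torch

def vmi_fgsm(model, loss_fn, x, y, eps=16/255, steps=10, mu=1.0,
             n_samples=20, beta=1.5):
    """Sketch of a variance-tuned momentum iterative attack (assumed VMI-FGSM-style form)."""
    alpha = eps / steps
    g = torch.zeros_like(x)  # momentum
    v = torch.zeros_like(x)  # gradient variance from the previous iteration

    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)[0]

        # Tune the current gradient with the variance accumulated last iteration.
        tuned = grad + v
        g = mu * g + tuned / tuned.abs().mean(dim=(1, 2, 3), keepdim=True)

        # Estimate the gradient variance in a neighbourhood of the current point.
        neighbour_grad = torch.zeros_like(x)
        for _ in range(n_samples):
            noise = (torch.rand_like(x) * 2 - 1) * beta * eps
            x_near = (x_adv.detach() + noise).requires_grad_(True)
            neighbour_grad += torch.autograd.grad(loss_fn(model(x_near), y), x_near)[0]
        v = neighbour_grad / n_samples - grad

        x_adv = (x_adv + alpha * g.sign()).detach()
        x_adv = torch.clamp(x_adv, x - eps, x + eps).clamp(0, 1)
    return x_adv
```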
arXiv Detail & Related papers (2021-03-29T12:41:55Z) - Boosting Adversarial Transferability through Enhanced Momentum [50.248076722464184]
Deep learning models are vulnerable to adversarial examples crafted by adding human-imperceptible perturbations on benign images.
Various momentum iterative gradient-based methods have been shown to be effective at improving adversarial transferability.
We propose an enhanced momentum iterative gradient-based method to further enhance the adversarial transferability.
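One common reading of "enhanced momentum" is to average the gradients at points sampled along the previous update direction before feeding them into the momentum accumulator (EMI-FGSM style). The sketch below illustrates that idea under the same image-attack assumptions as the earlier sketches; it is not the paper's reference implementation, and `radius` and `n_samples` are assumed hyperparameters.

```python
import torch

def emi_fgsm(model, loss_fn, x, y, eps=16/255, steps=10, mu=1.0,
             n_samples=11, radius=7.0):
    """Sketch of an enhanced-momentum iterative attack (assumed EMI-FGSM-style form)."""
    alpha = eps / steps
    g = torch.zeros_like(x)       # accumulated momentum
    g_prev = torch.zeros_like(x)  # last iteration's averaged, normalized gradient

    x_adv = x.clone()
    for _ in range(steps):
        # Look ahead along the previous gradient direction and average the gradients.
        avg_grad = torch.zeros_like(x)
        for c in torch.linspace(-radius, radius, n_samples):
            x_probe = (x_adv + c * alpha * g_prev).detach().requires_grad_(True)
            avg_grad += torch.autograd.grad(loss_fn(model(x_probe), y), x_probe)[0]
        avg_grad /= n_samples

        g_prev = avg_grad / avg_grad.abs().mean(dim=(1, 2, 3), keepdim=True)
        g = mu * g + g_prev
        x_adv = (x_adv + alpha * g.sign()).detach()
        x_adv = torch.clamp(x_adv, x - eps, x + eps).clamp(0, 1)
    return x_adv
```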
arXiv Detail & Related papers (2021-03-19T03:10:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.