Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization
- URL: http://arxiv.org/abs/2503.11750v1
- Date: Fri, 14 Mar 2025 17:57:42 GMT
- Title: Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization
- Authors: Shuyang Hao, Yiwei Wang, Bryan Hooi, Jun Liu, Muhao Chen, Zi Huang, Yujun Cai
- Abstract summary: HKVE (Hierarchical Key-Value Equalization) is an innovative jailbreaking framework that selectively accepts gradient optimization results. We show that HKVE substantially outperforms existing methods by margins of 20.43%, 21.01% and 26.43% respectively.
- Score: 74.78433600288776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the realm of large vision-language models (LVLMs), adversarial jailbreak attacks serve as a red-teaming approach to identify safety vulnerabilities of these models and their associated defense mechanisms. However, we identify a critical limitation: not every adversarial optimization step leads to a positive outcome, and indiscriminately accepting optimization results at each step may reduce the overall attack success rate. To address this challenge, we introduce HKVE (Hierarchical Key-Value Equalization), an innovative jailbreaking framework that selectively accepts gradient optimization results based on the distribution of attention scores across different layers, ensuring that every optimization step positively contributes to the attack. Extensive experiments demonstrate HKVE's significant effectiveness, achieving attack success rates of 75.08% on MiniGPT4, 85.84% on LLaVA and 81.00% on Qwen-VL, substantially outperforming existing methods by margins of 20.43%, 21.01% and 26.43% respectively. Furthermore, making every step effective not only leads to an increase in attack success rate but also allows for a reduction in the number of iterations, thereby lowering computational costs. Warning: This paper contains potentially harmful example data.
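The core idea in the abstract is a selective acceptance rule: a gradient step on the adversarial input is kept only if a layer-wise attention-score criterion improves. The following is a minimal sketch of that idea, not the authors' implementation; the criterion `attention_balance_score`, the `model.get_attention_maps` accessor, and all other names are hypothetical placeholders assumed for illustration.

```python
# Minimal sketch (assumed interfaces, not the paper's HKVE code): accept a
# gradient-based update to an adversarial image only when a layer-wise
# attention-score criterion improves.
import torch

def attention_balance_score(attn_maps):
    # Hypothetical criterion: reward more even attention mass across layers
    # (higher = more balanced). The paper's exact KV-equalization measure is
    # not reproduced here.
    layer_mass = torch.stack([a.mean() for a in attn_maps])
    return -layer_mass.var()

def selective_step(model, image_adv, target, loss_fn, step_size=1.0 / 255):
    # One optimization step with selective acceptance.
    image_adv = image_adv.clone().detach().requires_grad_(True)
    loss = loss_fn(model(image_adv), target)
    loss.backward()

    # Candidate update from the signed gradient, kept in the valid pixel range.
    candidate = (image_adv - step_size * image_adv.grad.sign()).clamp(0, 1).detach()

    # Score the current and candidate images; keep the step only if it helps.
    with torch.no_grad():
        score_old = attention_balance_score(model.get_attention_maps(image_adv))
        score_new = attention_balance_score(model.get_attention_maps(candidate))
    return candidate if score_new > score_old else image_adv.detach()
```

Calling `selective_step` in a loop would implement the "every step effective" behavior described above: steps that fail the acceptance test are simply discarded rather than accumulated.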
Related papers
- Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models [2.1511703382556657]
This study exploits large language models (LLMs) with an iterative prompting technique.
We analyze the response patterns of LLMs, including GPT-3.5, GPT-4, LLaMa2, Vicuna, and ChatGLM.
Persuading strategies enhance prompt effectiveness while maintaining consistency with malicious intent.
arXiv Detail & Related papers (2025-03-26T08:40:46Z) - Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints [81.14852921721793]
This study aims to understand and enhance the transferability of gradient-based jailbreaking methods. We introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints. Our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%.
arXiv Detail & Related papers (2025-02-25T07:47:41Z) - REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective [57.57786477441956]
We propose an adaptive and semantic optimization problem over the population of responses.
Our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% with circuit breaker defense.
arXiv Detail & Related papers (2025-02-24T15:34:48Z) - Learning Conformal Abstention Policies for Adaptive Risk Management in Large Language and Vision-Language Models [3.958317527488534]
Large Language and Vision-Language Models (LLMs/VLMs) are increasingly used in safety-critical applications.
Uncertainty quantification helps assess prediction confidence and enables abstention when uncertainty is high.
We propose learnable abstention, integrating reinforcement learning (RL) with Conformal Prediction (CP) to optimize abstention thresholds.
arXiv Detail & Related papers (2025-02-08T21:30:41Z) - GRAPE: Generalizing Robot Policy via Preference Alignment [58.419992317452376]
We present GRAPE: Generalizing Robot Policy via Preference Alignment.
We show GRAPE increases success rates on in-domain and unseen manipulation tasks by 51.79% and 58.20%, respectively.
GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 37.44% and rollout step-length by 11.15%, respectively.
arXiv Detail & Related papers (2024-11-28T18:30:10Z) - Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models [92.79804303337522]
Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. We introduce MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment. Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
arXiv Detail & Related papers (2024-11-27T02:40:29Z) - Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs.
In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired in transfer-based attacks.
We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z) - Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks [17.22989422489567]
Large language models (LLMs) are vulnerable to adversarial attacks or jailbreaking.
We propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm to create robust system-level defenses.
Our results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench.
arXiv Detail & Related papers (2024-01-30T18:56:08Z)