Related papers: Can Large Language Models Automatically Jailbreak GPT-4V?

Can Large Language Models Automatically Jailbreak GPT-4V?

URL: http://arxiv.org/abs/2407.16686v1
Date: Tue, 23 Jul 2024 17:50:45 GMT
Title: Can Large Language Models Automatically Jailbreak GPT-4V?
Authors: Yuanwei Wu, Yue Huang, Yixin Liu, Xiang Li, Pan Zhou, Lichao Sun,
Abstract summary: We introduce AutoJailbreak, an innovative automatic jailbreak technique inspired by prompt optimization. Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success Rate (ASR) exceeding 95.3%. This research sheds light on strengthening GPT-4V security, underscoring the potential for LLMs to be exploited in compromising GPT-4V integrity.
Score: 64.04997365446468
License: http://creativecommons.org/licenses/by/4.0/
Abstract: GPT-4V has attracted considerable attention due to its extraordinary capacity for integrating and processing multimodal information. At the same time, its ability of face recognition raises new safety concerns of privacy leakage. Despite researchers' efforts in safety alignment through RLHF or preprocessing filters, vulnerabilities might still be exploited. In our study, we introduce AutoJailbreak, an innovative automatic jailbreak technique inspired by prompt optimization. We leverage Large Language Models (LLMs) for red-teaming to refine the jailbreak prompt and employ weak-to-strong in-context learning prompts to boost efficiency. Furthermore, we present an effective search method that incorporates early stopping to minimize optimization time and token expenditure. Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success Rate (ASR) exceeding 95.3\%. This research sheds light on strengthening GPT-4V security, underscoring the potential for LLMs to be exploited in compromising GPT-4V integrity.

Related papers

Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs [61.916827858666906]
We reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping.<n>We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning.<n>We propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling.
arXiv Detail & Related papers (2025-07-06T12:19:04Z)
xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models. We propose a novel black-box jailbreak method leveraging reinforcement learning (RL) We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment [97.38766396447369]
Despite training-time safety alignment, Multimodal Large Language Models (MLLMs) remain vulnerable to jailbreak attacks. We propose Immune, an inference-time defense framework that leverages a safe reward model through controlled decoding to defend against jailbreak attacks.
arXiv Detail & Related papers (2024-11-27T19:00:10Z)
Diversity Helps Jailbreak Large Language Models [16.34618038553998]
We have uncovered a powerful jailbreak technique that leverages large language models' ability to diverge from prior context. By simply instructing the LLM to deviate and obfuscate previous attacks, our method dramatically outperforms existing approaches. This revelation exposes a critical flaw in current LLM safety training, suggesting that existing methods may merely mask vulnerabilities rather than eliminate them.
arXiv Detail & Related papers (2024-11-06T19:39:48Z)
IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves [67.30731020715496]
We propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. It achieves a 94% success rate in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high success rates of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Meta's Chameleon.
arXiv Detail & Related papers (2024-10-29T07:15:56Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models [21.252514293436437]
We propose Analyzing-based Jailbreak (ABJ) to combat jailbreak attacks on Large Language Models (LLMs) ABJ achieves 94.8% attack success rate (ASR) and 1.06 attack efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-23T06:14:41Z)
Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks. Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z)
Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks [65.84623493488633]
This paper conducts a rigorous evaluation of GPT-4o against jailbreak attacks. The newly introduced audio modality opens up new attack vectors for jailbreak attacks on GPT-4o. Existing black-box multimodal jailbreak attack methods are largely ineffective against GPT-4o and GPT-4V.
arXiv Detail & Related papers (2024-06-10T14:18:56Z)
GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation [9.377563769107843]
We introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach to jailbreaking with only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and target. We find that IRIS jailbreak success rates of 98% on GPT-4, 92% on GPT-4 Turbo, and 94% on Llama-3.1-70B in under 7 queries.
arXiv Detail & Related papers (2024-05-21T03:16:35Z)
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [64.60375604495883]
We discover a system prompt leakage vulnerability in GPT-4V. By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. We also evaluate the effect of modifying system prompts to defend against jailbreaking attacks.
arXiv Detail & Related papers (2023-11-15T17:17:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.