Related papers: Voice Jailbreak Attacks Against GPT-4o

URL: http://arxiv.org/abs/2405.19103v1
Date: Wed, 29 May 2024 14:07:44 GMT
Title: Voice Jailbreak Attacks Against GPT-4o
Authors: Xinyue Shen, Yixin Wu, Michael Backes, Yang Zhang
Abstract summary: We present the first systematic measurement of jailbreak attacks against the voice mode of GPT-4o. We propose VoiceJailbreak, a novel voice jailbreak attack that humanizes GPT-4o and attempts to persuade it through fictional storytelling.
Score: 27.505874745648498
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, the concept of artificial assistants has evolved from science fiction into real-world applications. GPT-4o, the newest multimodal large language model (MLLM) across audio, vision, and text, has further blurred the line between fiction and reality by enabling more natural human-computer interactions. However, the advent of GPT-4o's voice mode may also introduce a new attack surface. In this paper, we present the first systematic measurement of jailbreak attacks against the voice mode of GPT-4o. We show that GPT-4o demonstrates good resistance to forbidden questions and text jailbreak prompts when directly transferring them to voice mode. This resistance is primarily due to GPT-4o's internal safeguards and the difficulty of adapting text jailbreak prompts to voice mode. Inspired by GPT-4o's human-like behaviors, we propose VoiceJailbreak, a novel voice jailbreak attack that humanizes GPT-4o and attempts to persuade it through fictional storytelling (setting, character, and plot). VoiceJailbreak is capable of generating simple, audible, yet effective jailbreak prompts, which significantly increases the average attack success rate (ASR) from 0.033 to 0.778 in six forbidden scenarios. We also conduct extensive experiments to explore the impacts of interaction steps, key elements of fictional writing, and different languages on VoiceJailbreak's effectiveness and further enhance the attack performance with advanced fictional writing techniques. We hope our study can assist the research community in building more secure and well-regulated MLLMs.

Related papers

Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models [32.752269224536754]
Defense2Attack is a novel jailbreak method that bypasses the safety guardrails of Vision-Language Models. Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art attack methods.
arXiv Detail & Related papers (2025-09-16T06:25:58Z)
AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models [19.59499038333469]
Jailbreak attacks on large audio-language models (LALMs) have been studied recently, but they achieve suboptimal effectiveness, applicability, and practicability. We propose AudioJailbreak, a novel audio jailbreak attack featuring asynchrony, universality, stealthiness, and over-the-air robustness.
arXiv Detail & Related papers (2025-05-20T09:10:45Z)
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves [64.46372846359694]
IDEATOR is a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. Our benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet.
arXiv Detail & Related papers (2024-10-29T07:15:56Z)
GPT-4o System Card [211.87336862081963]
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages.
arXiv Detail & Related papers (2024-10-25T17:43:01Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
Can Large Language Models Automatically Jailbreak GPT-4V? [64.04997365446468]
We introduce AutoJailbreak, an innovative automatic jailbreak technique inspired by prompt optimization. Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success Rate (ASR) exceeding 95.3%. This research sheds light on strengthening GPT-4V security, underscoring the potential for LLMs to be exploited in compromising GPT-4V integrity.
arXiv Detail & Related papers (2024-07-23T17:50:45Z)
Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks [65.84623493488633]
This paper conducts a rigorous evaluation of GPT-4o against jailbreak attacks. The newly introduced audio modality opens up new attack vectors for jailbreak attacks on GPT-4o. Existing black-box multimodal jailbreak attack methods are largely ineffective against GPT-4o and GPT-4V.
arXiv Detail & Related papers (2024-06-10T14:18:56Z)
Automatic Jailbreaking of the Text-to-Image Generative AI Systems [76.9697122883554]
We study the safety of commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, on copyright infringement with naive prompts. We propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our framework successfully jailbreaks ChatGPT with an 11.0% block rate, making it generate copyrighted content 76% of the time.
arXiv Detail & Related papers (2024-05-26T13:32:24Z)
GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation [9.377563769107843]
We introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach to jailbreaking with only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and target. We find that IRIS achieves jailbreak success rates of 98% on GPT-4, 92% on GPT-4 Turbo, and 94% on Llama-3.1-70B in under 7 queries.
arXiv Detail & Related papers (2024-05-21T03:16:35Z)
All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks [0.0]
This study introduces a straightforward black-box method for efficiently crafting jailbreak prompts. Our technique iteratively transforms harmful prompts into benign expressions directly utilizing the target LLM. Our method consistently achieved an attack success rate exceeding 80% within an average of five iterations for forbidden questions.
arXiv Detail & Related papers (2024-01-18T08:36:54Z)
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [64.60375604495883]
We discover a system prompt leakage vulnerability in GPT-4V. By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. We also evaluate the effect of modifying system prompts to defend against jailbreaking attacks.
arXiv Detail & Related papers (2023-11-15T17:17:39Z)
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study [22.411634418082368]
Large Language Models (LLMs) have demonstrated vast potential but also introduce challenges related to content constraints and potential misuse. Our study investigates three key research questions: (1) the number of different prompt types that can jailbreak LLMs, (2) the effectiveness of jailbreak prompts in circumventing LLM constraints, and (3) the resilience of ChatGPT against these jailbreak prompts.
arXiv Detail & Related papers (2023-05-23T09:33:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.