Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks
- URL: http://arxiv.org/abs/2406.06302v2
- Date: Wed, 3 Jul 2024 00:51:21 GMT
- Title: Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks
- Authors: Zonghao Ying, Aishan Liu, Xianglong Liu, Dacheng Tao,
- Abstract summary: This paper conducts a rigorous evaluation of GPT-4o against jailbreak attacks.
The newly introduced audio modality opens up new attack vectors for jailbreak attacks on GPT-4o.
Existing black-box multimodal jailbreak attack methods are largely ineffective against GPT-4o and GPT-4V.
- Score: 65.84623493488633
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent release of GPT-4o has garnered widespread attention due to its powerful general capabilities. While its impressive performance is widely acknowledged, its safety aspects have not been sufficiently explored. Given the potential societal impact of risky content generated by advanced generative AI such as GPT-4o, it is crucial to rigorously evaluate its safety. In response to this question, this paper for the first time conducts a rigorous evaluation of GPT-4o against jailbreak attacks. Specifically, this paper adopts a series of multi-modal and uni-modal jailbreak attacks on 4 commonly used benchmarks encompassing three modalities (ie, text, speech, and image), which involves the optimization of over 4,000 initial text queries and the analysis and statistical evaluation of nearly 8,000+ response on GPT-4o. Our extensive experiments reveal several novel observations: (1) In contrast to the previous version (such as GPT-4V), GPT-4o has enhanced safety in the context of text modality jailbreak; (2) The newly introduced audio modality opens up new attack vectors for jailbreak attacks on GPT-4o; (3) Existing black-box multimodal jailbreak attack methods are largely ineffective against GPT-4o and GPT-4V. These findings provide critical insights into the safety implications of GPT-4o and underscore the need for robust alignment guardrails in large models. Our code is available at \url{https://github.com/NY1024/Jailbreak_GPT4o}.
Related papers
- GPT-4o System Card [211.87336862081963]
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video.
It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network.
It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages.
arXiv Detail & Related papers (2024-10-25T17:43:01Z) - Can Large Language Models Automatically Jailbreak GPT-4V? [64.04997365446468]
We introduce AutoJailbreak, an innovative automatic jailbreak technique inspired by prompt optimization.
Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success Rate (ASR) exceeding 95.3%.
This research sheds light on strengthening GPT-4V security, underscoring the potential for LLMs to be exploited in compromising GPT-4V integrity.
arXiv Detail & Related papers (2024-07-23T17:50:45Z) - Voice Jailbreak Attacks Against GPT-4o [27.505874745648498]
We present the first systematic measurement of jailbreak attacks against the voice mode of GPT-4o.
We propose VoiceJailbreak, a novel voice jailbreak attack that humanizes GPT-4o and attempts to persuade it through fictional storytelling.
arXiv Detail & Related papers (2024-05-29T14:07:44Z) - EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs)
It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator.
Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z) - GPT-4 and Safety Case Generation: An Exploratory Analysis [2.3361634876233817]
This paper investigates the exploration of generating safety cases with large language models (LLMs) and conversational interfaces (ChatGPT)
Our primary objective is to delve into the existing knowledge base of GPT-4, focusing on its understanding of the Goal Structuring Notation (GSN)
We perform four distinct experiments with GPT-4 to assess its capacity for generating safety cases within a defined system and application domain.
arXiv Detail & Related papers (2023-12-09T22:28:48Z) - Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [64.60375604495883]
We discover a system prompt leakage vulnerability in GPT-4V.
By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts.
We also evaluate the effect of modifying system prompts to defend against jailbreaking attacks.
arXiv Detail & Related papers (2023-11-15T17:17:39Z) - DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT
Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.