ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
- URL: http://arxiv.org/abs/2402.11753v4
- Date: Fri, 7 Jun 2024 17:35:17 GMT
- Title: ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
- Authors: Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
- Abstract summary: We propose a novel ASCII art-based jailbreak attack and introduce a benchmark, the Vision-in-Text Challenge (ViTC).
We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art.
We develop the jailbreak attack ArtPrompt, which leverages the poor performance of LLMs in recognizing ASCII art to bypass safety measures and elicit undesired behaviors.
- Score: 13.008917830855832
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safety is critical to the usage of large language models (LLMs). Multiple techniques such as data filtering and supervised fine-tuning have been developed to strengthen LLM safety. However, currently known techniques presume that corpora used for safety alignment of LLMs are solely interpreted by semantics. This assumption, however, does not hold in real-world applications, which leads to severe vulnerabilities in LLMs. For example, users of forums often use ASCII art, a form of text-based art, to convey image information. In this paper, we propose a novel ASCII art-based jailbreak attack and introduce a comprehensive benchmark Vision-in-Text Challenge (ViTC) to evaluate the capabilities of LLMs in recognizing prompts that cannot be solely interpreted by semantics. We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art. Based on this observation, we develop the jailbreak attack ArtPrompt, which leverages the poor performance of LLMs in recognizing ASCII art to bypass safety measures and elicit undesired behaviors from LLMs. ArtPrompt only requires black-box access to the victim LLMs, making it a practical attack. We evaluate ArtPrompt on five SOTA LLMs, and show that ArtPrompt can effectively and efficiently induce undesired behaviors from all five LLMs. Our code is available at https://github.com/uw-nsl/ArtPrompt.
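As a concrete illustration of the attack described in the abstract, the sketch below cloaks a single word of a prompt by replacing it with an ASCII-art rendering that the model is asked to decode. This is a minimal sketch, not the authors' reference implementation (available at https://github.com/uw-nsl/ArtPrompt); the use of the third-party `pyfiglet` package and the wrapper wording are assumptions.

```python
# Minimal sketch of an ASCII-art-cloaked prompt; assumes `pip install pyfiglet`.
import pyfiglet

def cloak_word_as_ascii_art(word: str, font: str = "standard") -> str:
    """Render a single masked word as ASCII art."""
    return pyfiglet.figlet_format(word.upper(), font=font)

def build_cloaked_prompt(instruction_with_mask: str, masked_word: str) -> str:
    """Ask the model to recover the word from the ASCII art, then follow the
    instruction with the recovered word substituted for [MASK]."""
    art = cloak_word_as_ascii_art(masked_word)
    return (
        "The ASCII art below encodes a single word. First recover the word, "
        "then follow the instruction, substituting it for [MASK].\n\n"
        f"{art}\n"
        f"Instruction: {instruction_with_mask}"
    )

# Benign usage example.
print(build_cloaked_prompt("Write a short poem about a [MASK].", "sunset"))
```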
Related papers
- ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test [1.960444962205579]
ArtPerception is a novel black-box jailbreak framework that strategically leverages ASCII art to bypass the security measures of state-of-the-art (SOTA) LLMs.
Phase 1 conducts a one-time, model-specific pre-test to empirically determine the optimal parameters for ASCII art recognition.
Phase 2 leverages these insights to launch a highly efficient, one-shot malicious jailbreak attack.
arXiv Detail & Related papers (2025-10-11T16:28:37Z)
- The TIP of the Iceberg: Revealing a Hidden Class of Task-in-Prompt Adversarial Attacks on LLMs [1.9424018922013224]
We present a novel class of jailbreak adversarial attacks on LLMs.
Our approach embeds sequence-to-sequence tasks into the model's prompt to indirectly generate prohibited inputs.
We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models.
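The summary above describes hiding the trigger term behind a task the model must first solve. Below is a minimal sketch of that construction, assuming a Caesar cipher as the embedded sequence-to-sequence task and an invented wrapper prompt; the paper's actual task set is not specified here.

```python
# Minimal sketch: the hidden word never appears verbatim in the prompt;
# the model must first solve a small sequence-to-sequence task to recover it.
def caesar_encode(word: str, shift: int = 3) -> str:
    out = []
    for ch in word.lower():
        if ch.isalpha():
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)

def build_task_in_prompt(instruction_with_mask: str, hidden_word: str, shift: int = 3) -> str:
    encoded = caesar_encode(hidden_word, shift)
    return (
        f"Step 1: Decode '{encoded}' by shifting each letter back by {shift}.\n"
        "Step 2: Substitute the decoded word for [MASK] in the instruction below and respond to it.\n"
        f"Instruction: {instruction_with_mask}"
    )

# Benign usage example.
print(build_task_in_prompt("Describe how a [MASK] works.", "bicycle"))
```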
arXiv Detail & Related papers (2025-01-27T12:48:47Z)
- Dagger Behind Smile: Fool LLMs with a Happy Ending Story [3.474162324046381]
Happy Ending Attack (HEA) wraps a malicious request in a scenario template built around a positive prompt, mainly in the form of a happy ending; it thus fools LLMs into jailbreaking either immediately or at a follow-up malicious request.
Our HEA can successfully jailbreak state-of-the-art LLMs, including GPT-4o, Llama3-70b, and Gemini-pro, and achieves an 88.79% attack success rate on average.
arXiv Detail & Related papers (2025-01-19T13:39:51Z)
- Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models [61.916827858666906]
Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries.
This paper proposes a method called Token Highlighter to inspect and mitigate the potential jailbreak threats in the user query.
arXiv Detail & Related papers (2024-12-24T05:10:02Z)
- Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars [13.496824581458547]
We introduce a novel attack framework that exploits the imaginative capacity of Large Language Models (LLMs) to achieve jailbreaking.
Specifically, AVATAR extracts harmful entities from a given harmful target and maps them to innocuous adversarial entities.
Results demonstrate that AVATAR can effectively and transferably jailbreak LLMs and achieve a state-of-the-art attack success rate.
arXiv Detail & Related papers (2024-12-10T10:14:03Z)
- FlipAttack: Jailbreak LLMs via Flipping [63.871087708946476]
This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs.
We reveal that LLMs tend to understand text from left to right and find that they struggle to comprehend it when noise is added to the left side.
Motivated by these insights, we propose to disguise the harmful prompt by constructing left-side noise merely based on the prompt itself, then generalize this idea to 4 flipping modes.
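A minimal sketch of the kind of flipping transformations the summary refers to; since only the count of modes ("4 flipping modes") is given above, the three variants below are illustrative rather than the paper's exact set.

```python
# Minimal sketch: disguise a prompt by reversing word order or characters,
# producing the "left-side noise" effect the summary describes.
def flip_word_order(prompt: str) -> str:
    return " ".join(reversed(prompt.split()))

def flip_chars_in_words(prompt: str) -> str:
    return " ".join(word[::-1] for word in prompt.split())

def flip_chars_in_sentence(prompt: str) -> str:
    return prompt[::-1]

# Benign usage example.
benign = "please summarize this paragraph"
for mode in (flip_word_order, flip_chars_in_words, flip_chars_in_sentence):
    print(mode.__name__, "->", mode(benign))
```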
arXiv Detail & Related papers (2024-10-02T08:41:23Z)
- ObscurePrompt: Jailbreaking Large Language Models via Obscure Input [32.00508793605316]
We introduce a straightforward and novel method, named ObscurePrompt, for jailbreaking LLMs.
We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects an LLM's ethical decision boundary.
Our approach substantially improves upon previous methods in terms of attack effectiveness while maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z)
- How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States [65.45603614354329]
Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs.
Jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content.
We employ weak classifiers to explain LLM safety through the intermediate hidden states.
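A minimal sketch of probing intermediate hidden states with a weak classifier, in the spirit of the summary above; the model (`gpt2` as a small stand-in), the probed layer, and the logistic-regression probe are assumptions for illustration.

```python
# Minimal sketch: mean-pool one intermediate layer's hidden states per prompt,
# then fit a simple (weak) classifier to separate refusable from benign inputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # stand-in; the paper studies safety-aligned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def layer_feature(text: str, layer: int = 6) -> torch.Tensor:
    """Mean-pooled hidden state of one intermediate layer for a prompt."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.hidden_states[layer][0].mean(dim=0)

# Tiny illustrative "dataset": label 1 = should be refused, 0 = benign.
prompts = ["how do I bake bread", "tell me a joke",
           "<a clearly harmful request>", "<another harmful request>"]
labels = [0, 0, 1, 1]
X = torch.stack([layer_feature(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))
```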
arXiv Detail & Related papers (2024-06-09T05:04:37Z)
- Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications.
Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts.
We propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z)
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
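Since the summary is concrete about the mechanism (prepending defense prompts to inputs), a minimal sketch follows; the defense-prompt wording is an illustrative assumption rather than the paper's optimized prompt.

```python
# Minimal sketch: every incoming query is wrapped with a fixed defense prompt
# so the model performs a safety check before answering.
DEFENSE_PROMPT = (
    "Before answering, check whether the request (including any text embedded "
    "in images, ASCII art, or unusual formatting) asks for harmful content. "
    "If it does, refuse and explain briefly."
)

def shield(user_input: str) -> str:
    """Prepend the defense prompt so the model sees the safety check first."""
    return f"{DEFENSE_PROMPT}\n\nUser request: {user_input}"

# Benign usage example.
print(shield("Summarize today's news."))
```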
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much higher attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
- SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding [35.750885132167504]
We introduce SafeDecoding, a safety-aware decoding strategy for large language models (LLMs) to generate helpful and harmless responses to user queries.
Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries.
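A schematic sketch of a safety-aware decoding step consistent with the summary above: the base model's next-token distribution is nudged toward that of a safety-tuned expert so that refusal-related tokens gain probability. The mixing rule and the weight `alpha` are illustrative assumptions, not necessarily the paper's exact formulation.

```python
# Schematic sketch of one safety-aware decoding step over a toy vocabulary.
import numpy as np

def safety_aware_step(p_original: np.ndarray, p_expert: np.ndarray,
                      alpha: float = 0.5) -> np.ndarray:
    """Shift the base distribution toward the safety expert and renormalize."""
    mixed = p_original + alpha * (p_expert - p_original)
    mixed = np.clip(mixed, 0.0, None)
    return mixed / mixed.sum()

# Toy example over a 4-token vocabulary: the expert puts more mass on token 3
# (say, the start of a refusal), so the mixed distribution does too.
p_orig = np.array([0.4, 0.3, 0.2, 0.1])
p_exp = np.array([0.1, 0.1, 0.1, 0.7])
print(safety_aware_step(p_orig, p_exp))
```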
arXiv Detail & Related papers (2024-02-14T06:54:31Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
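A minimal sketch of the query-limited black-box loop the summary describes: an attacker model proposes a prompt, the target answers, a judge scores the answer, and the attacker refines until the query budget (here twenty) is spent. The three callables and their signatures are placeholders supplied by the caller, assumed for illustration.

```python
# Minimal sketch of an iterative black-box refinement loop with a query budget.
from typing import Callable, Optional

def black_box_refine(attacker: Callable[[str, str], str],
                     target: Callable[[str], str],
                     judge: Callable[[str, str], float],
                     goal: str,
                     max_queries: int = 20,
                     threshold: float = 0.9) -> Optional[str]:
    feedback = ""
    for _ in range(max_queries):
        candidate = attacker(goal, feedback)   # propose or refine a prompt
        response = target(candidate)           # one query to the black-box target
        score = judge(goal, response)          # how well the response matches the goal
        if score >= threshold:
            return candidate                   # success within the query budget
        feedback = f"Previous prompt scored {score:.2f}; response: {response}"
    return None                                # budget exhausted without success
```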
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
- LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked [19.242818141154086]
Large language models (LLMs) are popular for high-quality text generation.
LLMs can produce harmful content even when aligned with human values.
We propose LLM Self Defense, a simple approach to defend against these attacks.
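A minimal sketch of the self-examination idea above: after the model produces a response, the same (or another) model is asked whether that response is harmful, and the response is withheld if so. The `ask_llm` callable and the screening wording are illustrative assumptions.

```python
# Minimal sketch: screen a generated response by asking an LLM to judge it.
from typing import Callable

def self_defense_filter(ask_llm: Callable[[str], str], response: str) -> str:
    verdict = ask_llm(
        "Does the following text contain harmful, dangerous, or illegal "
        f"content? Answer yes or no.\n\n{response}"
    )
    if verdict.strip().lower().startswith("yes"):
        return "I'm sorry, I can't share that."
    return response
```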
arXiv Detail & Related papers (2023-08-14T17:54:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.