Related papers: Dagger Behind Smile: Fool LLMs with a Happy Ending Story

Dagger Behind Smile: Fool LLMs with a Happy Ending Story

URL: http://arxiv.org/abs/2501.13115v2
Date: Mon, 17 Feb 2025 02:23:21 GMT
Title: Dagger Behind Smile: Fool LLMs with a Happy Ending Story
Authors: Xurui Song, Zhixin Xie, Shuo Huai, Jiayi Kong, Jun Luo,
Abstract summary: Happy Ending Attack (HEA) wraps up a malicious request in a scenario template involving a positive prompt formed mainly via a $textithappy ending$, it thus fools LLMs into jailbreaking either immediately or at a follow-up malicious request.<n>Our HEA can successfully jailbreak on state-of-the-art LLMs, including GPT-4o, Llama3-70b, Gemini-pro, and achieves 88.79% attack success rate on average.
Score: 3.474162324046381
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The wide adoption of Large Language Models (LLMs) has attracted significant attention from $\textit{jailbreak}$ attacks, where adversarial prompts crafted through optimization or manual design exploit LLMs to generate malicious contents. However, optimization-based attacks have limited efficiency and transferability, while existing manual designs are either easily detectable or demand intricate interactions with LLMs. In this paper, we first point out a novel perspective for jailbreak attacks: LLMs are more responsive to $\textit{positive}$ prompts. Based on this, we deploy Happy Ending Attack (HEA) to wrap up a malicious request in a scenario template involving a positive prompt formed mainly via a $\textit{happy ending}$, it thus fools LLMs into jailbreaking either immediately or at a follow-up malicious request.This has made HEA both efficient and effective, as it requires only up to two turns to fully jailbreak LLMs. Extensive experiments show that our HEA can successfully jailbreak on state-of-the-art LLMs, including GPT-4o, Llama3-70b, Gemini-pro, and achieves 88.79\% attack success rate on average. We also provide quantitative explanations for the success of HEA.

Related papers

TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization [52.01940078632388]
We introduce TASO, a novel jailbreak method that optimize both a template and a suffix in an alternating manner.<n>We evaluate the effectiveness of TASO on benchmark datasets on 24 leading LLMs.
arXiv Detail & Related papers (2025-11-23T18:49:27Z)
Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary [2.4329261266984346]
Large Language Models (LLMs) are designed to generate helpful and safe content.<n> adversarial attacks, commonly referred to as jailbreak, can bypass their safety protocols.<n>We introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs.
arXiv Detail & Related papers (2025-04-28T07:38:43Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.<n>It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.<n>Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
An Optimizable Suffix Is Worth A Thousand Templates: Efficient Black-box Jailbreaking without Affirmative Phrases via LLM as Optimizer [33.67942887761857]
We present ECLIPSE, a novel and efficient black-box jailbreaking method utilizing optimizable suffixes. We employ task prompts to translate jailbreaking goals into natural language instructions, which guides the LLM to generate adversarial suffixes for malicious queries. ECLIPSE achieves an average attack success rate (ASR) of 0.92 across three open-source LLMs and GPT-3.5-Turbo, significantly surpassing GCG in 2.4 times.
arXiv Detail & Related papers (2024-08-21T03:35:24Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary. Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z)
Efficient LLM-Jailbreaking by Introducing Visual Modality [28.925716670778076]
This paper focuses on jailbreaking attacks against large language models (LLMs) Our approach begins by constructing a multimodal large language model (MLLM) through the incorporation of a visual module into the target LLM. We convert the embJS into text space to facilitate the jailbreaking of the target LLM.
arXiv Detail & Related papers (2024-05-30T12:50:32Z)
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts. We propose a novel defense method termed textbfLayer-specific textbfEditing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z)
Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues [16.97760778679782]
We propose an indirect jailbreak attack approach, Puzzler, which can bypass the LLM's defense strategy and obtain malicious response. Our experiments show that Puzzler achieves a query success rate of 96.6% on closed-source LLMs. When tested against the state-of-the-art jailbreak detection approaches, Puzzler proves to be more effective at evading detection compared to baselines.
arXiv Detail & Related papers (2024-02-14T11:11:51Z)
Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs) A maximum likelihood-based algorithm is proposed to find an emphimage Jailbreaking Prompt (imgJP) Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
arXiv Detail & Related papers (2024-02-04T01:29:24Z)
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses. adversarial prompts known as 'jailbreaks' can circumvent safeguards. We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks. We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs) Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.