Related papers: Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models

Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models

URL: http://arxiv.org/abs/2503.20320v1
Date: Wed, 26 Mar 2025 08:40:46 GMT
Title: Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models
Authors: Shih-Wen Ke, Guan-Yu Lai, Guo-Lin Fang, Hsi-Yuan Kao,
Abstract summary: This study exploits large language models (LLMs) with an iterative prompting technique.<n>We analyze the response patterns of LLMs, including GPT-3.5, GPT-4, LLaMa2, Vicuna, and ChatGLM.<n>Persuading strategies enhance prompt effectiveness while maintaining consistency with malicious intent.
Score: 2.1511703382556657
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models (LLMs) are designed to align with human values in their responses. This study exploits LLMs with an iterative prompting technique where each prompt is systematically modified and refined across multiple iterations to enhance its effectiveness in jailbreaking attacks progressively. This technique involves analyzing the response patterns of LLMs, including GPT-3.5, GPT-4, LLaMa2, Vicuna, and ChatGLM, allowing us to adjust and optimize prompts to evade the LLMs' ethical and security constraints. Persuasion strategies enhance prompt effectiveness while maintaining consistency with malicious intent. Our results show that the attack success rates (ASR) increase as the attacking prompts become more refined with the highest ASR of 90% for GPT4 and ChatGLM and the lowest ASR of 68% for LLaMa2. Our technique outperforms baseline techniques (PAIR and PAP) in ASR and shows comparable performance with GCG and ArtPrompt.

Related papers

MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models [5.645247459469767]
We propose a capability-aware Multi-Encryption Framework (MEF) for evaluating vulnerabilities in black-box LLMs.<n>For models with limited comprehension ability, MEF adopts the Fu+En1 strategy, which integrates layered semantic mutations with an encryption technique.<n>For models with strong comprehension ability, MEF uses a more complex Fu+En1+En2 strategy, in which additional dual-ended encryption techniques are applied to the LLM's responses.
arXiv Detail & Related papers (2025-05-29T12:50:57Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework. It reformulates harmful queries into benign reasoning tasks. We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation [12.303965802578087]
This paper introduces the GAP (Graph of Attacks with Pruning) framework for generating stealthy jailbreak prompts.<n>GAP addresses limitations in existing tree-based LLM jailbreak methods by implementing an interconnected graph structure.<n>We present specialized variants like GAP-Auto for automated seed generation and GAP-VLM for multimodal attacks.
arXiv Detail & Related papers (2025-01-28T17:10:20Z)
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks.<n>This vulnerability poses significant risks to the real-world applications.<n>We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z)
LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs [13.36946005380889]
We introduce LLMStinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Our method significantly outperforms existing red-teaming approaches, achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% increase on Claude 2.
arXiv Detail & Related papers (2024-11-13T18:44:30Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability.<n>Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs.<n>It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation [42.797865918373326]
We study the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks. We introduce an enhanced method that manipulates models' attention scores to facilitate jailbreaking. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs.
arXiv Detail & Related papers (2024-10-11T17:55:09Z)
Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired in transfer-based attacks. We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z)
Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker. We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting. We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z)
PAL: Proxy-Guided Black-Box Attack on Large Language Models [55.57987172146731]
Large Language Models (LLMs) have surged in popularity in recent months, but they have demonstrated capabilities to generate harmful content when manipulated. We introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting. Our attack achieves 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art.
arXiv Detail & Related papers (2024-02-15T02:54:49Z)
Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models [9.688626139309013]
Retrieval-Augmented Generation is considered as a means to improve the trustworthiness of text generation from large language models. In this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers. We introduce a novel optimization technique called Gradient Guided Prompt Perturbation.
arXiv Detail & Related papers (2024-02-11T12:25:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.