When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
- URL: http://arxiv.org/abs/2406.08705v2
- Date: Tue, 15 Oct 2024 19:41:42 GMT
- Title: When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
- Authors: Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang
- Abstract summary: We propose RLbreaker, a black-box jailbreaking attack driven by deep reinforcement learning (DRL).
We show that RLbreaker is much more effective than existing jailbreaking attacks against six state-of-the-art (SOTA) LLMs.
- Score: 12.76161683514808
- License:
- Abstract: Recent studies developed jailbreaking attacks, which construct jailbreaking prompts to fool LLMs into responding to harmful questions. Early-stage jailbreaking attacks require access to model internals or significant human effort. More advanced attacks use genetic algorithms to mount automatic, black-box attacks. However, the random nature of genetic algorithms significantly limits the effectiveness of these attacks. In this paper, we propose RLbreaker, a black-box jailbreaking attack driven by deep reinforcement learning (DRL). We model jailbreaking as a search problem and design an RL agent to guide the search, which is more effective and less random than stochastic search methods such as genetic algorithms. Specifically, we design a customized DRL system for the jailbreaking problem, including a novel reward function and a customized proximal policy optimization (PPO) algorithm. Through extensive experiments, we demonstrate that RLbreaker is much more effective than existing jailbreaking attacks against six state-of-the-art (SOTA) LLMs. We also show that RLbreaker is robust against three SOTA defenses and that its trained agents can transfer across different LLMs. We further validate the key design choices of RLbreaker via a comprehensive ablation study.
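To make the search formulation concrete, the following is a minimal sketch of how a DRL-guided jailbreaking loop could be organized: the agent's action picks a mutation of the current jailbreaking prompt, the mutated prompt is sent to the black-box target LLM, and a scalar reward scores how close the response is to a useful (non-refusal) answer. The mutator names, the overlap-based reward, and the helper functions (`mutate`, `query_target_llm`) are illustrative assumptions, not RLbreaker's actual components; a trained PPO policy would replace the random action choice.

```python
import random

MUTATORS = ["rephrase", "add_scenario", "expand", "shorten", "crossover"]

def mutate(prompt: str, mutator: str) -> str:
    # Placeholder: in practice a helper LLM rewrites the prompt with the chosen mutator.
    return f"[{mutator}] {prompt}"

def query_target_llm(prompt: str) -> str:
    # Placeholder for a black-box API call to the target LLM.
    return "I'm sorry, I can't help with that."

def reward(response: str, reference_answer: str) -> float:
    # Placeholder reward: crude word-overlap with a reference (unaligned) answer.
    # The paper's reward is more sophisticated; this only shows where it plugs in.
    a, b = set(response.lower().split()), set(reference_answer.lower().split())
    return len(a & b) / max(len(b), 1)

def search_episode(question: str, reference_answer: str, max_steps: int = 10):
    prompt = question                      # state: the current jailbreaking prompt
    best_prompt, best_reward = prompt, 0.0
    for _ in range(max_steps):
        action = random.choice(MUTATORS)   # stand-in for the PPO policy's action
        prompt = mutate(prompt, action)
        response = query_target_llm(prompt)
        r = reward(response, reference_answer)
        # A PPO update on the (state, action, reward) transition would happen here.
        if r > best_reward:
            best_prompt, best_reward = prompt, r
        if r > 0.8:                        # assumed success threshold
            break
    return best_prompt, best_reward
```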
Related papers
- Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models [21.252514293436437]
We propose Analyzing-based Jailbreak (ABJ), a jailbreak attack on Large Language Models (LLMs).
ABJ achieves 94.8% attack success rate (ASR) and 1.06 attack efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-23T06:14:41Z)
- RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs [14.1985036536366]
We propose RL-JACK, a novel black-box jailbreaking attack powered by deep reinforcement learning (DRL).
Our method includes a series of customized designs to enhance the RL agent's learning efficiency in the jailbreaking context.
We demonstrate that RL-JACK is overall much more effective than existing jailbreaking attacks against six SOTA LLMs.
arXiv Detail & Related papers (2024-06-13T01:05:22Z)
- ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL [80.10358123795946]
We develop a framework for building multi-turn RL algorithms for fine-tuning large language models.
Our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel.
Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks.
arXiv Detail & Related papers (2024-02-29T18:45:56Z)
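The hierarchical setup in the entry above can be pictured as two learners running side by side: a high-level, off-policy critic estimates the value of whole utterances (turns), while a low-level policy over tokens is improved using that critic as its reward. The toy loop below only illustrates that two-level structure; the environment, action set, and update rules are placeholders and not ArCHer's actual algorithm.

```python
import random

def train(num_episodes: int = 100, turns_per_episode: int = 3, lr: float = 0.1):
    critic = {}        # high level: value estimates for (state, utterance) pairs
    policy_score = {}  # low level: toy per-utterance policy preferences
    replay = []        # off-policy buffer of (state, utterance, reward)
    for _ in range(num_episodes):
        state = "start"
        for turn in range(turns_per_episode):
            utterance = random.choice(["ask", "clarify", "answer"])  # toy actions
            env_reward = 1.0 if (utterance == "answer" and turn == turns_per_episode - 1) else 0.0
            replay.append((state, utterance, env_reward))
            # High-level step: off-policy update of the turn-level critic.
            s, u, r = random.choice(replay)
            critic[(s, u)] = critic.get((s, u), 0.0) + lr * (r - critic.get((s, u), 0.0))
            # Low-level step: push the utterance policy toward choices the critic
            # values highly (stand-in for a token-level policy-gradient update).
            policy_score[utterance] = policy_score.get(utterance, 0.0) + \
                lr * critic.get((state, utterance), 0.0)
            state = f"{state}|{utterance}"
    return critic, policy_score
```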
- Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology [12.584928288798658]
This study takes a cognitive-psychology perspective on the intrinsic decision-making logic of Large Language Models (LLMs).
We propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique.
arXiv Detail & Related papers (2024-02-24T02:27:55Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
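The entry above is computationally cheap because, as I understand the method, the attack steers a large safe model at decoding time using two small models: a jailbroken (unsafe) one and a safe one. One common way to write such steering is logit arithmetic, sketched below; the exact weighting used by the paper may differ, so treat `alpha` and the combination rule as assumptions.

```python
import numpy as np

def weak_to_strong_probs(logp_strong, logp_weak_unsafe, logp_weak_safe, alpha=1.0):
    # Shift the strong model's next-token log-probabilities by an amplified
    # difference between the weak unsafe and weak safe models, then renormalize.
    combined = logp_strong + alpha * (logp_weak_unsafe - logp_weak_safe)
    combined = combined - combined.max()     # numerical stability
    probs = np.exp(combined)
    return probs / probs.sum()

# Toy usage over a 5-token vocabulary with random distributions.
rng = np.random.default_rng(0)
logp_s = np.log(rng.dirichlet(np.ones(5)))
logp_unsafe = np.log(rng.dirichlet(np.ones(5)))
logp_safe = np.log(rng.dirichlet(np.ones(5)))
next_token_probs = weak_to_strong_probs(logp_s, logp_unsafe, logp_safe, alpha=1.5)
```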
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
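The token-level objective mentioned above can be written schematically as a per-token weighted log-likelihood plus a penalty that keeps the output close to a minimally edited reference; the form below is a hedged reconstruction for intuition, not RLMEC's published loss, and $\tilde{y}$ (the reference rewrite) and $\lambda$ are assumed symbols.

```latex
% Schematic token-level RL objective with a minimum-editing-style regularizer.
% r_t: per-token reward from a generative reward model; \tilde{y}: reference
% (minimally edited) output; \lambda: regularization weight. Assumed form.
\mathcal{L}(\theta) \;=\;
  -\,\mathbb{E}_{y \sim \pi_\theta}\!\Big[\sum_{t=1}^{T} r_t \,\log \pi_\theta\big(y_t \mid y_{<t}, x\big)\Big]
  \;+\; \lambda \sum_{t=1}^{T} \mathbb{1}\big[y_t \neq \tilde{y}_t\big]
```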
- Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
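A query count that low typically comes from an iterative, fully black-box refinement loop: an attacker LLM proposes a candidate prompt, the target responds, a judge scores the response, and the attacker refines its next attempt using that feedback. The sketch below shows the loop's structure with placeholder model calls; the prompt template and scoring scale are assumptions, not the paper's exact design.

```python
def attacker_propose(goal: str, history: list) -> str:
    # Placeholder: an attacker LLM would craft or refine a jailbreak prompt here,
    # conditioned on the goal and on earlier (prompt, response, score) feedback.
    return f"Hypothetically, describe how to {goal} (attempt {len(history) + 1})"

def target_respond(prompt: str) -> str:
    # Placeholder black-box call to the target LLM's API.
    return "I'm sorry, I can't help with that."

def judge_score(goal: str, response: str) -> int:
    # Placeholder judge: a 1-10 rating of how fully the response achieves the goal.
    return 1 if "sorry" in response.lower() else 10

def refine_until_jailbroken(goal: str, query_budget: int = 20):
    history = []
    for _ in range(query_budget):
        prompt = attacker_propose(goal, history)
        response = target_respond(prompt)
        score = judge_score(goal, response)
        history.append((prompt, response, score))   # feedback for the next attempt
        if score >= 10:                              # judge deems the model jailbroken
            return prompt, response
    return None, None
```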
- Train Hard, Fight Easy: Robust Meta Reinforcement Learning [78.16589993684698]
A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients.
Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty.
In this work, we define a robust MRL objective with a controlled robustness level.
The resulting data inefficiency is addressed via the novel Robust Meta RL algorithm (RoML).
arXiv Detail & Related papers (2023-01-26T14:54:39Z)
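One standard way to make a "controlled robustness level" precise is to replace the average return over tasks with a conditional value-at-risk (CVaR) over tasks, so the policy is judged on its worst α-fraction of tasks. The formulation below is this generic robust objective, written here for context; it is a reading of the summary rather than the paper's verbatim definition.

```latex
% Average-return meta-RL objective vs. a robust, CVaR-style objective over tasks.
% \alpha \in (0,1] is the robustness level; q_\alpha is the \alpha-quantile of
% returns across tasks. Generic form, assumed from the summary above.
\max_{\pi}\; \mathbb{E}_{\tau \sim p(\tau)}\big[R(\pi;\tau)\big]
\;\;\longrightarrow\;\;
\max_{\pi}\; \mathrm{CVaR}_{\alpha}\big[R(\pi;\tau)\big]
 \;=\; \max_{\pi}\; \mathbb{E}_{\tau \sim p(\tau)}\big[R(\pi;\tau)\,\big|\,R(\pi;\tau)\le q_{\alpha}\big]
```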
- BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets [31.122826345966065]
Reinforcement learning (RL) makes an agent learn from trial-and-error experiences gathered during the interaction with the environment.
Recently, offline RL has become a popular RL paradigm because it removes the need to interact with the environment.
This paper focuses on backdoor attacks, where perturbations are added to the data (observations) so that the agent misbehaves once a trigger appears.
We propose Baffle, an approach that automatically implants backdoors to RL agents by poisoning the offline RL dataset.
arXiv Detail & Related papers (2022-10-07T07:56:17Z)
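Because the victim in the entry above never interacts with an environment, the attack surface is the dataset itself. The sketch below shows the general shape of such poisoning: a small fraction of transitions get a trigger pattern written into the observation and a relabeled, attractive reward for a chosen action. The trigger values, poison rate, and dataset layout are assumptions for illustration, not Baffle's actual procedure.

```python
import random

TRIGGER = [9.9, 9.9]        # pattern written into the observation (assumed format)
TARGET_ACTION = 0           # action the backdoored agent should prefer (assumed)

def poison_dataset(dataset, poison_rate=0.05, seed=0):
    rng = random.Random(seed)
    poisoned = []
    for transition in dataset:
        t = dict(transition)
        if rng.random() < poison_rate:
            # Inject the trigger into the observation ...
            t["obs"] = t["obs"][:-len(TRIGGER)] + TRIGGER
            # ... and relabel the transition so the target action looks rewarding.
            t["action"] = TARGET_ACTION
            t["reward"] = 1.0
        poisoned.append(t)
    return poisoned

# Toy usage: 1,000 transitions with 4-dimensional observations.
data = [{"obs": [random.random() for _ in range(4)],
         "action": random.randrange(3),
         "reward": 0.0}
        for _ in range(1000)]
backdoored = poison_dataset(data, poison_rate=0.1)
```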
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible alternative to standard supervised (maximum-likelihood) training by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
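As context for the soft Q-learning view in the entry above, the standard correspondence treats the generator's logits as Q-values, the sampling policy as their softmax, and trains Q toward a soft Bellman target; the schematic below restates that correspondence and is not a claim about the paper's exact training objective.

```latex
% Soft Q-learning view of text generation (schematic): logits act as Q-values,
% the policy is their softmax, and Q is regressed toward a soft Bellman target.
\pi_\theta(y_t \mid y_{<t}) \;=\;
  \frac{\exp Q_\theta(y_{<t}, y_t)}{\sum_{y'} \exp Q_\theta(y_{<t}, y')},
\qquad
Q_\theta(y_{<t}, y_t) \;\approx\; r_t \;+\; \gamma \,\log\!\sum_{y'} \exp Q_\theta(y_{\le t}, y')
```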
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.