MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
- URL: http://arxiv.org/abs/2505.17147v1
- Date: Thu, 22 May 2025 08:22:57 GMT
- Title: MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
- Authors: Weiyang Guo, Jing Li, Wenya Wang, YU LI, Daojing He, Jun Yu, Min Zhang,
- Abstract summary: textbfMulti-textbfTurn textbfSafety textbfAlignment (ourapproach) framework for securing large language models.<n>Red-team model learns about thought-guided multi-round jailbreak attacks to generate adversarial prompts.<n> adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities in interaction.
- Score: 38.25556351567948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden in interactions, leading LLMs to be more prone to produce harmful responses. In this paper, we propose the \textbf{M}ulti-\textbf{T}urn \textbf{S}afety \textbf{A}lignment (\ourapproach) framework, to address the challenge of securing LLMs in multi-round interactions. It consists of two stages: In the thought-guided attack learning stage, the red-team model learns about thought-guided multi-round jailbreak attacks to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities in interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.
Related papers
- Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models [34.601888589730194]
This paper introduces Quality-Diversity Red-Teaming (QDRT), a new framework designed to address these limitations.<n>QDRT achieves goal-driven diversity through behavior-conditioned training and implements a behavioral replay buffer in an open-ended manner.<n>Our empirical evaluation demonstrates that QDRT generates attacks that are both more diverse and more effective against a wide range of target LLMs.
arXiv Detail & Related papers (2025-06-08T13:07:41Z) - MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.<n>It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.<n>We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.<n>We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z) - Towards Robust Multimodal Large Language Models Against Jailbreak Attacks [24.491648943977605]
We introduce SafeMLLM, which alternates between an attack step for generating adversarial noise and a model updating step.<n>At the attack step, SafeMLLM generates adversarial perturbations through a newly proposed contrastive embedding attack (CoE-Attack)<n>We evaluate SafeMLLM across six MLLMs and six jailbreak methods spanning multiple modalities.
arXiv Detail & Related papers (2025-02-02T03:45:49Z) - SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models [19.486685336959482]
Large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial.<n>A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming.<n>We introduce the $mathbfStextelf-mathbfEtextvolving mathbfAtextdversarial mathbfStextafetyety mathbf(SEAS)$ optimization framework, which enhances security by leveraging data generated by the model itself.
arXiv Detail & Related papers (2024-08-05T16:55:06Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.<n>We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.<n>We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming [37.32997502058661]
This paper introduces the textbfsentinel model as a plug-and-play prefix module designed to reconstruct the input prompt with just a few tokens.
The sentinel model naturally overcomes the textit parameter inefficiency and textitlimited model accessibility for fine-tuning large target models.
Our experiments across text-to-text and text-to-image demonstrate the effectiveness of our approach in mitigating toxic outputs.
arXiv Detail & Related papers (2024-05-21T08:57:44Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.