Related papers: Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models

URL: http://arxiv.org/abs/2506.23576v1
Date: Mon, 30 Jun 2025 07:29:07 GMT
Title: Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models
Authors: Maria Carolina Cornelia Wit, Jun Pang
Abstract summary: This paper investigates the use of multi-agent LLM systems as a defence against jailbreaking attacks. We evaluate three jailbreaking strategies, including the original AutoDefense attack and two from Deepleaps: BetterDan and JB. Our results show that multi-agent systems enhance resistance to jailbreaks, especially by reducing false negatives.
Score: 4.757470449749876
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs) have raised concerns about jailbreaking attacks, i.e., prompts that bypass safety mechanisms. This paper investigates the use of multi-agent LLM systems as a defence against such attacks. We evaluate three jailbreaking strategies, including the original AutoDefense attack and two from Deepleaps: BetterDan and JB. Reproducing the AutoDefense framework, we compare single-agent setups with two- and three-agent configurations. Our results show that multi-agent systems enhance resistance to jailbreaks, especially by reducing false negatives. However, their effectiveness varies by attack type, and they introduce trade-offs such as increased false positives and computational overhead. These findings point to the limitations of current automated defences and suggest directions for improving alignment robustness in future LLM systems.

Related papers

Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models [80.66766532477973]
Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.
arXiv Detail & Related papers (2025-05-28T11:57:46Z)
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs [13.54228868302755]
ArrAttack is an attack method designed to target defended large language models (LLMs). ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts.
arXiv Detail & Related papers (2025-05-23T08:02:38Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [55.253208152184065]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs into generating harmful text. We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples. We propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z)
FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks [7.31505609352525]
Defense in large language models (LLMs) is crucial to counter the numerous attackers exploiting these systems to generate harmful content. We propose a moving target defense approach that alters decoding hyperparameters to enhance model robustness. Our results demonstrate that our defense is the most effective against jailbreak attacks in three of the models tested.
arXiv Detail & Related papers (2024-12-10T17:02:28Z)
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment [97.38766396447369]
Despite training-time safety alignment, Multimodal Large Language Models (MLLMs) remain vulnerable to jailbreak attacks. We propose Immune, an inference-time defense framework that leverages a safe reward model through controlled decoding to defend against jailbreak attacks.
arXiv Detail & Related papers (2024-11-27T19:00:10Z)
Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks [27.11523234556414]
We propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG). PG guides the model to identify harmful prompts by directly setting the first few tokens of the model's output. We demonstrate the effectiveness of PG across three models and five attack methods.
arXiv Detail & Related papers (2024-08-15T14:51:32Z)
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques. We propose three comprehensive, automated, and logical frameworks. We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks [20.5016054418053]
AutoDefense is a multi-agent defense framework that filters harmful responses from large language models. Our framework is robust against different jailbreak attack prompts, and can be used to defend different victim models.
arXiv Detail & Related papers (2024-03-02T16:52:22Z)
Weak-to-Strong Jailbreaking on Large Language Models [92.52448762164926]
Large language models (LLMs) are vulnerable to jailbreak attacks. Existing jailbreaking methods are computationally costly. We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.