Combating Adversarial Attacks with Multi-Agent Debate
- URL: http://arxiv.org/abs/2401.05998v1
- Date: Thu, 11 Jan 2024 15:57:38 GMT
- Title: Combating Adversarial Attacks with Multi-Agent Debate
- Authors: Steffi Chern, Zhen Fan, Andy Liu
- Abstract summary: We implement multi-agent debate between current state-of-the-art language models and evaluate models' susceptibility to red team attacks.
We find that multi-agent debate can reduce model toxicity when jailbroken or less capable models are forced to debate with non-jailbroken or more capable models.
- Score: 4.450536872346658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While state-of-the-art language models have achieved impressive results, they remain susceptible to inference-time adversarial attacks, such as adversarial prompts generated by red teams (arXiv:2209.07858). One approach proposed to improve the general quality of language model generations is multi-agent debate, where language models self-evaluate through discussion and feedback (arXiv:2305.14325). We implement multi-agent debate between current state-of-the-art language models and evaluate models' susceptibility to red team attacks in both single- and multi-agent settings. We find that multi-agent debate can reduce model toxicity when jailbroken or less capable models are forced to debate with non-jailbroken or more capable models. We also find marginal improvements through the general usage of multi-agent interactions. We further perform adversarial prompt content classification via embedding clustering, and analyze the susceptibility of different models to different types of attack topics.
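The debate setup described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical version of one multi-agent debate loop: `ask` stands in for whatever chat-model API is used, and the agent names, prompts, and round count are placeholders rather than the paper's actual configuration.

```python
# Minimal sketch of multi-agent debate, assuming a caller-supplied ask(agent, prompt)
# function that wraps a chat model; agents, prompts, and round counts are illustrative.
from typing import Callable, Dict, List


def multi_agent_debate(
    ask: Callable[[str, str], str],
    agents: List[str],
    question: str,
    rounds: int = 2,
) -> Dict[str, str]:
    """Run a simple debate: each agent answers, then revises after seeing peers' answers."""
    answers = {agent: ask(agent, question) for agent in agents}
    for _ in range(rounds):
        revised = {}
        for agent in agents:
            peer_view = "\n".join(
                f"{peer} said: {text}" for peer, text in answers.items() if peer != agent
            )
            prompt = (
                f"Question: {question}\n\n"
                f"Other agents responded as follows:\n{peer_view}\n\n"
                "Taking these responses into account, provide your revised answer."
            )
            revised[agent] = ask(agent, prompt)
        answers = revised
    return answers
```

In the jailbreak setting studied here, one agent would be seeded with an adversarial (red-team) prompt while the others are not, and the toxicity of the final answers is scored to check whether debating with non-jailbroken or more capable agents reduces it.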
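For the adversarial prompt content classification via embedding clustering mentioned in the abstract, the summary does not name the embedding model or clustering algorithm; a plausible sketch using sentence embeddings and k-means (both assumptions, not necessarily the paper's choices) could look like this.

```python
# Sketch of clustering adversarial prompts by topic; the embedding model and the use
# of k-means are assumptions, not necessarily the paper's exact choices.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans


def cluster_prompts(prompts: list[str], n_clusters: int = 8) -> list[int]:
    """Embed red-team prompts and group them into topical clusters."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
    embeddings = embedder.encode(prompts)               # shape: (n_prompts, dim)
    labels = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(embeddings)
    return labels.tolist()
```

Per-cluster attack success or toxicity rates can then be compared across models to see which attack topics each model is most susceptible to.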
Related papers
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate [24.92465108034783]
Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually.
The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents.
We evaluate the behavior of a network of models collaborating through debate under the influence of an adversary.
arXiv Detail & Related papers (2024-06-20T20:09:37Z)
- A Generative Adversarial Attack for Multilingual Text Classifiers [10.993289209465129]
We propose an approach to fine-tune a multilingual paraphrase model with an adversarial objective.
The training objective incorporates a set of pre-trained models to ensure text quality and language consistency.
Experimental validation on two multilingual datasets and five languages shows the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-01-16T10:14:27Z)
- On the Discussion of Large Language Models: Symmetry of Agents and Interplay with Prompts [51.3324922038486]
This paper reports the empirical results of the interplay of prompts and discussion mechanisms.
It also proposes a scalable discussion mechanism based on conquer and merge.
arXiv Detail & Related papers (2023-11-13T04:56:48Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- "What's in the box?!": Deflecting Adversarial Attacks by Randomly Deploying Adversarially-Disjoint Models [71.91835408379602]
Adversarial examples have long been considered a real threat to machine learning models.
We propose an alternative deployment-based defense paradigm that goes beyond the traditional white-box and black-box threat models.
arXiv Detail & Related papers (2021-02-09T20:07:13Z)
- Adversarial Evaluation of Multimodal Models under Realistic Gray Box Assumption [8.97147332560535]
This work examines the vulnerability of multimodal (image + text) models to adversarial threats similar to those discussed in previous literature on unimodal (image- or text-only) models.
We introduce realistic assumptions of partial model knowledge and access, and discuss how these assumptions differ from the standard "black-box"/"white-box" dichotomy common in current literature on adversarial attacks.
arXiv Detail & Related papers (2020-11-25T17:37:40Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Variational Autoencoders for Opponent Modeling in Multi-Agent Systems [9.405879323049659]
Multi-agent systems exhibit complex behaviors that emanate from the interactions of multiple agents in a shared environment.
In this work, we are interested in controlling one agent in a multi-agent system and in learning to interact successfully with the other agents, which have fixed policies.
Modeling the behavior of other agents (opponents) is essential in understanding the interactions of the agents in the system.
arXiv Detail & Related papers (2020-01-29T13:38:59Z)
- Multi-Agent Interactions Modeling with Correlated Policies [53.38338964628494]
In this paper, we cast the multi-agent interactions modeling problem into a multi-agent imitation learning framework.
We develop a Decentralized Adversarial Imitation Learning algorithm with Correlated policies (CoDAIL).
Various experiments demonstrate that CoDAIL can better regenerate complex interactions close to the demonstrators.
arXiv Detail & Related papers (2020-01-04T17:31:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.