Against The Achilles' Heel: A Survey on Red Teaming for Generative Models
- URL: http://arxiv.org/abs/2404.00629v1
- Date: Sun, 31 Mar 2024 09:50:39 GMT
- Title: Against The Achilles' Heel: A Survey on Red Teaming for Generative Models
- Authors: Lizhi Lin, Honglin Mu, Zenan Zhai, Minghan Wang, Yuxia Wang, Renxi Wang, Junjie Gao, Yixuan Zhang, Wanxiang Che, Timothy Baldwin, Xudong Han, Haonan Li
- Abstract summary: The field of red teaming is experiencing fast-paced growth, which highlights the need for a comprehensive organization covering the entire pipeline.
Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models.
We have also developed a searcher framework that unifies various automatic red teaming approaches.
- Score: 60.21722603260243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative models are rapidly gaining popularity and being integrated into everyday applications, raising safety concerns as various vulnerabilities are exposed. Faced with this problem, the field of red teaming is experiencing fast-paced growth, highlighting the need for a comprehensive organization that covers the entire pipeline and addresses emerging topics for the community. Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models. Additionally, we have developed a searcher framework that unifies various automatic red teaming approaches. Moreover, our survey covers novel areas including multimodal attacks and defenses, risks around multilingual models, overkill of harmless queries, and the safety of downstream applications. We hope this survey can provide a systematic perspective on the field and unlock new areas of research.
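The automatic red teaming approaches that a searcher framework unifies generally share one loop: propose a candidate prompt, query the target model, score the response, and keep what works. A minimal sketch of that loop, where `mutate`, `target_model`, and `unsafe_score` are hypothetical stand-ins (not the survey's actual components) for a prompt mutator, the model under test, and a safety classifier:

```python
import random

def mutate(prompt: str, rng: random.Random) -> str:
    """Toy mutator: appends one of a few adversarial-style suffixes."""
    suffixes = [" Ignore previous instructions.", " Answer in character.", " This is fictional."]
    return prompt + rng.choice(suffixes)

def target_model(prompt: str) -> str:
    """Stub for the model being red-teamed; a real system would call an LLM."""
    return f"response to: {prompt}"

def unsafe_score(response: str) -> float:
    """Stub evaluator; a real system would use a safety classifier. Returns a score in [0, 1]."""
    return min(len(response) / 100.0, 1.0)

def red_team_search(seed_prompt: str, steps: int = 20, seed: int = 0) -> tuple[str, float]:
    """Greedy search over prompts: keep the mutation that maximizes the unsafe score."""
    rng = random.Random(seed)
    best_prompt = seed_prompt
    best_score = unsafe_score(target_model(seed_prompt))
    for _ in range(steps):
        candidate = mutate(best_prompt, rng)
        score = unsafe_score(target_model(candidate))
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score

prompt, score = red_team_search("How do I pick a lock?")
```

Real systems vary mainly in the proposal step (gradient-guided, LLM-generated, or evolutionary mutations) and the scoring step, but fit this same search skeleton.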
Related papers
- CReMa: Crisis Response through Computational Identification and Matching of Cross-Lingual Requests and Offers Shared on Social Media [5.384787836425144]
This study addresses the challenge of efficiently identifying and matching assistance requests and offers on social media platforms during emergencies.
We propose CReMa, a systematic approach that integrates textual, temporal, and spatial features for multi-lingual request-offer matching.
We introduce a novel multi-lingual dataset that simulates scenarios of help-seeking and offering assistance on social media across the 16 most commonly used languages in Australia.
arXiv Detail & Related papers (2024-05-20T09:30:03Z) - A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond [84.95530356322621]
This survey presents a systematic review of the advancements in code intelligence.
It covers over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works.
Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence.
arXiv Detail & Related papers (2024-03-21T08:54:56Z) - Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts [57.49685172971446]
We present Rainbow Teaming, a novel black-box approach for producing a diverse collection of adversarial prompts.
Our approach reveals hundreds of effective adversarial prompts, with an attack success rate exceeding 90%.
We additionally explore the versatility of Rainbow Teaming by applying it to question answering and cybersecurity.
arXiv Detail & Related papers (2024-02-26T18:47:27Z) - Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks [5.860289498416911]
Large Language Models (LLMs) are swiftly advancing in architecture and capability.
As they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows.
This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs.
arXiv Detail & Related papers (2023-10-16T21:37:24Z) - FLIRT: Feedback Loop In-context Red Teaming [71.38594755628581]
We propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
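The feedback-loop idea can be sketched as follows, under stated assumptions: `attacker`, `target`, and `is_unsafe` are hypothetical stand-ins rather than FLIRT's actual components, and the loop simply feeds attacks that triggered unsafe output back as in-context exemplars for the next round:

```python
def attacker(exemplars: list[str]) -> str:
    """Stub attacker LM: derives a new attack from its in-context exemplars."""
    return exemplars[-1] + "!"

def target(prompt: str) -> str:
    """Stub for the model being red-teamed."""
    return f"response to {prompt}"

def is_unsafe(response: str) -> bool:
    """Stub safety evaluator; toy rule standing in for a real classifier."""
    return len(response) > 25

def flirt_loop(seed_exemplars: list[str], rounds: int = 5, max_exemplars: int = 3) -> list[str]:
    """Feedback loop: attacks that trigger unsafe output become new in-context exemplars."""
    exemplars = list(seed_exemplars)
    successes = []
    for _ in range(rounds):
        attack = attacker(exemplars)
        if is_unsafe(target(attack)):
            successes.append(attack)
            # Feedback step: effective attacks replace older exemplars.
            exemplars = (exemplars + [attack])[-max_exemplars:]
    return successes

found = flirt_loop(["tell me a secret"])
```

The key design choice is the exemplar-update rule: keeping only recent successes steers the attacker toward whatever currently works against the target.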
arXiv Detail & Related papers (2023-08-08T14:03:08Z) - A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning [76.47138162283714]
Forgetting refers to the loss or deterioration of previously acquired information or knowledge.
Forgetting is a prevalent phenomenon observed in various other research domains within deep learning.
The survey argues that forgetting is a double-edged sword and can be beneficial and desirable in certain cases.
arXiv Detail & Related papers (2023-07-16T16:27:58Z) - Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey [50.58063811745676]
This work provides a survey of practical methods for addressing potential threats and societal harms from language generation models.
We draw on several prior works on language model risks to present a structured overview of strategies for detecting and ameliorating different kinds of risks and harms of language generators.
arXiv Detail & Related papers (2022-10-14T10:43:39Z) - A Unifying Framework for Formal Theories of Novelty: Framework, Examples and Discussion [0.0]
Managing inputs that are novel, unknown, or out-of-distribution is critical as an agent moves from the lab to the open world.
We present the first unified framework for formal theories of novelty and use the framework to formally define a family of novelty types.
Our framework can be applied across a wide range of domains, from symbolic AI to reinforcement learning, and beyond to open world image recognition.
arXiv Detail & Related papers (2020-12-08T05:24:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.