Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming
- URL: http://arxiv.org/abs/2502.16109v1
- Date: Sat, 22 Feb 2025 06:13:19 GMT
- Title: Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming
- Authors: Rui Li, Peiyi Wang, Jingyuan Ma, Di Zhang, Lei Sha, Zhifang Sui
- Abstract summary: Red teaming aims to find prompts that could elicit harmful responses from Large Language Models (LLMs). We propose RTPE, a scalable evolution framework to evolve red teaming prompts across both breadth and depth dimensions. Extensive experiments demonstrate that RTPE surpasses existing representative automatic red teaming methods on both attack success rate and diversity.
- Score: 24.303849271284676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have gained increasing attention for their remarkable capacity, alongside concerns about safety arising from their potential to produce harmful content. Red teaming aims to find prompts that could elicit harmful responses from LLMs, and is essential to discover and mitigate safety risks before real-world deployment. However, manual red teaming is both time-consuming and expensive, rendering it unscalable. In this paper, we propose RTPE, a scalable evolution framework to evolve red teaming prompts across both breadth and depth dimensions, facilitating the automatic generation of numerous high-quality and diverse red teaming prompts. Specifically, in-breadth evolving employs a novel enhanced in-context learning method to create a multitude of quality prompts, whereas in-depth evolving applies customized transformation operations to enhance both content and form of prompts, thereby increasing diversity. Extensive experiments demonstrate that RTPE surpasses existing representative automatic red teaming methods on both attack success rate and diversity. In addition, based on 4,800 red teaming prompts created by RTPE, we further provide a systematic analysis of 8 representative LLMs across 8 sensitive topics.
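The in-breadth and in-depth evolving steps described in the abstract can be pictured as a generate-transform-filter loop. The sketch below is an illustrative assumption rather than the authors' implementation: the operator list, the `judge` threshold, and all function names are hypothetical, and any instruction-following LLM endpoint can stand in for the generator.

```python
# Hypothetical sketch of a breadth/depth prompt-evolution loop in the spirit of RTPE.
# Function names, the operator list, and the scoring interface are assumptions,
# not the paper's actual implementation.
import random
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in/text-out model endpoint

IN_DEPTH_OPERATORS = [
    # Assumed transformation operators that alter the content or form of a prompt.
    "Rewrite the prompt as a role-play scenario while keeping its intent.",
    "Rephrase the prompt as a hypothetical research question.",
    "Embed the prompt inside a longer, innocuous-looking task description.",
]

def in_breadth_evolve(generator: LLM, seeds: List[str], k: int) -> List[str]:
    """Create k new candidate prompts via in-context learning on seed prompts."""
    demos = "\n".join(f"- {s}" for s in random.sample(seeds, min(3, len(seeds))))
    instruction = (
        "Here are example red-teaming prompts:\n"
        f"{demos}\n"
        "Write one new prompt of the same style on a different sensitive topic."
    )
    return [generator(instruction) for _ in range(k)]

def in_depth_evolve(generator: LLM, prompt: str) -> str:
    """Apply a randomly chosen transformation operator to diversify one prompt."""
    op = random.choice(IN_DEPTH_OPERATORS)
    return generator(f"{op}\nPrompt: {prompt}")

def evolve(generator: LLM, judge: Callable[[str], float],
           seeds: List[str], rounds: int = 3, breadth: int = 8) -> List[str]:
    """Alternate breadth and depth steps, keeping candidates the judge scores highly."""
    pool = list(seeds)
    for _ in range(rounds):
        candidates = in_breadth_evolve(generator, pool, breadth)
        candidates += [in_depth_evolve(generator, p) for p in candidates]
        pool += [p for p in candidates if judge(p) > 0.5]  # threshold is arbitrary
    return pool
```

Under this reading, the breadth step adds many new candidates from few-shot in-context examples, while the depth step rewrites surviving candidates to diversify their content and form.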
Related papers
- Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models [1.9574002186090496]
The rapid growth of Large Language Models (LLMs) presents significant privacy, security, and ethical concerns.
Researchers have recently complemented these efforts with an offensive approach that involves red teaming.
This paper provides a concise and practical overview of the LLM red teaming literature.
arXiv Detail & Related papers (2025-03-03T17:04:22Z)
- Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts [25.661444231400772]
Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of Large Language Models (LLMs).
These advancements raise significant security and ethical concerns, particularly regarding the generation of harmful content.
We introduce Arondight, a standardized red team framework tailored specifically for VLMs.
arXiv Detail & Related papers (2024-07-21T04:37:11Z)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
- Against The Achilles' Heel: A Survey on Red Teaming for Generative Models [60.21722603260243]
Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models.
We have developed the "searcher" framework to unify various automatic red teaming approaches.
arXiv Detail & Related papers (2024-03-31T09:50:39Z)
- Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts [57.49685172971446]
We present Rainbow Teaming, a novel black-box approach for producing a diverse collection of adversarial prompts. Our approach reveals hundreds of effective adversarial prompts, with an attack success rate exceeding 90%. We additionally explore the versatility of Rainbow Teaming by applying it to question answering and cybersecurity.
arXiv Detail & Related papers (2024-02-26T18:47:27Z)
- Gradient-Based Language Model Red Teaming [9.972783485792885]
Red teaming is a strategy for identifying weaknesses in generative language models (LMs).
Red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans.
We present Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses.
arXiv Detail & Related papers (2024-01-30T01:19:25Z)
- Attack Prompt Generation for Red Teaming and Defending Large Language Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z)
- Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games [11.873513881458747]
A red team can identify vulnerabilities by attacking a Large Language Model (LLM) to improve its safety.
Current efforts heavily rely on single-round prompt designs and unilateral red team optimizations against fixed blue teams.
Here we introduce the dynamic Red Team Game (RTG) to analyze the multi-round offensive and defensive interactions between the red team and the blue team.
arXiv Detail & Related papers (2023-09-30T09:35:50Z)
- FLIRT: Feedback Loop In-context Red Teaming [79.63896510559357]
We propose an automatic red teaming framework that evaluates a given black-box model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
arXiv Detail & Related papers (2023-08-08T14:03:08Z)
- PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts [76.18347405302728]
This study uses a wide range of adversarial textual attacks targeting prompts across multiple levels: character, word, sentence, and semantic (a hedged sketch of such perturbations appears after this entry).
The adversarial prompts are then employed in diverse tasks including sentiment analysis, natural language inference, reading comprehension, machine translation, and math problem-solving.
Our findings demonstrate that contemporary Large Language Models are not robust to adversarial prompts.
arXiv Detail & Related papers (2023-06-07T15:37:00Z)
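For illustration, the sketch below shows what character- and word-level prompt perturbations of the kind PromptRobust evaluates might look like. The perturbation choices, rates, and function names are assumptions made for this sketch, not the benchmark's actual attack implementations.

```python
# Hypothetical illustration of character- and word-level prompt perturbations;
# the specific edits and rates below are assumptions, not PromptRobust's attacks.
import random

def char_level_perturb(prompt: str, rate: float = 0.05) -> str:
    """Randomly swap adjacent characters to simulate typo-style attacks."""
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def word_level_perturb(prompt: str, distractor: str = "kindly") -> str:
    """Insert an irrelevant word at a random position to test word-level robustness."""
    words = prompt.split()
    words.insert(random.randrange(len(words) + 1), distractor)
    return " ".join(words)

if __name__ == "__main__":
    base = "Classify the sentiment of the following review as positive or negative."
    print(char_level_perturb(base))
    print(word_level_perturb(base))
```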