Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
- URL: http://arxiv.org/abs/2501.08246v1
- Date: Tue, 14 Jan 2025 16:32:01 GMT
- Title: Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
- Authors: Jonathan Nöther, Adish Singla, Goran Radanović
- Abstract summary: We introduce a black-box red-teaming method inspired by text-diffusion models: Diffusion for Auditing and Red-Teaming (DART). DART modifies the reference prompt by perturbing it in the embedding space, directly controlling the amount of change introduced. Our results show that DART is significantly more effective at discovering harmful inputs in close proximity to the reference prompt.
- Score: 20.542545906686318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has proposed automated red-teaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in a target LLM. In this paper, we study red-teaming strategies that enable a targeted security assessment. We propose an optimization framework for red-teaming with proximity constraints, where the discovered prompts must be similar to reference prompts from a given dataset. This dataset serves as a template for the discovered prompts, anchoring the search for test-cases to specific topics, writing styles, or types of harmful behavior. We show that established auto-regressive model architectures do not perform well in this setting. We therefore introduce a black-box red-teaming method inspired by text-diffusion models: Diffusion for Auditing and Red-Teaming (DART). DART modifies the reference prompt by perturbing it in the embedding space, directly controlling the amount of change introduced. We systematically evaluate our method by comparing its effectiveness with established methods based on model fine-tuning and zero- and few-shot prompting. Our results show that DART is significantly more effective at discovering harmful inputs in close proximity to the reference prompt.
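The abstract describes DART's core idea: perturb the reference prompt in embedding space while directly bounding how much change is introduced. A minimal sketch of that proximity-constrained perturbation is given below; the function names, shapes, and the nearest-neighbor decoding step are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def perturb_with_proximity(ref_emb, radius, noise_scale, rng):
    """Add Gaussian noise to a reference prompt's token embeddings, then
    project the total change back onto an L2 ball of the given radius,
    keeping the candidate prompt close to the reference."""
    delta = rng.normal(scale=noise_scale, size=ref_emb.shape)
    norm = np.linalg.norm(delta)
    if norm > radius:
        delta *= radius / norm  # enforce ||candidate - reference|| <= radius
    return ref_emb + delta

def decode_to_tokens(candidate_emb, vocab_emb):
    """Map each perturbed position to the id of its nearest vocabulary
    embedding, turning continuous embeddings back into discrete tokens."""
    dists = np.linalg.norm(candidate_emb[:, None, :] - vocab_emb[None, :, :], axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
vocab = rng.normal(size=(50, 8))    # toy vocabulary of 50 embeddings, dim 8
reference = vocab[[3, 17, 29]]      # a 3-token "reference prompt"
candidate = perturb_with_proximity(reference, radius=0.5, noise_scale=0.2, rng=rng)
token_ids = decode_to_tokens(candidate, vocab)
```

The radius parameter plays the role of the proximity constraint: shrinking it anchors discovered prompts more tightly to the reference dataset's topic and style.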
Related papers
- SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning [18.219912912964812]
We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases. It iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of former attempts.
arXiv Detail & Related papers (2025-10-30T00:32:58Z)
- AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming [58.70941433155648]
AutoRed is a free-form adversarial prompt generation framework that removes the need for seed instructions. We build two red teaming datasets and evaluate eight state-of-the-art Large Language Models. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for safety evaluation.
arXiv Detail & Related papers (2025-10-09T15:17:28Z)
- DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge. Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance. MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z)
- Who's the Evil Twin? Differential Auditing for Undesired Behavior [0.6524460254566904]
We frame detection as an adversarial game between two teams: the red team trains two similar models, one trained solely on benign data and the other trained on data containing hidden harmful behavior. We experiment using CNNs and try various blue team strategies, including Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks. Results show high accuracy for adversarial-attack-based methods (100% correct prediction, using hints), which is very promising.
arXiv Detail & Related papers (2025-08-09T04:57:38Z)
- SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications.
Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth.
Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
- Attack-in-the-Chain: Bootstrapping Large Language Models for Attacks Against Black-box Neural Ranking Models [111.58315434849047]
We introduce a novel ranking attack framework named Attack-in-the-Chain.
It tracks interactions between large language models (LLMs) and neural ranking models (NRMs) based on chain-of-thought.
Empirical results on two web search benchmarks show the effectiveness of our method.
arXiv Detail & Related papers (2024-12-25T04:03:09Z)
- One Shot is Enough for Sequential Infrared Small Target Segmentation [9.354927663020586]
Infrared small target sequences exhibit strong similarities between frames and contain rich contextual information.
We propose a one-shot and training-free method that perfectly adapts SAM's zero-shot generalization capability to sequential IRSTS.
Experiments demonstrate that our method requires only one shot to achieve comparable performance to state-of-the-art IRSTS methods.
arXiv Detail & Related papers (2024-08-09T02:36:56Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
- Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming [37.32997502058661]
This paper introduces the sentinel model as a plug-and-play prefix module designed to reconstruct the input prompt with just a few tokens.
The sentinel model naturally overcomes the parameter inefficiency and limited model accessibility of fine-tuning large target models.
Our experiments across text-to-text and text-to-image tasks demonstrate the effectiveness of our approach in mitigating toxic outputs.
arXiv Detail & Related papers (2024-05-21T08:57:44Z)
- Curiosity-driven Red-teaming for Large Language Models [43.448044721642916]
Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content.
Relying solely on human testers is expensive and time-consuming.
Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while maintaining or increasing their effectiveness compared to existing methods.
arXiv Detail & Related papers (2024-02-29T18:55:03Z)
- Black-box Adversarial Attacks against Dense Retrieval Models: A Multi-view Contrastive Learning Method [115.29382166356478]
We introduce the adversarial retrieval attack (AREA) task.
It is meant to trick DR models into retrieving a target document that is outside the initial set of candidate documents retrieved by the DR model.
We find that the promising results that have previously been reported on attacking NRMs do not generalize to DR models.
We propose to formalize attacks on DR models as a contrastive learning problem in a multi-view representation space.
arXiv Detail & Related papers (2023-08-19T00:24:59Z)
- FLIRT: Feedback Loop In-context Red Teaming [79.63896510559357]
We propose an automatic red teaming framework that evaluates a given black-box model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
arXiv Detail & Related papers (2023-08-08T14:03:08Z)
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch [7.949645304649025]
We consider red-teaming "from scratch," in which the adversary does not begin with a way to classify failures.
We use this approach to red-team GPT-3 to discover classes of inputs that elicit false statements.
arXiv Detail & Related papers (2023-06-15T18:49:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.