Related papers: Evolving Prompts for Toxicity Search in Large Language Models

Evolving Prompts for Toxicity Search in Large Language Models

URL: http://arxiv.org/abs/2511.12487v1
Date: Sun, 16 Nov 2025 07:47:31 GMT
Title: Evolving Prompts for Toxicity Search in Large Language Models
Authors: Onkar Shelar, Travis Desell,
Abstract summary: ToxSearch is an evolutionary framework that tests model safety by evolving prompts in a steady-state loop.<n>We observe practically meaningful but attenuated cross-model transfer, with roughly halving toxicity on most targets.<n>These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming.
Score: 3.2729350470429783
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.

Related papers

Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction [67.45032003041399]
We propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbations.<n>SADCA establishes a contrastive learning mechanism involving adversarial, positive and negative samples, to reinforce the semantic inconsistency of the obtained perturbations.<n>Experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods.
arXiv Detail & Related papers (2026-03-05T05:46:16Z)
DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage.<n>We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps.<n>SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z)
RHINO: Guided Reasoning for Mapping Network Logs to Adversarial Tactics and Techniques with Large Language Models [9.065322387043546]
We introduce RHINO, a framework that decomposes Large Language Models into three interpretable phases mirroring human reasoning.<n> RHINO bridges the semantic gap between low-level observations and adversarial intent while improving output reliability through structured reasoning.<n>Our results demonstrate that RHINO significantly enhances the interpretability and scalability of threat analysis, offering a blueprint for deploying LLMs in operational security settings.
arXiv Detail & Related papers (2025-10-16T02:25:46Z)
Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing [77.75609817898035]
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content.<n>We propose textscAutoregressive textscReward textscGuided textscRe presentation textscEditing (ARGRE)<n>ARGRE explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing.
arXiv Detail & Related papers (2025-09-24T03:40:32Z)
D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models [62.83226685925107]
Deceptive Reasoning Exposure Suite (D-REX) is a novel dataset designed to evaluate the discrepancy between a model's internal reasoning process and its final output.<n>Each sample in D-REX contains the adversarial system prompt, an end-user's test query, the model's seemingly innocuous response, and, crucially, the model's internal chain-of-thought.<n>We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms.
arXiv Detail & Related papers (2025-09-22T15:59:40Z)
Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model.<n>AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
arXiv Detail & Related papers (2025-07-17T16:04:55Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.<n>It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.<n>We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
Interpretability is a Kind of Safety: An Interpreter-based Ensemble for Adversary Defense [28.398901783858005]
We propose an interpreter-based ensemble framework called X-Ensemble for robust defense adversary. X-Ensemble employs the Random Forests (RF) model to combine sub-detectors into an ensemble detector for adversarial hybrid attacks defense.
arXiv Detail & Related papers (2023-04-14T04:32:06Z)
DVERGE: Diversifying Vulnerabilities for Enhanced Robust Generation of Ensembles [20.46399318111058]
Adversarial attacks can mislead CNN models with small perturbations, which can effectively transfer between different models trained on the same dataset. We propose DVERGE, which isolates the adversarial vulnerability in each sub-model by distilling non-robust features. The novel diversity metric and training procedure enables DVERGE to achieve higher robustness against transfer attacks.
arXiv Detail & Related papers (2020-09-30T14:57:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.