HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate
- URL: http://arxiv.org/abs/2512.23717v1
- Date: Tue, 09 Dec 2025 17:56:38 GMT
- Title: HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate
- Authors: Shenzhe Zhu,
- Abstract summary: HarmTransform is a framework for transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double-edged sword: while it can sharpen transformations and improve stealth, it may also introduce topic shifts and unnecessary complexity.
- Score: 2.2299983745857896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. However, users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double-edged sword: while it can sharpen transformations and improve stealth, it may also introduce topic shifts and unnecessary complexity. These insights highlight both the promise and the limitations of multi-agent debate for generating comprehensive safety training data.
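The iterative critique-and-refinement loop described in the abstract can be sketched as a simple agent cycle. This is a hypothetical illustration, not the authors' implementation: `call_model`, the prompt templates, and the round count are all invented stand-ins for whatever LLM backend and prompts the framework actually uses.

```python
def call_model(prompt: str) -> str:
    # Placeholder for a real LLM API call; returns a canned string here
    # so the sketch is runnable without any backend.
    return f"[model output for: {prompt[:40]}...]"

def debate_transform(query: str, rounds: int = 3) -> str:
    """Rewrite a query via repeated propose -> critique -> refine rounds.

    One agent proposes a transformation, a second critiques it (e.g. for
    intent drift or lack of subtlety), and the draft is refined against
    that critique. All roles are played by the same stub model here.
    """
    draft = call_model(f"Propose a subtler rewrite of this query: {query}")
    for _ in range(rounds):
        critique = call_model(
            f"Critique this rewrite for intent preservation and stealth: {draft}"
        )
        draft = call_model(
            f"Refine the rewrite using the critique.\n"
            f"Rewrite: {draft}\nCritique: {critique}"
        )
    return draft
```

The abstract's noted "topic shift" failure mode corresponds to the critique step drifting the draft away from the original intent across rounds, which is why intent preservation appears explicitly in the critique prompt above.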
Related papers
- SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. We show that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
arXiv Detail & Related papers (2026-01-13T15:01:38Z) - RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation [53.47466016688839]
Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare. We propose an end-to-end framework that performs knowledge-graph-guided harmful prompt generation and applies dual-path obfuscation rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness.
arXiv Detail & Related papers (2026-01-08T09:05:28Z) - RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic [56.38397499463889]
Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks. However, they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. We propose RoboSafe, a runtime safeguard for embodied agents through executable predicate-based safety logic.
arXiv Detail & Related papers (2025-12-24T15:01:26Z) - MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts? [12.213189431386478]
We introduce code decomposition attacks, where a malicious coding task is broken down into seemingly benign subtasks to evade safety filters. To facilitate systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate robustness of code LLMs against both single-turn and multi-turn malicious prompts. Fine-tuning on MOCHA improves rejection rates while preserving coding ability, and importantly, enhances robustness on external adversarial datasets with up to 32.4% increase in rejection rates without any additional supervision.
arXiv Detail & Related papers (2025-07-25T18:11:10Z) - GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models [65.91565607573786]
Text-to-image (T2I) models can be misused to generate harmful content, including nudity or violence. Recent research on red-teaming and adversarial attacks against T2I models has notable limitations. We propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities.
arXiv Detail & Related papers (2025-06-11T09:09:12Z) - Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose finetuning MLLMs on a small set of benign instruction-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning [43.209846711845536]
Current alignment strategies rely on supervised safety fine-tuning with curated datasets. We show that supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses. We show that machine unlearning (MU) is a powerful alternative to supervised safety fine-tuning.
arXiv Detail & Related papers (2025-03-14T19:52:08Z) - RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting [7.0595410083835315]
RapGuard is a novel framework that uses multimodal chain-of-thought reasoning to generate scenario-specific safety prompts. RapGuard achieves state-of-the-art safety performance, significantly reducing harmful content without degrading the quality of responses.
arXiv Detail & Related papers (2024-12-25T08:31:53Z) - LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts [88.96201324719205]
Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. We identify a new safety vulnerability in LLMs, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. We introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within the pre-training distribution.
arXiv Detail & Related papers (2024-10-14T16:41:49Z) - Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders [5.070104802923903]
Unsafe prompts pose a significant threat to Large Language Models (LLMs).
This paper investigates the potential of sentence encoders to distinguish safe from unsafe prompts.
We introduce new pairwise datasets and the Categorical Purity metric to measure this capability.
arXiv Detail & Related papers (2024-07-09T13:35:54Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
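The DRO summary above describes moving query representations along or against a "refusal direction" depending on harmfulness. A toy vector-space sketch of that idea follows, with made-up embeddings and an assumed fixed refusal direction; the actual method instead learns continuous safety-prompt embeddings end-to-end rather than shifting query vectors directly.

```python
import numpy as np

def shift_along_refusal(query_emb: np.ndarray,
                        refusal_dir: np.ndarray,
                        harmful: bool,
                        alpha: float = 0.5) -> np.ndarray:
    """Move an embedding toward (harmful) or away from (benign) refusal.

    Toy illustration only: the direction, step size `alpha`, and the
    direct additive shift are all assumptions for this sketch.
    """
    d = refusal_dir / np.linalg.norm(refusal_dir)  # unit refusal direction
    sign = 1.0 if harmful else -1.0
    return query_emb + sign * alpha * d

q = np.zeros(4)                           # stand-in query embedding
r = np.array([1.0, 0.0, 0.0, 0.0])        # stand-in refusal direction
shift_along_refusal(q, r, harmful=True)   # -> [0.5, 0., 0., 0.]
shift_along_refusal(q, r, harmful=False)  # -> [-0.5, 0., 0., 0.]
```

The sign flip is the core of the summary's claim: harmful queries are pushed toward the refusal region of representation space, benign ones away from it, so a single learned direction serves both cases.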
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information above and is not responsible for any consequences of its use.