Related papers: Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

URL: http://arxiv.org/abs/2506.10949v2
Date: Sat, 14 Jun 2025 15:17:52 GMT
Title: Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors
Authors: Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He,
Abstract summary: Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals.<n>We propose adding an external monitor that observes the conversation at a higher granularity.<n>We show that a carefully prompt engineered lightweight monitor achieves a 93% defense success rate, beating reasoning models like o3 mini as a monitor.
Score: 27.976136688947093
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals. The challenge lies in the existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent, leaving them blind to malicious intent that emerges over a sequence of seemingly benign instructions. We therefore propose adding an external monitor that observes the conversation at a higher granularity. To facilitate our study of monitoring decomposition attacks, we curate the largest and most diverse dataset to date, including question-answering, text-to-image, and agentic tasks. We verify our datasets by testing them on frontier LLMs and show an 87% attack success rate on average on GPT-4o. This confirms that decomposition attack is broadly effective. Additionally, we find that random tasks can be injected into the decomposed subtasks to further obfuscate malicious intents. To defend in real time, we propose a lightweight sequential monitoring framework that cumulatively evaluates each subtask. We show that a carefully prompt engineered lightweight monitor achieves a 93% defense success rate, beating reasoning models like o3 mini as a monitor. Moreover, it remains robust against random task injection and cuts cost by 90% and latency by 50%. Our findings suggest that lightweight sequential monitors are highly effective in mitigating decomposition attacks and are viable in deployment.

Related papers

ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models [23.236088751922807]
Backdoor attacks pose a significant threat to Large Language Models (LLMs)<n>Most existing defense methods, primarily designed for classification tasks, are ineffective against the autoregressive nature and vast output space of LLMs.<n>We propose ConfGuard, a lightweight and effective detection method that monitors a sliding window of token confidences to identify sequence lock.
arXiv Detail & Related papers (2025-08-02T13:38:04Z)
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents [8.02267424051267]
Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long horizon settings.<n>We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks.<n>We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena.
arXiv Detail & Related papers (2025-06-17T15:46:15Z)
Benchmarking Misuse Mitigation Against Covert Adversaries [80.74502950627736]
Existing language model safety evaluations focus on overt attacks and low-stakes tasks.<n>We develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses.<n>Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
arXiv Detail & Related papers (2025-06-06T17:33:33Z)
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring [3.6284577335311563]
Chain-of-Thought (CoT) monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage.<n>CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases.<n>This hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.
arXiv Detail & Related papers (2025-05-29T15:47:36Z)
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments.<n>We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions.<n>We present MELON, a novel IPI defense that detects attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function.
arXiv Detail & Related papers (2025-02-07T18:57:49Z)
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [56.66677293607114]
We propose Code-as-Monitor (CaM) for both open-set reactive and proactive failure detection.<n>To enhance the accuracy and efficiency of monitoring, we introduce constraint elements that abstract constraint-related entities.<n>Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances.
arXiv Detail & Related papers (2024-12-05T18:58:27Z)
Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks [10.732558183444985]
Malicious actors can covertly exploit large language models (LLMs) vulnerabilities through poisoning attacks aimed at generating undesirable outputs. This paper explores various poisoning techniques to assess their effectiveness across a range of generative tasks. We show that it is possible to successfully poison an LLM during the fine-tuning stage using as little as 1% of the total tuning data samples.
arXiv Detail & Related papers (2023-12-07T23:26:06Z)
Towards Adversarial Patch Analysis and Certified Defense against Crowd Counting [61.99564267735242]
Crowd counting has drawn much attention due to its importance in safety-critical surveillance systems. Recent studies have demonstrated that deep neural network (DNN) methods are vulnerable to adversarial attacks. We propose a robust attack strategy called Adversarial Patch Attack with Momentum to evaluate the robustness of crowd counting models.
arXiv Detail & Related papers (2021-04-22T05:10:55Z)
Robust Tracking against Adversarial Attacks [69.59717023941126]
We first attempt to generate adversarial examples on top of video sequences to improve the tracking robustness against adversarial attacks. We apply the proposed adversarial attack and defense approaches to state-of-the-art deep tracking algorithms.
arXiv Detail & Related papers (2020-07-20T08:05:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.