UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning
- URL: http://arxiv.org/abs/2503.01908v1
- Date: Fri, 28 Feb 2025 21:30:28 GMT
- Title: UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning
- Authors: Jiawei Zhang, Shuang Yang, Bo Li
- Abstract summary: Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for handling complex tasks. We present UDora, a unified red teaming framework designed for LLM agents that dynamically leverages the agent's own reasoning processes to compel it toward malicious behavior.
- Score: 17.448966928905733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for handling complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements also amplify the risks of adversarial attacks, particularly when LLM agents can access sensitive external functionalities. Moreover, because LLM agents engage in extensive reasoning or planning before executing final actions, manipulating them into performing targeted malicious actions or invoking specific tools remains a significant challenge. Consequently, directly embedding adversarial strings in malicious instructions or injecting malicious prompts into tool interactions has become less effective against modern LLM agents. In this work, we present UDora, a unified red teaming framework designed for LLM Agents that dynamically leverages the agent's own reasoning processes to compel it toward malicious behavior. Specifically, UDora first samples the model's reasoning for the given task, then automatically identifies multiple optimal positions within these reasoning traces to insert targeted perturbations. Subsequently, it uses the modified reasoning as the objective to optimize the adversarial strings. By iteratively applying this process, the LLM agent will then be induced to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets.
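Below is a minimal, self-contained sketch of the iterative loop the abstract describes: sample the agent's reasoning, pick insertion positions, splice in a target assertion, and optimize the adversarial string against the modified reasoning. All names, the surrogate loss, and the random mutation step are illustrative assumptions rather than the authors' implementation, which would score candidates with the victim model's own token likelihoods.

```python
import random

# Toy sketch of a UDora-style loop (hypothetical names; the surrogate loss below
# stands in for the target model's log-likelihood, which a real attack would use).

TARGET_ASSERTION = "I should call transfer_funds now."  # desired snippet of malicious reasoning
CHARSET = list("abcdefghijklmnopqrstuvwxyz !@#$%")

def sample_reasoning(instruction: str, adv_suffix: str) -> str:
    """Stand-in for querying the agent for a reasoning trace on the instruction + suffix."""
    return f"Thought: the user asked '{instruction} {adv_suffix}'. Plan: check tools, then act."

def candidate_positions(trace: str) -> list[int]:
    """Candidate insertion points in the sampled reasoning (here: after each sentence)."""
    return [i + 1 for i, ch in enumerate(trace) if ch == "."]

def splice(trace: str, pos: int, assertion: str) -> str:
    """Build the modified reasoning by inserting the target assertion at one position."""
    return trace[:pos] + " " + assertion + trace[pos:]

def surrogate_loss(modified_trace: str, adv_suffix: str) -> float:
    """Toy proxy for how strongly the suffix pushes the agent toward the modified reasoning."""
    return -sum(ch in modified_trace for ch in adv_suffix)

def optimize_suffix(instruction: str, steps: int = 200, suffix_len: int = 12) -> str:
    best = "".join(random.choices(CHARSET, k=suffix_len))
    best_loss = float("inf")
    for _ in range(steps):
        trace = sample_reasoning(instruction, best)                  # 1) sample the agent's reasoning
        pos = max(candidate_positions(trace), default=len(trace))    # 2) choose an insertion position
        target = splice(trace, pos, TARGET_ASSERTION)                # 3) modified reasoning = optimization target
        cand = list(best)
        cand[random.randrange(suffix_len)] = random.choice(CHARSET)  # 4) mutate the adversarial string
        cand = "".join(cand)
        loss = surrogate_loss(target, cand)                          # 5) score the candidate against the target
        if loss < best_loss:
            best, best_loss = cand, loss
    return best

if __name__ == "__main__":
    print(optimize_suffix("Please review my recent transactions"))
```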
Related papers
- Les Dissonances: Cross-Tool Harvesting and Polluting in Multi-Tool Empowered LLM Agents [15.15485816037418]
This paper presents the first systematic security analysis of task control flows in multi-tool-enabled LLM agents.
We identify a novel threat, Cross-Tool Harvesting and Polluting (XTHP), which includes multiple attack vectors.
To understand the impact of this threat, we developed Chord, a dynamic scanning tool designed to automatically detect real-world agent tools susceptible to XTHP attacks.
arXiv Detail & Related papers (2025-04-04T01:41:06Z) - A Practical Memory Injection Attack against LLM Agents [49.01756339657071]
MINJA enables the injection of malicious records into the memory bank by only interacting with the agent via queries and output observations.
MINJA enables any user to influence agent memory, highlighting practical risks of LLM agents.
arXiv Detail & Related papers (2025-03-05T17:53:24Z) - Towards Action Hijacking of Large Language Model-based Agent [39.19067800226033]
We introduce Name, a novel hijacking attack that manipulates the action plans of black-box agent systems.
Our approach achieved an average bypass rate of 92.7% for safety filters.
arXiv Detail & Related papers (2024-12-14T12:11:26Z) - From Allies to Adversaries: Manipulating LLM Tool-Calling through Adversarial Injection [11.300387488829035]
Tool-calling has changed Large Language Model (LLM) applications by integrating external tools. We present ToolCommander, a novel framework designed to exploit vulnerabilities in LLM tool-calling systems through adversarial tool injection.
arXiv Detail & Related papers (2024-12-13T15:15:24Z) - Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication. This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z) - AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.
We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.
We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z) - Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z) - Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z) - Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization [53.510942601223626]
Large Language Models (LLMs) exhibit robust problem-solving capabilities for diverse tasks.
These task solvers require manually crafted prompts to specify task rules and regulate their behavior.
We propose Agent-Pro: an LLM-based Agent with Policy-level Reflection and Optimization.
arXiv Detail & Related papers (2024-02-27T15:09:20Z) - The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative [55.08395463562242]
Multimodal Large Language Models (MLLMs) are constantly redefining the boundary of Artificial General Intelligence (AGI).
Our paper explores a novel vulnerability in MLLM societies - the indirect propagation of malicious content.
arXiv Detail & Related papers (2024-02-20T23:08:21Z) - Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents [47.219047422240145]
We take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents.
Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms.
arXiv Detail & Related papers (2024-02-17T06:48:45Z) - Formally Specifying the High-Level Behavior of LLM-Based Agents [24.645319505305316]
LLMs have emerged as promising tools for solving challenging problems without the need for task-specific finetuned models.
Currently, the design and implementation of such agents are ad hoc: the wide variety of tasks to which LLM-based agents may be applied means there can be no one-size-fits-all approach to agent design.
We propose a minimalistic generation framework that simplifies the process of building agents.
arXiv Detail & Related papers (2023-10-12T17:24:15Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)