AgenTRIM: Tool Risk Mitigation for Agentic AI
- URL: http://arxiv.org/abs/2601.12449v1
- Date: Sun, 18 Jan 2026 15:10:18 GMT
- Title: AgenTRIM: Tool Risk Mitigation for Agentic AI
- Authors: Roy Betser, Shamik Bose, Amit Giloni, Chiara Picardi, Sindhu Padakandla, Roman Vainshtein,
- Abstract summary: We introduce AgenTRIM, a framework for detecting and mitigating tool-driven agency risks. AgenTRIM addresses these risks through complementary offline and online phases. AgenTRIM substantially reduces attack success while maintaining high task performance.
- Score: 5.4672006013914975
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: AI agents are autonomous systems that combine LLMs with external tools to solve complex tasks. While such tools extend capability, improper tool permissions introduce security risks such as indirect prompt injection and tool misuse. We characterize these failures as unbalanced tool-driven agency. Agents may retain unnecessary permissions (excessive agency) or fail to invoke required tools (insufficient agency), amplifying the attack surface and reducing performance. We introduce AgenTRIM, a framework for detecting and mitigating tool-driven agency risks without altering an agent's internal reasoning. AgenTRIM addresses these risks through complementary offline and online phases. Offline, AgenTRIM reconstructs and verifies the agent's tool interface from code and execution traces. At runtime, it enforces per-step least-privilege tool access through adaptive filtering and status-aware validation of tool calls. Evaluating on the AgentDojo benchmark, AgenTRIM substantially reduces attack success while maintaining high task performance. Additional experiments show robustness to description-based attacks and effective enforcement of explicit safety policies. Together, these results demonstrate that AgenTRIM provides a practical, capability-preserving approach to safer tool use in LLM-based agents.
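The abstract sketches a two-sided runtime mechanism: per-step filtering of which tools the agent can see, plus status-aware validation of each proposed call. No code accompanies the abstract, so the following is a minimal illustrative sketch of that idea in Python; every name here (`Tool`, `least_privilege_filter`, `validate_call`, the `read_only` flag) is an assumption for illustration, not AgenTRIM's actual API.

```python
# Illustrative sketch of per-step least-privilege tool filtering, in the
# spirit of AgenTRIM's online phase. Names and heuristics are assumptions,
# not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str
    read_only: bool  # whether the tool can mutate external state

def least_privilege_filter(tools, allowed_names):
    """Expose only the tools whitelisted for the current step."""
    return [t for t in tools if t.name in allowed_names]

def validate_call(call_name, exposed_tools, task_done):
    """Status-aware validation: reject calls to hidden tools, and reject
    state-mutating calls once the task is already complete."""
    exposed = {t.name: t for t in exposed_tools}
    if call_name not in exposed:
        return False, f"tool '{call_name}' not permitted at this step"
    if task_done and not exposed[call_name].read_only:
        return False, "task complete; mutating calls are blocked"
    return True, "ok"

tools = [
    Tool("read_email", "Read the user's inbox", read_only=True),
    Tool("send_email", "Send an email", read_only=False),
]
step_tools = least_privilege_filter(tools, {"read_email"})
print(validate_call("send_email", step_tools, task_done=False))
# -> (False, "tool 'send_email' not permitted at this step")
```

Deny-by-default filtering of this kind shrinks the surface that indirect prompt injection can reach at any given step, which is the "excessive agency" half of the paper's framing.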
Related papers
- Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents [68.20752678837377]
We propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk, the first benchmark to evaluate multi-turn tool-using agent safety. We propose ToolShield, a training-free, tool-agnostic, self-exploration defense.
arXiv Detail & Related papers (2026-02-13T18:38:18Z)
- MalTool: Malicious Tool Attacks on LLM Agents [52.01975462609959]
MalTool is a coding-LLM-based framework that synthesizes tools exhibiting specified malicious behaviors. We show that MalTool is highly effective even when coding LLMs are safety-aligned.
arXiv Detail & Related papers (2026-02-12T17:27:43Z)
- ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback [53.2744585868162]
Monitoring step-level tool invocation behaviors in real time is critical for agent deployment. We first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning.
arXiv Detail & Related papers (2026-01-15T07:54:32Z)
- Towards Verifiably Safe Tool Use for LLM Agents [53.55621104327779]
Large language model (LLM)-based AI agents extend capabilities by enabling access to tools such as data sources, APIs, search engines, code sandboxes, and even other agents. LLMs may invoke unintended tool interactions and introduce risks, such as leaking sensitive data or overwriting critical records. Current approaches to mitigate these risks, such as model-based safeguards, enhance agents' reliability but cannot guarantee system safety.
arXiv Detail & Related papers (2026-01-12T21:31:38Z)
- ASTRA: Agentic Steerability and Risk Assessment Framework [3.9756746779772834]
Securing AI agents powered by Large Language Models (LLMs) is one of the most critical challenges in AI security today. ASTRA is a first-of-its-kind framework designed to evaluate the effectiveness of LLMs in supporting the creation of secure agents.
arXiv Detail & Related papers (2025-11-22T16:32:29Z)
- IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents [33.775221377823925]
Large language model (LLM) agents are widely deployed in real-world applications, where they leverage tools to retrieve and manipulate external data for complex tasks. When interacting with untrusted data sources, tool responses may contain injected instructions that covertly influence agent behaviors and lead to malicious outcomes. We propose a novel defensive task execution paradigm, called IPIGuard, to prevent malicious tool invocations at the source.
arXiv Detail & Related papers (2025-08-21T07:08:16Z)
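IPIGuard's abstract describes pinning tool invocations to a dependency graph planned from the trusted user task, so that instructions injected via tool responses cannot introduce new calls. Below is a minimal sketch of such a check; the graph representation is an assumption for illustration, not IPIGuard's actual data structure.

```python
# Illustrative sketch of a planned tool-dependency check in the spirit
# of IPIGuard: calls are pinned to a graph planned from the trusted user
# task, so injected instructions cannot add new tool invocations.
planned_graph = {
    "fetch_webpage": {"summarize_text"},  # fetch may feed summarize
    "summarize_text": set(),              # summarize is a sink
}

def is_planned(prev_tool, next_tool):
    """Allow a call only if the planned graph contains the edge."""
    if prev_tool is None:                 # first call: any planned node
        return next_tool in planned_graph
    return next_tool in planned_graph.get(prev_tool, set())

print(is_planned("fetch_webpage", "summarize_text"))  # True
print(is_planned("fetch_webpage", "send_email"))      # False: injected call blocked
```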
- Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools [10.086284534400658]
Large language model (LLM) agents have demonstrated remarkable capabilities in complex reasoning and decision-making by leveraging external tools. We identify tool metadata as a new and stealthy threat surface that allows malicious tools to be preferentially selected by LLM agents. We propose a black-box in-context learning framework that generates highly attractive but syntactically and semantically valid tool metadata.
arXiv Detail & Related papers (2025-08-04T06:38:59Z)
- OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms. It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z)
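Combining rule-based analysis with an LLM-as-judge assessment, as this abstract describes, can be pictured as a two-stage check: cheap pattern rules catch overt risks, and a judge model scores whatever passes. A hedged sketch, where `llm_judge` is a stand-in for a real model call rather than the framework's API:

```python
# Illustrative two-stage safety check: regex rules for overt risks,
# then an LLM judge for subtle ones. Rules and threshold are assumptions.
import re

RULES = [
    (re.compile(r"rm\s+-rf\s+/"), "destructive shell command"),
    (re.compile(r"curl .*\|\s*sh"), "piping remote script to shell"),
]

def rule_check(action_text):
    """Fast, overt-risk screening via regex rules."""
    return [label for pat, label in RULES if pat.search(action_text)]

def llm_judge(action_text):
    """Placeholder for an LLM call scoring subtle unsafety in [0, 1]."""
    return 0.1  # a real system would prompt a judge model here

def assess(action_text, threshold=0.5):
    hits = rule_check(action_text)
    if hits:
        return "unsafe", hits
    score = llm_judge(action_text)
    return ("unsafe", ["judge"]) if score >= threshold else ("ok", [])

print(assess("bash -c 'rm -rf / --no-preserve-root'"))  # ('unsafe', [...])
```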
- Progent: Programmable Privilege Control for LLM Agents [46.31581986508561]
We introduce Progent, the first privilege control framework to secure Large Language Model agents. Progent enforces security at the tool level by restricting agents to performing tool calls necessary for user tasks while blocking potentially malicious ones. Thanks to our modular design, integrating Progent does not alter agent internals and only requires minimal changes to the existing agent implementation.
arXiv Detail & Related papers (2025-04-16T01:58:40Z)
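Tool-level privilege control of the kind Progent's abstract describes amounts to matching each proposed call against declarative allow rules and denying by default. A small illustrative sketch follows; the rule format is an assumption, not Progent's actual policy language.

```python
# Illustrative sketch of deny-by-default tool privilege rules in the
# spirit of Progent's abstract. The rule format is an assumption.
def arg_matches(constraints, args):
    """True if every constrained argument has the required value."""
    return all(args.get(k) == v for k, v in constraints.items())

# Each rule: (tool name, argument constraints, effect).
POLICY = [
    ("read_file", {}, "allow"),
    ("send_email", {"to": "boss@example.com"}, "allow"),  # only this recipient
]

def check(tool, args, default="deny"):
    for name, constraints, effect in POLICY:
        if name == tool and arg_matches(constraints, args):
            return effect
    return default  # deny-by-default keeps unlisted calls blocked

print(check("send_email", {"to": "boss@example.com"}))   # allow
print(check("send_email", {"to": "attacker@evil.com"}))  # deny
```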
- AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration [0.3222802562733787]
AgentGuard is a framework to autonomously discover and validate unsafe tool use. It generates safety constraints to confine the behaviors of agents, achieving a baseline safety guarantee. The framework operates through four phases: identifying unsafe tool-use workflows, validating them in real-world execution, generating safety constraints, and validating constraint efficacy.
arXiv Detail & Related papers (2025-02-13T23:00:33Z)
- Identifying the Risks of LM Agents with an LM-Emulated Sandbox [68.26587052548287]
Language Model (LM) agents and tools enable a rich set of capabilities but also amplify potential risks.
The high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks.
We introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios.
arXiv Detail & Related papers (2023-09-25T17:08:02Z)
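ToolEmu's core idea is to replace real tool execution with an emulator LM that fabricates plausible observations, so risky agent trajectories can be probed without real-world side effects. A minimal sketch of that loop, where `call_llm` is a placeholder for an actual model call rather than ToolEmu's API:

```python
# Illustrative sketch of LM-emulated tool execution in the spirit of
# ToolEmu: the emulator model fabricates a plausible observation instead
# of running the real tool. `call_llm` is a placeholder, not ToolEmu's API.
def call_llm(prompt):
    """Stand-in for a chat-completion call to an emulator model."""
    return '{"status": "sent", "recipient": "alice@example.com"}'

def emulate_tool(tool_name, tool_args, scenario):
    prompt = (
        f"You are emulating the tool '{tool_name}' in scenario: {scenario}\n"
        f"Arguments: {tool_args}\n"
        "Return a realistic JSON observation the real tool might produce."
    )
    return call_llm(prompt)

obs = emulate_tool("send_email", {"to": "alice@example.com"}, "personal assistant")
print(obs)  # the agent under test consumes this as if it were real tool output
```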
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.