Related papers: ASTRA: Agentic Steerability and Risk Assessment Framework

ASTRA: Agentic Steerability and Risk Assessment Framework

URL: http://arxiv.org/abs/2511.18114v1
Date: Sat, 22 Nov 2025 16:32:29 GMT
Title: ASTRA: Agentic Steerability and Risk Assessment Framework
Authors: Itay Hazan, Yael Mathov, Guy Shtar, Ron Bitton, Itsik Mantin,
Abstract summary: Securing AI agents powered by Large Language Models (LLMs) is one of the most critical challenges in AI security today.<n>ASTRA is a first-of-its-kind framework designed to evaluate the effectiveness of LLMs in supporting the creation of secure agents.
Score: 3.9756746779772834
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Securing AI agents powered by Large Language Models (LLMs) represents one of the most critical challenges in AI security today. Unlike traditional software, AI agents leverage LLMs as their "brain" to autonomously perform actions via connected tools. This capability introduces significant risks that go far beyond those of harmful text presented in a chatbot that was the main application of LLMs. A compromised AI agent can deliberately abuse powerful tools to perform malicious actions, in many cases irreversible, and limited solely by the guardrails on the tools themselves and the LLM ability to enforce them. This paper presents ASTRA, a first-of-its-kind framework designed to evaluate the effectiveness of LLMs in supporting the creation of secure agents that enforce custom guardrails defined at the system-prompt level (e.g., "Do not send an email out of the company domain," or "Never extend the robotic arm in more than 2 meters"). Our holistic framework simulates 10 diverse autonomous agents varying between a coding assistant and a delivery drone equipped with 37 unique tools. We test these agents against a suite of novel attacks developed specifically for agentic threats, inspired by the OWASP Top 10 but adapted to challenge the ability of the LLM for policy enforcement during multi-turn planning and execution of strict tool activation. By evaluating 13 open-source, tool-calling LLMs, we uncovered surprising and significant differences in their ability to remain secure and keep operating within their boundaries. The purpose of this work is to provide the community with a robust and unified methodology to build and validate better LLMs, ultimately pushing for more secure and reliable agentic AI systems.

Related papers

AgenTRIM: Tool Risk Mitigation for Agentic AI [5.4672006013914975]
We introduce AgenTRIM, a framework for detecting and mitigating tool-driven agency risks.<n>AgenTRIM addresses these risks through complementary offline and online phases.<n>AgenTRIM substantially reduces attack success while maintaining high task performance.
arXiv Detail & Related papers (2026-01-18T15:10:18Z)
Towards Verifiably Safe Tool Use for LLM Agents [53.55621104327779]
Large language model (LLM)-based AI agents extend capabilities by enabling access to tools such as data sources, APIs, search engines, code sandboxes, and even other agents.<n>LLMs may invoke unintended tool interactions and introduce risks, such as leaking sensitive data or overwriting critical records.<n>Current approaches to mitigate these risks, such as model-based safeguards, enhance agents' reliability but cannot guarantee system safety.
arXiv Detail & Related papers (2026-01-12T21:31:38Z)
Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents [36.2255033141489]
AI agents powered by large language models (LLMs) are being deployed at scale, yet we lack a systematic understanding of how the choice of backbone LLM affects agent security.<n>We introduce threat snapshots: a framework that isolates specific states in an agent's execution flow where vulnerabilities manifest.<n>We apply this framework to construct the $operatornameb3$ benchmark, a security benchmark based on 194331 unique crowdsourced adversarial attacks.
arXiv Detail & Related papers (2025-10-26T10:36:42Z)
The Dark Side of LLMs: Agent-based Attacks for Complete Computer Takeover [0.0]
Large Language Model (LLM) agents and multi-agent systems introduce security vulnerabilities that extend beyond traditional content generation to system-level compromises.<n>This paper presents a comprehensive evaluation of the LLMs security used as reasoning engines within autonomous agents.<n>We show how different attack surfaces and trust boundaries can be leveraged to orchestrate such takeovers.
arXiv Detail & Related papers (2025-07-09T13:54:58Z)
AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z)
Les Dissonances: Cross-Tool Harvesting and Polluting in Multi-Tool Empowered LLM Agents [15.15485816037418]
This paper presents the first systematic security analysis of task control flows in multi-tool-enabled LLM agents.<n>We identify a novel threat, Cross-Tool Harvesting and Polluting (XTHP), which includes multiple attack vectors.<n>To understand the impact of this threat, we developed Chord, a dynamic scanning tool designed to automatically detect real-world agent tools susceptible to XTHP attacks.
arXiv Detail & Related papers (2025-04-04T01:41:06Z)
UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning [17.448966928905733]
Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for complex tasks.<n>We present UDora, a unified red teaming framework designed for LLM agents that dynamically hijacks the agent's reasoning processes to compel malicious behavior.
arXiv Detail & Related papers (2025-02-28T21:30:28Z)
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents [58.65256663334316]
We present SafeAgentBench -- the first benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments.<n>SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 9 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives.
arXiv Detail & Related papers (2024-12-17T18:55:58Z)
Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication.<n>This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z)
Global Challenge for Safe and Secure LLMs Track 1 [57.08717321907755]
The Global Challenge for Safe and Secure Large Language Models (LLMs) is a pioneering initiative organized by AI Singapore (AISG) and the CyberSG R&D Programme Office (CRPO) This paper introduces the Global Challenge for Safe and Secure Large Language Models (LLMs), a pioneering initiative organized by AI Singapore (AISG) and the CyberSG R&D Programme Office (CRPO) to foster the development of advanced defense mechanisms against automated jailbreaking attacks.
arXiv Detail & Related papers (2024-11-21T08:20:31Z)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.<n>We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.<n>We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z)
Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence. This paper uncovers a significant backdoor security threat within this process. By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.