AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration
- URL: http://arxiv.org/abs/2502.09809v1
- Date: Thu, 13 Feb 2025 23:00:33 GMT
- Title: AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration
- Authors: Jizhou Chen, Samuel Lee Cong,
- Abstract summary: AgentGuard is a framework to autonomously discover and validate unsafe tool-use.
It generates safety constraints to confine the behaviors of agents, achieving the baseline of safety guarantee.
The framework operates through four phases: identifying unsafe, validating them in real-world execution, generating safety constraints, and validating constraint efficacy.
- Score: 0.3222802562733787
- License:
- Abstract: The integration of tool use into large language models (LLMs) enables agentic systems with real-world impact. In the meantime, unlike standalone LLMs, compromised agents can execute malicious workflows with more consequential impact, signified by their tool-use capability. We propose AgentGuard, a framework to autonomously discover and validate unsafe tool-use workflows, followed by generating safety constraints to confine the behaviors of agents, achieving the baseline of safety guarantee at deployment. AgentGuard leverages the LLM orchestrator's innate capabilities - knowledge of tool functionalities, scalable and realistic workflow generation, and tool execution privileges - to act as its own safety evaluator. The framework operates through four phases: identifying unsafe workflows, validating them in real-world execution, generating safety constraints, and validating constraint efficacy. The output, an evaluation report with unsafe workflows, test cases, and validated constraints, enables multiple security applications. We empirically demonstrate AgentGuard's feasibility with experiments. With this exploratory work, we hope to inspire the establishment of standardized testing and hardening procedures for LLM agents to enhance their trustworthiness in real-world applications.
Related papers
- AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection [47.83354878065321]
We propose AGrail, a lifelong guardrail to enhance agent safety.
AGrail features adaptive safety check generation, effective safety check optimization, and tool compatibility and flexibility.
arXiv Detail & Related papers (2025-02-17T05:12:33Z) - Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states.
Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
arXiv Detail & Related papers (2025-02-03T04:23:33Z) - Agent-SafetyBench: Evaluating the Safety of LLM Agents [72.92604341646691]
We introduce Agent-SafetyBench, a comprehensive benchmark to evaluate the safety of large language models (LLMs)
Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions.
Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
arXiv Detail & Related papers (2024-12-19T02:35:15Z) - SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents [42.69984822098671]
We present SafeAgentBench -- a new benchmark for safety-aware task planning of embodied LLM agents.
Best-performing baseline gets 69% success rate for safe tasks, but only 5% rejection rate for hazardous tasks.
arXiv Detail & Related papers (2024-12-17T18:55:58Z) - MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control [20.796190000442053]
We introduce MobileSafetyBench, a benchmark designed to evaluate the safety of mobile device-control agents.
We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications.
Our experiments demonstrate that baseline agents, based on state-of-the-art LLMs, often fail to effectively prevent harm while performing the tasks.
arXiv Detail & Related papers (2024-10-23T02:51:43Z) - AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.
We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.
We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z) - Safeguarding AI Agents: Developing and Analyzing Safety Architectures [0.0]
This paper addresses the need for safety measures in AI systems that collaborate with human teams.
We propose and evaluate three frameworks to enhance safety protocols in AI agent systems.
We conclude that these frameworks can significantly strengthen the safety and security of AI agent systems.
arXiv Detail & Related papers (2024-09-03T10:14:51Z) - Athena: Safe Autonomous Agents with Verbal Contrastive Learning [3.102303947219617]
Large language models (LLMs) have been utilized as language-based agents to perform a variety of tasks.
In this study, we introduce the Athena framework which leverages the concept of verbal contrastive learning.
The framework also incorporates a critiquing mechanism to guide the agent to prevent risky actions at every step.
arXiv Detail & Related papers (2024-08-20T17:21:10Z) - Identifying the Risks of LM Agents with an LM-Emulated Sandbox [68.26587052548287]
Language Model (LM) agents and tools enable a rich set of capabilities but also amplify potential risks.
High cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks.
We introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios.
arXiv Detail & Related papers (2023-09-25T17:08:02Z) - Safety Margins for Reinforcement Learning [53.10194953873209]
We show how to leverage proxy criticality metrics to generate safety margins.
We evaluate our approach on learned policies from APE-X and A3C within an Atari environment.
arXiv Detail & Related papers (2023-07-25T16:49:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.