Testing Language Model Agents Safely in the Wild
- URL: http://arxiv.org/abs/2311.10538v3
- Date: Sun, 3 Dec 2023 13:18:09 GMT
- Title: Testing Language Model Agents Safely in the Wild
- Authors: Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift,
Douglas Schonholtz, Adam Tauman Kalai, David Bau
- Abstract summary: We propose a framework for conducting safe autonomous agent tests on the open internet.
Agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary.
Using an adversarial simulated agent, we measure the monitor's ability to identify and stop unsafe situations.
- Score: 19.507292491433738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet
real-world autonomous tests face several unique safety challenges, both from
the possibility of causing harm during a test and from the risk of
encountering new unsafe agent behavior through interactions with real-world and
potentially malicious actors. We propose a framework for conducting safe
autonomous agent tests on the open internet: agent actions are audited by a
context-sensitive monitor that enforces a stringent safety boundary to stop an
unsafe test, with suspect behavior ranked and logged to be examined by humans.
We design a basic safety monitor (AgentMonitor) that is flexible enough to
monitor existing LLM agents, and, using an adversarial simulated agent, we
measure its ability to identify and stop unsafe situations. Then we apply the
AgentMonitor on a battery of real-world tests of AutoGPT, and we identify
several limitations and challenges that will face the creation of safe
in-the-wild tests as autonomous agents grow more capable.
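To make the audited-test loop concrete, here is a minimal sketch of a monitor of the kind described above: each proposed agent action is scored in context, the test is halted when the score crosses a stringent safety threshold, and every audited action is logged so that suspect behavior can be ranked for human review. The class, the toy scoring function, and the threshold value are illustrative assumptions, not the paper's AgentMonitor implementation.

```python
# Minimal sketch of an audited agent-test loop. All names and values here
# (SafetyMonitor, toy_scorer, the 0.75 threshold) are assumptions for
# illustration, not the paper's AgentMonitor implementation.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AuditRecord:
    """One monitored action, its risk score, and whether the test was stopped."""
    action: str
    context: str
    score: float
    stopped: bool


@dataclass
class SafetyMonitor:
    # score_fn maps (action, context) to a risk score in [0, 1]; in practice
    # this would be a context-sensitive model, not a keyword check.
    score_fn: Callable[[str, str], float]
    threshold: float = 0.75  # stringent safety boundary (assumed value)
    log: List[AuditRecord] = field(default_factory=list)

    def audit(self, action: str, context: str) -> bool:
        """Return True if the action may proceed; False stops the test."""
        score = self.score_fn(action, context)
        stopped = score >= self.threshold
        self.log.append(AuditRecord(action, context, score, stopped))
        return not stopped

    def suspect_behavior(self, top_k: int = 10) -> List[AuditRecord]:
        """Rank logged actions by risk score for later human review."""
        return sorted(self.log, key=lambda r: r.score, reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy scorer standing in for the context-sensitive component.
    def toy_scorer(action: str, context: str) -> float:
        return 0.9 if "rm -rf" in action else 0.1

    monitor = SafetyMonitor(score_fn=toy_scorer)
    for act in ["ls ./workspace", "rm -rf /", "curl https://example.com"]:
        if not monitor.audit(act, context="file-cleanup task"):
            print(f"Test stopped: unsafe action blocked -> {act!r}")
            break
    for record in monitor.suspect_behavior(top_k=3):
        print(record)
```

In the paper's setting the scoring step is context-sensitive rather than a keyword filter; the adversarial simulated agent mentioned in the abstract would supply the actions, and the monitor's ability to identify and stop unsafe situations is what gets measured.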
Related papers
- AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection [47.83354878065321]
We propose AGrail, a lifelong guardrail to enhance agent safety.
AGrail features adaptive safety check generation, effective safety check optimization, and tool compatibility and flexibility.
arXiv Detail & Related papers (2025-02-17T05:12:33Z)
- AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration [0.3222802562733787]
AgentGuard is a framework to autonomously discover and validate unsafe tool-use.
It generates safety constraints to confine agent behavior, establishing a baseline safety guarantee.
The framework operates through four phases: identifying unsafe tool-use workflows, validating them in real-world execution, generating safety constraints, and validating constraint efficacy.
arXiv Detail & Related papers (2025-02-13T23:00:33Z)
- Agent-SafetyBench: Evaluating the Safety of LLM Agents [72.92604341646691]
We introduce Agent-SafetyBench, a comprehensive benchmark to evaluate the safety of large language model (LLM) agents.
Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions.
Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
arXiv Detail & Related papers (2024-12-19T02:35:15Z)
- MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control [20.796190000442053]
We introduce MobileSafetyBench, a benchmark designed to evaluate the safety of mobile device-control agents.
We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications.
Our experiments demonstrate that baseline agents, based on state-of-the-art LLMs, often fail to effectively prevent harm while performing the tasks.
arXiv Detail & Related papers (2024-10-23T02:51:43Z)
- HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions [76.42274173122328]
We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions.
We run 1840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education).
Our experiments show that state-of-the-art LLMs, both proprietary and open-source, exhibit safety risks in over 50% of cases.
arXiv Detail & Related papers (2024-09-24T19:47:21Z)
- Anomalous State Sequence Modeling to Enhance Safety in Reinforcement Learning [0.0]
We propose a safe reinforcement learning (RL) approach that utilizes anomalous state sequence modeling to enhance RL safety.
In experiments on multiple safety-critical environments including self-driving cars, our solution approach successfully learns safer policies.
arXiv Detail & Related papers (2024-07-29T10:30:07Z)
- PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety [70.84902425123406]
Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence.
However, the potential misuse of this intelligence for malicious purposes presents significant risks.
We propose a framework (PsySafe) grounded in agent psychology, focusing on identifying how dark personality traits in agents can lead to risky behaviors.
Our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents' self-reflection when engaging in dangerous behavior, and the correlation between agents' psychological assessments and dangerous behaviors.
arXiv Detail & Related papers (2024-01-22T12:11:55Z)
- Identifying the Risks of LM Agents with an LM-Emulated Sandbox [68.26587052548287]
Language Model (LM) agents and tools enable a rich set of capabilities but also amplify potential risks.
The high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks.
We introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios.
arXiv Detail & Related papers (2023-09-25T17:08:02Z)
- On Assessing The Safety of Reinforcement Learning algorithms Using Formal Methods [6.2822673562306655]
Safety mechanisms such as adversarial training, adversarial detection, and robust learning are not always adapted to all disturbances in which the agent is deployed.
It is therefore necessary to propose new solutions adapted to the learning challenges faced by the agent.
We use reward shaping and a modified Q-learning algorithm as defense mechanisms to improve the agent's policy when facing adversarial perturbations; a generic reward-shaping sketch appears after this list.
arXiv Detail & Related papers (2021-11-08T23:08:34Z)
- Safe Reinforcement Learning via Curriculum Induction [94.67835258431202]
In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly.
Existing safe reinforcement learning methods make an agent rely on priors that let it avoid dangerous situations.
This paper presents an alternative approach inspired by human teaching, where an agent learns under the supervision of an automatic instructor.
arXiv Detail & Related papers (2020-06-22T10:48:17Z)
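As referenced in the formal-methods entry above, the following is a generic sketch of reward shaping layered on tabular Q-learning: the agent receives an extra penalty whenever it enters a state flagged as unsafe, and the Q-update otherwise proceeds as usual. The grid world, the penalty value, and all hyperparameters are assumptions for illustration; this is not that paper's modified algorithm.

```python
# Generic tabular Q-learning with a reward-shaping penalty on unsafe states.
# Environment, penalty, and hyperparameters are illustrative assumptions.
import random
from collections import defaultdict

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right
GRID, GOAL, UNSAFE = 5, (4, 4), {(2, 2), (3, 1)}  # 5x5 grid with hazard cells


def step(state, action):
    """Apply an action, clip to the grid, and return (next_state, reward, done)."""
    r = min(max(state[0] + action[0], 0), GRID - 1)
    c = min(max(state[1] + action[1], 0), GRID - 1)
    nxt = (r, c)
    return (nxt, 1.0, True) if nxt == GOAL else (nxt, 0.0, False)


def shaped_reward(base, nxt):
    """Reward shaping: subtract a penalty for entering states flagged as unsafe."""
    return base - (1.0 if nxt in UNSAFE else 0.0)


def train(episodes=2000, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(50):                        # cap episode length
            if random.random() < eps:              # epsilon-greedy exploration
                a = random.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: Q[(state, i)])
            nxt, base, done = step(state, ACTIONS[a])
            reward = shaped_reward(base, nxt)
            best_next = max(Q[(nxt, i)] for i in range(len(ACTIONS)))
            # Standard Q-learning update applied to the shaped reward.
            Q[(state, a)] += alpha * (reward + gamma * best_next - Q[(state, a)])
            state = nxt
            if done:
                break
    return Q


if __name__ == "__main__":
    Q = train()
    print("Q-value of moving right from the start:", Q[((0, 0), 3)])
```

A flat penalty like this changes the objective the agent optimizes; potential-based shaping is the standard way to add shaping terms without altering the optimal policy.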
This list is automatically generated from the titles and abstracts of the papers on this site.