Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios
- URL: http://arxiv.org/abs/2505.17735v1
- Date: Fri, 23 May 2025 10:56:06 GMT
- Title: Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios
- Authors: Xueyang Zhou, Weidong Wang, Lin Lu, Jiawen Shi, Guiyao Tie, Yongtian Xu, Lixing Chen, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun,
- Abstract summary: Large Language Model (LLM)-based agents are increasingly deployed in real-world applications.<n>We propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation.<n>We show that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks.
- Score: 77.86600052899156
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM)-based agents are increasingly deployed in real-world applications such as "digital assistants, autonomous customer service, and decision-support systems", where their ability to "interact in multi-turn, tool-augmented environments" makes them indispensable. However, ensuring the safety of these agents remains a significant challenge due to the diverse and complex risks arising from dynamic user interactions, external tool usage, and the potential for unintended harmful behaviors. To address this critical issue, we propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. Concretely, 1) we introduce an open and extensible threat model, OTS, which formalizes how unsafe behaviors emerge from the interplay of user instructions, interaction contexts, and agent actions. This enables precise modeling of safety risks across diverse scenarios. 2) we develop a fully automated data generation pipeline that simulates unsafe user behaviors, applies self-reflective reasoning to generate safe responses, and constructs a large-scale, diverse, and high-quality safety training dataset-eliminating the need for hazardous real-world data collection. To evaluate the effectiveness of our framework, we design comprehensive experiments on both synthetic and real-world safety benchmarks. Results demonstrate that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks, validating the generalization ability of our learned safety strategies. These results highlight the practical advancement and scalability of AutoSafe in building safer LLM-based agents for real-world deployment. We have released the project page at https://auto-safe.github.io/.
Related papers
- Agent Safety Alignment via Reinforcement Learning [29.759393704688986]
We propose the first unified safety-alignment framework for tool-using agents.<n>We introduce a tri-modal taxonomy, including benign, malicious, and sensitive for both user prompts and tool responses.<n>Our results show that safety and effectiveness can be jointly optimized.
arXiv Detail & Related papers (2025-07-11T02:34:16Z) - OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories.<n>Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms.<n>It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z) - IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks [30.535665641990114]
We present IS-Bench, the first multi-modal benchmark designed for interactive safety.<n>It features 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator.<n>It facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps.
arXiv Detail & Related papers (2025-06-19T15:34:46Z) - SafeCast: Risk-Responsive Motion Forecasting for Autonomous Vehicles [12.607007386467329]
We present SafeCast, a risk-responsive motion forecasting model.<n>It integrates safety-aware decision-making with uncertainty-aware adaptability.<n>Our model achieves state-of-the-art (SOTA) accuracy while maintaining a lightweight architecture and low inference latency.
arXiv Detail & Related papers (2025-03-28T15:38:21Z) - Agent-SafetyBench: Evaluating the Safety of LLM Agents [72.92604341646691]
We introduce Agent-SafetyBench, a benchmark designed to evaluate the safety of large language models (LLMs)<n>Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions.<n>Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
arXiv Detail & Related papers (2024-12-19T02:35:15Z) - SafeDrive: Knowledge- and Data-Driven Risk-Sensitive Decision-Making for Autonomous Vehicles with Large Language Models [14.790308656087316]
SafeDrive is a knowledge- and data-driven risk-sensitive decision-making framework to enhance autonomous driving safety and adaptability.<n>By integrating knowledge-driven insights with adaptive learning mechanisms, the framework ensures robust decision-making under uncertain conditions.
arXiv Detail & Related papers (2024-12-17T16:45:27Z) - Don't Let Your Robot be Harmful: Responsible Robotic Manipulation [57.70648477564976]
Unthinking execution of human instructions in robotic manipulation can lead to severe safety risks.<n>We present Safety-as-policy, which includes (i) a world model to automatically generate scenarios containing safety risks and conduct virtual interactions, and (ii) a mental model to infer consequences with reflections.<n>We show that Safety-as-policy can avoid risks and efficiently complete tasks in both synthetic dataset and real-world experiments.
arXiv Detail & Related papers (2024-11-27T12:27:50Z) - MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control [20.796190000442053]
We introduce MobileSafetyBench, a benchmark designed to evaluate the safety of mobile device-control agents.<n>We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications.<n>Our experiments demonstrate that baseline agents, based on state-of-the-art LLMs, often fail to effectively prevent harm while performing the tasks.
arXiv Detail & Related papers (2024-10-23T02:51:43Z) - HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions [76.42274173122328]
We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions.
We run 1840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education)
Our experiments show that state-of-the-art LLMs, both proprietary and open-sourced, exhibit safety risks in over 50% cases.
arXiv Detail & Related papers (2024-09-24T19:47:21Z) - Safeguarding AI Agents: Developing and Analyzing Safety Architectures [0.0]
This paper addresses the need for safety measures in AI systems that collaborate with human teams.<n>We propose and evaluate three frameworks to enhance safety protocols in AI agent systems.<n>We conclude that these frameworks can significantly strengthen the safety and security of AI agent systems.
arXiv Detail & Related papers (2024-09-03T10:14:51Z) - EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [53.717918131568936]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction.<n>Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results.<n>However, the deployment of these agents in physical environments presents significant safety challenges.<n>This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.