Related papers: Safe Multi-Agent Reinforcement Learning via Shielding

Safe Multi-Agent Reinforcement Learning via Shielding

URL: http://arxiv.org/abs/2101.11196v2
Date: Tue, 2 Feb 2021 18:30:53 GMT
Title: Safe Multi-Agent Reinforcement Learning via Shielding
Authors: Ingy Elsayed-Aly, Suda Bharadwaj, Christopher Amato, R\"udiger Ehlers, Ufuk Topcu, Lu Feng
Abstract summary: Multi-agent reinforcement learning (MARL) has been increasingly used in a wide range of safety-critical applications. Current MARL methods do not have safety guarantees. We present two shielding approaches for safe MARL.
Score: 29.49529835154155
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-agent reinforcement learning (MARL) has been increasingly used in a wide range of safety-critical applications, which require guaranteed safety (e.g., no unsafe states are ever visited) during the learning process.Unfortunately, current MARL methods do not have safety guarantees. Therefore, we present two shielding approaches for safe MARL. In centralized shielding, we synthesize a single shield to monitor all agents' joint actions and correct any unsafe action if necessary. In factored shielding, we synthesize multiple shields based on a factorization of the joint state space observed by all agents; the set of shields monitors agents concurrently and each shield is only responsible for a subset of agents at each step.Experimental results show that both approaches can guarantee the safety of agents during learning without compromising the quality of learned policies; moreover, factored shielding is more scalable in the number of agents than centralized shielding.

Related papers

Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
Probabilistic Shielding for Safe Reinforcement Learning [51.35559820893218]
In real-life scenarios, a Reinforcement Learning (RL) agent must often also behave in a safe manner, including at training time. We present a new, scalable method, which enjoys strict formal guarantees for Safe RL. We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time.
arXiv Detail & Related papers (2025-03-09T17:54:33Z)
Agent-SafetyBench: Evaluating the Safety of LLM Agents [72.92604341646691]
We introduce Agent-SafetyBench, a comprehensive benchmark to evaluate the safety of large language models (LLMs) Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
arXiv Detail & Related papers (2024-12-19T02:35:15Z)
Compositional Shielding and Reinforcement Learning for Multi-Agent Systems [1.124958340749622]
Deep reinforcement learning has emerged as a powerful tool for obtaining high-performance policies. One promising paradigm to guarantee safety is a shield, which shields a policy from making unsafe actions. In this work, we propose a novel approach for multi-agent shielding.
arXiv Detail & Related papers (2024-10-14T12:52:48Z)
Realizable Continuous-Space Shields for Safe Reinforcement Learning [13.728961635717134]
Deep Reinforcement Learning (DRL) remains vulnerable to occasional catastrophic failures without additional safeguards. One effective solution is to use a shield that validates and adjusts the agent's actions to ensure compliance with a provided set of safety specifications. We propose the first shielding approach to automatically guarantee the realizability of safety requirements for continuous state and action spaces.
arXiv Detail & Related papers (2024-10-02T21:08:11Z)
Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large language models (LLMs) to defend threats from malicious instructions. Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue. We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z)
Verification-Guided Shielding for Deep Reinforcement Learning [4.418183967223081]
Deep Reinforcement Learning (DRL) has emerged as an effective approach to solving real-world tasks. Various methods have been put forth to address this issue by providing formal safety guarantees. We present verification-guided shielding -- a novel approach that bridges the DRL reliability gap by integrating these two methods.
arXiv Detail & Related papers (2024-06-10T17:44:59Z)
Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching [77.36097118561057]
textscSafePatching is a novel framework for comprehensive and efficient PSA. textscSafePatching achieves a more comprehensive and efficient PSA than baseline methods.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors [90.73444232283371]
ShieldLM is a safety detector for Large Language Models (LLMs) that aligns with common safety standards. We show that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability.
arXiv Detail & Related papers (2024-02-26T09:43:02Z)
Approximate Model-Based Shielding for Safe Reinforcement Learning [83.55437924143615]
We propose a principled look-ahead shielding algorithm for verifying the performance of learned RL policies. Our algorithm differs from other shielding approaches in that it does not require prior knowledge of the safety-relevant dynamics of the system. We demonstrate superior performance to other safety-aware approaches on a set of Atari games with state-dependent safety-labels.
arXiv Detail & Related papers (2023-07-27T15:19:45Z)
Approximate Shielding of Atari Agents for Safe Exploration [83.55437924143615]
We propose a principled algorithm for safe exploration based on the concept of shielding. We present preliminary results that show our approximate shielding algorithm effectively reduces the rate of safety violations.
arXiv Detail & Related papers (2023-04-21T16:19:54Z)
Model-based Dynamic Shielding for Safe and Efficient Multi-Agent Reinforcement Learning [7.103977648997475]
Multi-Agent Reinforcement Learning (MARL) discovers policies that maximize reward but do not have safety guarantees during the learning and deployment phases. Model-based Dynamic Shielding (MBDS) to support MARL algorithm design.
arXiv Detail & Related papers (2023-04-13T06:08:10Z)
Online Shielding for Reinforcement Learning [59.86192283565134]
We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent.
arXiv Detail & Related papers (2022-12-04T16:00:29Z)
Safe Reinforcement Learning via Shielding for POMDPs [29.058332307331785]
Reinforcement learning (RL) in safety-critical environments requires an agent to avoid decisions with catastrophic consequences. We propose and thoroughly evaluate a tight integration of formally-verified shields for POMDPs with state-of-the-art deep RL algorithms. We empirically demonstrate that an RL agent using a shield, beyond being safe, converges to higher values of expected reward.
arXiv Detail & Related papers (2022-04-02T03:51:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.