A Domain-Agnostic Scalable AI Safety Ensuring Framework
- URL: http://arxiv.org/abs/2504.20924v4
- Date: Fri, 23 May 2025 13:19:17 GMT
- Title: A Domain-Agnostic Scalable AI Safety Ensuring Framework
- Authors: Beomjun Kim, Kangyeon Kim, Sunwoo Kim, Heejin Ahn
- Abstract summary: We propose a novel framework that guarantees AI systems satisfy user-defined safety constraints with specified probabilities. Our approach combines any AI model with an optimization problem that ensures outputs meet safety requirements while maintaining performance. We prove our method guarantees probabilistic safety under mild conditions and establish the first scaling law in AI safety.
- Score: 8.086635708001166
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Ensuring the safety of AI systems has emerged as a critical priority as these systems are increasingly deployed in real-world applications. We propose a novel domain-agnostic framework that guarantees AI systems satisfy user-defined safety constraints with specified probabilities. Our approach combines any AI model with an optimization problem that ensures outputs meet safety requirements while maintaining performance. The key challenge is handling uncertain constraints -- those whose satisfaction cannot be deterministically evaluated (e.g., whether a chatbot response is "harmful"). We address this through three innovations: (1) a safety classification model that assesses constraint satisfaction probability, (2) internal test data to evaluate this classifier's reliability, and (3) conservative testing to prevent overfitting when this data is used in training. We prove our method guarantees probabilistic safety under mild conditions and establish the first scaling law in AI safety -- showing that the safety-performance trade-off improves predictably with more internal test data. Experiments across production planning, reinforcement learning, and language generation demonstrate our framework achieves up to 140 times better safety than existing methods at the same performance levels. This work enables AI systems to achieve both rigorous safety guarantees and high performance across diverse domains.
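To make the abstract's recipe concrete, here is a minimal, hypothetical sketch of the three ingredients it names. It is not the authors' implementation; every name (calibrate_margin, select_safe_output, the toy classifier, the target probability) is an illustrative assumption. A safety classifier estimates the probability that a candidate output satisfies an uncertain constraint, internal test data calibrates a conservative margin for that classifier, and a simple selection step keeps only candidates whose conservatively adjusted safety estimate meets the user-specified level while maximizing performance.

```python
import numpy as np

def calibrate_margin(classifier, internal_test_inputs, internal_test_labels):
    """Conservative margin from internal test data (illustrative assumption):
    take the worst observed amount by which the classifier over-estimates
    safety, so that downstream estimates err on the safe side."""
    preds = np.array([classifier(x) for x in internal_test_inputs])
    labels = np.array(internal_test_labels, dtype=float)  # 1 = constraint satisfied
    over_estimation = preds - labels
    return max(0.0, float(over_estimation.max()))

def select_safe_output(candidates, performance, classifier, margin, target_prob):
    """Keep only candidates whose conservative safety estimate meets the
    user-specified probability, then pick the best-performing one (sketch)."""
    best, best_score = None, float("-inf")
    for cand, score in zip(candidates, performance):
        conservative_safe_prob = classifier(cand) - margin
        if conservative_safe_prob >= target_prob and score > best_score:
            best, best_score = cand, score
    return best  # None means no candidate can be certified at this level

# Toy usage with made-up numbers: the "classifier" is a stand-in sigmoid.
toy_classifier = lambda x: 1.0 / (1.0 + np.exp(-x))
margin = calibrate_margin(toy_classifier, [2.0, -1.0, 3.0], [1, 0, 1])
choice = select_safe_output(
    candidates=[0.5, 2.5, 4.0],   # stand-ins for model outputs
    performance=[0.9, 0.7, 0.4],  # higher is better
    classifier=toy_classifier,
    margin=margin,
    target_prob=0.6,              # user-specified safety level
)
print(margin, choice)
```

The paper's probabilistic guarantee and scaling law rest on the formal treatment of these components; the snippet only mirrors their roles on toy data.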
Related papers
- Towards provable probabilistic safety for scalable embodied AI systems [79.31011047593492]
Embodied AI systems are increasingly prevalent across various applications. Ensuring their safety in complex operating environments remains a major challenge. This Perspective offers a pathway toward safer, large-scale adoption of embodied AI systems in safety-critical applications.
arXiv Detail & Related papers (2025-06-05T15:46:25Z) - Feasibility-Aware Pessimistic Estimation: Toward Long-Horizon Safety in Offline RL [14.767273209148545]
We propose a novel framework, Feasibility-Aware offline Safe Reinforcement Learning with CVAE-based Pessimism (FASP). We employ Hamilton-Jacobi (H-J) reachability analysis to generate reliable safety labels. We also utilize pessimistic estimation methods to estimate the Q-values of reward and cost.
arXiv Detail & Related papers (2025-05-13T02:32:49Z) - AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement [73.0700818105842]
We introduce AISafetyLab, a unified framework and toolkit that integrates representative attack, defense, and evaluation methodologies for AI safety.
AISafetyLab features an intuitive interface that enables developers to seamlessly apply various techniques.
We conduct empirical studies on Vicuna, analyzing different attack and defense strategies to provide valuable insights into their comparative effectiveness.
arXiv Detail & Related papers (2025-02-24T02:11:52Z) - A Physics-Informed Machine Learning Framework for Safe and Optimal Control of Autonomous Systems [8.347548017994178]
Safety and performance could be competing objectives, which makes their co-optimization difficult. We propose a state-constrained optimal control problem, where performance objectives are encoded via a cost function and safety requirements are imposed as state constraints. We demonstrate that the resultant value function satisfies a Hamilton-Jacobi-Bellman equation, which we approximate efficiently using a novel machine learning framework.
arXiv Detail & Related papers (2025-02-16T09:46:17Z) - Assessing confidence in frontier AI safety cases [37.839615078345886]
A safety case presents a structured argument in support of a top-level claim about a safety property of the system. This raises the question of what level of confidence should be associated with such a top-level claim. We propose a method by which AI developers can prioritise, and thereby make more efficient, their investigation of argument defeaters.
arXiv Detail & Related papers (2025-02-09T06:35:11Z) - Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems [88.80306881112313]
We will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI.
The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees.
We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest a number of potential solutions to them.
arXiv Detail & Related papers (2024-05-10T17:38:32Z) - Scaling #DNN-Verification Tools with Efficient Bound Propagation and Parallel Computing [57.49021927832259]
Deep Neural Networks (DNNs) are powerful tools that have shown extraordinary results in many scenarios.
However, their intricate designs and lack of transparency raise safety concerns when applied in real-world applications.
Formal Verification (FV) of DNNs has emerged as a valuable solution to provide provable guarantees on the safety aspect.
arXiv Detail & Related papers (2023-12-10T13:51:25Z) - ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and absolute error rates of up to 19% in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z) - Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL.
We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection.
To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
arXiv Detail & Related papers (2022-12-12T06:30:17Z) - Information-Theoretic Safe Exploration with Gaussian Processes [89.31922008981735]
We consider a sequential decision making task where we are not allowed to evaluate parameters that violate an unknown (safety) constraint.
Most current methods rely on a discretization of the domain and cannot be directly extended to the continuous case.
We propose an information-theoretic safe exploration criterion that directly exploits the GP posterior to identify the most informative safe parameters to evaluate.
arXiv Detail & Related papers (2022-12-09T15:23:58Z) - Meta-Learning Priors for Safe Bayesian Optimization [72.8349503901712]
We build on a meta-learning algorithm, F-PACOH, capable of providing reliable uncertainty quantification in settings of data scarcity.
As a core contribution, we develop a novel framework for choosing safety-compliant priors in a data-driven manner.
On benchmark functions and a high-precision motion system, we demonstrate that our meta-learned priors accelerate the convergence of safe BO approaches.
arXiv Detail & Related papers (2022-10-03T08:38:38Z) - Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions [60.26921219698514]
We introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers.
We then present the pointwise feasibility conditions of the resulting safety controller.
We use these conditions to devise an event-triggered online data collection strategy (a generic CBF-QP sketch appears after this list).
arXiv Detail & Related papers (2022-08-23T05:02:09Z) - Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing constraint violations in policy tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z) - Neural Bridge Sampling for Evaluating Safety-Critical Autonomous Systems [34.945482759378734]
We employ a probabilistic approach to safety evaluation in simulation, where we are concerned with computing the probability of dangerous events.
We develop a novel rare-event simulation method that combines exploration, exploitation, and optimization techniques to find failure modes and estimate their rate of occurrence.
arXiv Detail & Related papers (2020-08-24T17:46:27Z) - Efficient statistical validation with edge cases to evaluate Highly Automated Vehicles [6.198523595657983]
The wide-scale deployment of Autonomous Vehicles seems to be imminent despite many safety challenges that are yet to be resolved.
Existing standards focus on deterministic processes where the validation requires only a set of test cases that cover the requirements.
This paper presents a new approach to compute the statistical characteristics of a system's behaviour by biasing automatically generated test cases towards the worst case scenarios.
arXiv Detail & Related papers (2020-03-04T04:35:22Z)
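As background for the control-barrier-function (CBF) entry above (the one with the forward reference), here is a generic, textbook-style single-input CBF safety filter. It is a standard construction sketched under simplifying assumptions, not the model-uncertainty-aware reformulation or event-triggered strategy proposed in that paper, and all symbols are placeholders.

```python
def cbf_qp_filter(u_nom, h, lf_h, lg_h, alpha=1.0):
    """Generic single-input CBF safety filter (illustrative only).

    Solves  min_u (u - u_nom)^2  s.t.  lf_h + lg_h * u + alpha * h >= 0,
    i.e. it minimally modifies the nominal control so the barrier h(x)
    stays nonnegative. With one input the QP has a closed-form solution:
    if the nominal input violates the constraint, project onto its boundary."""
    constraint_at_nominal = lf_h + lg_h * u_nom + alpha * h
    if constraint_at_nominal >= 0 or lg_h == 0:
        return u_nom  # nominal input already safe (or u has no effect on h)
    return u_nom - constraint_at_nominal / lg_h  # boundary of the half-space

# Toy usage: scalar system x_dot = u with barrier h(x) = 1 - x (stay below x = 1).
x = 0.8
h = 1.0 - x              # barrier value at the current state
lf_h, lg_h = 0.0, -1.0   # dh/dx * f(x) and dh/dx * g(x) for f(x) = 0, g(x) = 1
u_safe = cbf_qp_filter(u_nom=2.0, h=h, lf_h=lf_h, lg_h=lg_h, alpha=1.0)
print(u_safe)  # 0.2: the aggressive nominal command is pulled back toward safety
```

With a single input the quadratic program reduces to a closed-form projection onto the constraint boundary, which keeps the example self-contained.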
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.