From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails
- URL: http://arxiv.org/abs/2510.13727v1
- Date: Wed, 15 Oct 2025 16:30:57 GMT
- Title: From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails
- Authors: Ravi Pandya, Madison Bland, Duy P. Nguyen, Changliu Liu, Jaime Fernández Fisac, Andrea Bajcsy
- Abstract summary: Most AI guardrails rely on output classification based on labeled datasets and human-specified criteria. We build predictive guardrails that monitor an AI system's outputs in real time and proactively correct risky outputs to safe ones. Our experiments in simulated driving and e-commerce settings demonstrate that control-theoretic guardrails can reliably steer agents clear of catastrophic outcomes.
- Score: 12.84192844049763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative AI systems are increasingly assisting and acting on behalf of end users in practical settings, from digital shopping assistants to next-generation autonomous cars. In this context, safety is no longer about blocking harmful content, but about preempting downstream hazards like financial or physical harm. Yet, most AI guardrails continue to rely on output classification based on labeled datasets and human-specified criteria, making them brittle to new hazardous situations. Even when unsafe conditions are flagged, this detection offers no path to recovery: typically, the AI system simply refuses to act--which is not always a safe choice. In this work, we argue that agentic AI safety is fundamentally a sequential decision problem: harmful outcomes arise from the AI system's continually evolving interactions and their downstream consequences on the world. We formalize this through the lens of safety-critical control theory, but within the AI model's latent representation of the world. This enables us to build predictive guardrails that (i) monitor an AI system's outputs (actions) in real time and (ii) proactively correct risky outputs to safe ones, all in a model-agnostic manner so the same guardrail can be wrapped around any AI model. We also offer a practical training recipe for computing such guardrails at scale via safety-critical reinforcement learning. Our experiments in simulated driving and e-commerce settings demonstrate that control-theoretic guardrails can reliably steer LLM agents clear of catastrophic outcomes (from collisions to bankruptcy) while preserving task performance, offering a principled, dynamic alternative to today's flag-and-block guardrails.
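The mechanism described in the abstract (a real-time monitor that corrects risky actions rather than simply refusing) corresponds to a least-restrictive safety filter from safety-critical control. The sketch below is a minimal, hypothetical illustration and not the paper's implementation: it assumes a learned safety critic `safety_critic(latent_state, action)` that returns a predicted safety margin (positive meaning recovery is still possible), a small set of fallback actions, and a plain threshold test; all names and the threshold value are assumptions.

```python
import numpy as np


class ControlTheoreticGuardrail:
    """Model-agnostic safety filter (illustrative sketch, not the paper's API).

    Monitors a task policy's proposed action and, when a learned safety
    critic predicts the action can lead to a catastrophic outcome, replaces
    it with the fallback action that has the best predicted safety margin.
    """

    def __init__(self, safety_critic, fallback_actions, threshold=0.0):
        # safety_critic: callable (latent_state, action) -> predicted safety
        # margin; assumed to be trained offline via safety-critical RL.
        self.safety_critic = safety_critic
        self.fallback_actions = fallback_actions
        self.threshold = threshold

    def filter(self, latent_state, proposed_action):
        """Return the proposed action if predicted safe; otherwise recover."""
        margin = self.safety_critic(latent_state, proposed_action)
        if margin > self.threshold:
            # Least-restrictive: leave the task policy alone when it is safe.
            return proposed_action
        # Recovery instead of refusal: pick the most-safe fallback action.
        margins = [self.safety_critic(latent_state, a) for a in self.fallback_actions]
        return self.fallback_actions[int(np.argmax(margins))]


# Illustrative usage: wrap any agent, regardless of its internals.
# guardrail = ControlTheoreticGuardrail(critic, fallback_actions)
# action = guardrail.filter(encode(observation), agent.act(observation))
```

Under this reading, the critic would be trained once in simulation with a safety-critical RL objective (for example, a reachability-style Bellman backup over rollouts), then frozen and wrapped around any deployed agent, matching the model-agnostic framing in the abstract.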
Related papers
- Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective [31.55000083809067]
We show how game-theoretic deterrence can make AI oversight proactive, risk-aware, and resilient to manipulation. We illustrate how this framework can inform (1) training-time auditing against data/feedback poisoning, (2) pre-deployment evaluation under constrained reviewer resources, and (3) robust multi-model deployment in adversarial environments.
arXiv Detail & Related papers (2026-02-06T23:20:26Z) - Can AI Perceive Physical Danger and Intervene? [16.825608691806988]
New safety challenges emerge when AI interacts with the physical world. How well do state-of-the-art foundation models understand common-sense facts about physical safety?
arXiv Detail & Related papers (2025-09-25T22:09:17Z) - ANNIE: Be Careful of Your Robots [48.89876809734855]
We present the first systematic study of adversarial safety attacks on embodied AI systems. We show attack success rates exceeding 50% across all safety categories. Results expose a previously underexplored but highly consequential attack surface in embodied AI systems.
arXiv Detail & Related papers (2025-09-03T15:00:28Z) - Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models [93.5740266114488]
Constructive Safety Alignment (CSA) protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. We release Oy1, code, and the benchmark to support responsible, user-centered AI.
arXiv Detail & Related papers (2025-09-02T03:04:27Z) - Learning to Drive Ethically: Embedding Moral Reasoning into Autonomous Driving [1.2891210250935148]
We present a hierarchical Safe Reinforcement Learning (Safe RL) framework that explicitly integrates moral considerations with standard driving objectives. At the decision level, a Safe RL agent is trained using a composite ethical risk cost, combining collision probability and harm severity, to generate high-level motion targets. At the execution level, path planning coupled with Proportional-Integral-Derivative (PID) controllers translates these targets into smooth, feasible trajectories.
arXiv Detail & Related papers (2025-08-19T14:24:02Z) - SafeAgent: Safeguarding LLM Agents via an Automated Risk Simulator [77.86600052899156]
Large Language Model (LLM)-based agents are increasingly deployed in real-world applications. We propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. We show that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks.
arXiv Detail & Related papers (2025-05-23T10:56:06Z) - SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models [63.71984266104757]
We propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. To explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic. Our Multimodal Retrieval-Augmented Generation model leverages video, control signals, and environmental attributes to learn from past driving experiences.
arXiv Detail & Related papers (2025-02-28T21:53:47Z) - Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? [37.13209023718946]
Unchecked AI agency poses significant risks to public safety and security. We discuss how these risks arise from current AI training methods. We propose a core building block to further advance the development of a non-agentic AI system.
arXiv Detail & Related papers (2025-02-21T18:28:36Z) - MORTAR: A Model-based Runtime Action Repair Framework for AI-enabled Cyber-Physical Systems [21.693552236958983]
Cyber-Physical Systems (CPSs) are increasingly prevalent across various industrial and daily-life domains.
With recent advancements in artificial intelligence (AI), learning-based components, especially AI controllers, have become essential in enhancing the functionality and efficiency of CPSs.
The lack of interpretability in these AI controllers presents challenges to the safety and quality assurance of AI-enabled CPSs (AI-CPSs).
arXiv Detail & Related papers (2024-08-07T16:44:53Z) - Work-in-Progress: Crash Course: Can (Under Attack) Autonomous Driving Beat Human Drivers? [60.51287814584477]
This paper evaluates the inherent risks in autonomous driving by examining the current landscape of AVs.
We develop specific claims highlighting the delicate balance between the advantages of AVs and potential security challenges in real-world scenarios.
arXiv Detail & Related papers (2024-05-14T09:42:21Z) - AdvSim: Generating Safety-Critical Scenarios for Self-Driving Vehicles [76.46575807165729]
We propose AdvSim, an adversarial framework to generate safety-critical scenarios for any LiDAR-based autonomy system.
By simulating directly from sensor data, we obtain adversarial scenarios that are safety-critical for the full autonomy stack.
arXiv Detail & Related papers (2021-01-16T23:23:12Z)