Related papers: Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

URL: http://arxiv.org/abs/2510.09781v1
Date: Fri, 10 Oct 2025 18:42:32 GMT
Title: Building a Foundational Guardrail for General Agentic Systems via Synthetic Data
Authors: Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, Liubov Nedoshivina, Pin-Yu Chen, Prasanna Sattigeri, Xiangliang Zhang,
Abstract summary: LLM agents can plan multi-step tasks, intervening at the planning stage-before any action is executed-is often the safest way to prevent harm.<n>Existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level.<n>We introduce AuraGen, a controllable engine that synthesizes benign trajectories, injects category-labeled risks with difficulty, and filters outputs via an automated reward model.
Score: 76.18834864749606
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: While LLM agents can plan multi-step tasks, intervening at the planning stage-before any action is executed-is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: data gap, model gap, and evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the guardian model gap, we propose a foundational guardrail Safiron, combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.

Related papers

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use [6.622648583261088]
Agentic language models must plan, call tools, and execute long-horizon actions where a single misstep can cause irreversible harm.<n>We introduce MOSAIC, a framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable.<n>We show that MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance.
arXiv Detail & Related papers (2026-03-03T17:59:35Z)
TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls.<n>Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy.<n>We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z)
Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models [62.16655896700062]
Activation steering is a technique to enhance the utility of Large Language Models (LLMs)<n>We show that it unintentionally introduces critical and under-explored safety risks.<n>Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks.
arXiv Detail & Related papers (2026-02-03T12:32:35Z)
Self-Guard: Defending Large Reasoning Models via enhanced self-reflection [54.775612141528164]
Self-Guard is a lightweight safety defense framework for Large Reasoning Models.<n>It bridges the awareness-compliance gap, achieving robust safety performance without compromising model utility.<n>Self-Guard exhibits strong generalization across diverse unseen risks and varying model scales.
arXiv Detail & Related papers (2026-01-31T13:06:11Z)
AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security [126.49733412191416]
Current guardrail models lack agentic risk awareness and transparency in risk diagnosis.<n>We propose a unified three-dimensional taxonomy that categorizes agentic risks by their source (where), failure mode (how), and consequence (what)<n>We introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG)
arXiv Detail & Related papers (2026-01-26T13:45:41Z)
SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment [43.86865924673546]
We propose SafeThinker, an adaptive framework that allocates defensive resources via a lightweight gateway classifier.<n> Experiments show that SafeThinker significantly lowers attack success rates across diverse jailbreak strategies without compromising robustness.
arXiv Detail & Related papers (2026-01-23T07:12:53Z)
Black-Box Guardrail Reverse-engineering Attack [12.937652779951156]
We present the first study of black-box LLM guardrail reverse-engineering attacks.<n>We propose Guardrail Reverse-engineering Attack (GRA), a reinforcement learning-based framework.<n>GRA achieves a rule matching rate exceeding 0.92 while requiring less than $85 in API costs.
arXiv Detail & Related papers (2025-11-06T09:24:49Z)
LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation [4.29885665563186]
LATENTGUARD is a framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering.<n>Our results show significant improvements in both safety controllability and response interpretability without compromising utility.
arXiv Detail & Related papers (2025-09-24T07:31:54Z)
Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift [23.0914017433021]
This work identifies a novel class of deployment phase attacks that exploit a vulnerability by injecting imperceptible perturbations directly into the embedding layer outputs without modifying model weights or input text.<n>We propose Search based Embedding Poisoning, a practical, model agnostic framework that introduces carefully optimized perturbations into embeddings associated with high risk tokens.
arXiv Detail & Related papers (2025-09-08T05:00:58Z)
Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling [74.41886258801209]
We propose a two-stage trajectory planning framework that decouples principle alignment from behavior learning.<n>Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance.
arXiv Detail & Related papers (2025-05-23T09:22:19Z)
Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples.<n>It achieves near-zero attack success rates while maintaining clean-task performance.
arXiv Detail & Related papers (2025-05-22T17:11:58Z)
Enhancing AI Safety Through the Fusion of Low Rank Adapters [7.384556630042846]
Low-Rank Adapter Fusion mitigates harmful responses when faced with malicious prompts.<n>We show a 42% reduction in the harmfulness rate by leveraging LoRA fusion between a task adapter and a safety adapter.<n>We also observe exaggerated safety behaviour, where the model rejects safe prompts that closely resemble unsafe ones.
arXiv Detail & Related papers (2024-12-30T13:12:27Z)
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. We design a synthetic data generation framework that captures salient aspects of an unsafe input. Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
BadGD: A unified data-centric framework to identify gradient descent vulnerabilities [10.996626204702189]
BadGD sets a new standard for understanding and mitigating adversarial manipulations. This research underscores the severe threats posed by such data-centric attacks and highlights the urgent need for robust defenses in machine learning.
arXiv Detail & Related papers (2024-05-24T23:39:45Z)
Safe MDP Planning by Learning Temporal Patterns of Undesirable Trajectories and Averting Negative Side Effects [27.41101006357176]
In safe MDP planning, a cost function based on the current state and action is often used to specify safety aspects. operating based on an incomplete model can often produce unintended negative side effects (NSEs)
arXiv Detail & Related papers (2023-04-06T14:03:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.