StepShield: When, Not Whether to Intervene on Rogue Agents
- URL: http://arxiv.org/abs/2601.22136v1
- Date: Thu, 29 Jan 2026 18:55:46 GMT
- Title: StepShield: When, Not Whether to Intervene on Rogue Agents
- Authors: Gloria Felicia, Michael Eniolade, Jinfeng He, Zitha Sasindran, Hemant Kumar, Milan Hussain Angati, Sandeep Bandarupalli,
- Abstract summary: Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents.
- Score: 1.472404880217315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only 26%, a 2.3x performance gap that is entirely invisible to standard accuracy metrics. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector reduces monitoring costs by 75% and projects to $108M in cumulative savings over five years at enterprise scale. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under an Apache 2.0 license.
Related papers
- AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows [0.0]
AgentAssay is the first token-efficient framework for regression testing non-deterministic AI agents. It achieves 78-100% cost reduction while maintaining rigorous statistical guarantees.
arXiv Detail & Related papers (2026-03-03T04:59:25Z)
- When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents [50.5814495434565]
This work makes the first effort to define and study misaligned action detection in computer-use agents (CUAs). We identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. We propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback.
arXiv Detail & Related papers (2026-02-09T18:41:15Z)
- OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence [0.0]
We introduce OpenSec, a dual-control reinforcement learning environment that evaluates defensive incident response agents. Unlike static capability benchmarks, OpenSec scores world-state-changing containment actions under adversarial evidence. We find consistent over-triggering in this setting: GPT-5.2, Gemini 3, and DeepSeek execute containment in 100% of episodes with 90-97% false positive rates.
arXiv Detail & Related papers (2026-01-28T22:12:54Z)
- An Effective and Cost-Efficient Agentic Framework for Ethereum Smart Contract Auditing [8.735899453872966]
Heimdallr is an automated auditing agent designed to overcome hurdles through four core innovations. It minimizes context overhead while preserving essential business logic. It then employs reasoning to detect complex vulnerabilities and automatically chain functional exploits.
arXiv Detail & Related papers (2026-01-25T13:28:37Z)
- Can We Predict Before Executing Machine Learning Agents? [74.39460101251792]
We formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report. We instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%.
arXiv Detail & Related papers (2026-01-09T16:44:17Z)
- Trajectory Guard -- A Lightweight, Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI [0.0]
Trajectory Guard is a Siamese Recurrent Autoencoder with a hybrid loss function that jointly learns task-trajectory alignment via contrastive learning and sequential validity via reconstruction. At 32 ms latency, our approach runs 17-27$\times$ faster than LLM Judge baselines, enabling real-time safety verification in production deployments.
arXiv Detail & Related papers (2026-01-02T00:27:11Z)
- AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications [71.27518152526686]
Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. LLMs can be manipulated by "adversarial instructions" hidden in input data, such as resumes or code, causing them to deviate from their intended task. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types.
arXiv Detail & Related papers (2025-12-23T08:42:09Z)
- SeBERTis: A Framework for Producing Classifiers of Security-Related Issue Reports [8.545800179148442]
SEBERTIS is a framework to train Deep Neural Networks (DNNs) as classifiers independent of lexical cues. Our framework achieves a 0.9880 F1-score in detecting security-related issues of a curated corpus of 10,000 GitHub issue reports.
arXiv Detail & Related papers (2025-12-17T01:23:11Z)
- SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents [52.20768003832476]
We analyze execution traces on $$-Bench (Airline/Retail) and SWE-Bench Verified. We formalize decisive deviations, the earliest action-level divergences that flip success to failure. We introduce cm, a model-agnostic, gradient-free, test-time safeguard.
arXiv Detail & Related papers (2025-11-26T01:28:22Z)
- LLM-Powered Detection of Price Manipulation in DeFi [12.59175486585742]
Decentralized Finance (DeFi) smart contracts manage billions of dollars, making them a prime target for exploits. Price manipulation vulnerabilities, often via flash loans, are a devastating class of attacks causing significant financial losses. We propose PMDetector, a hybrid framework combining static analysis with Large Language Model (LLM)-based reasoning.
arXiv Detail & Related papers (2025-10-24T09:13:30Z)
- Online Fair Division for Personalized $2$-Value Instances [51.278096593080456]
We study an online fair division setting, where goods arrive one at a time and there is a fixed set of $n$ agents. Once a good appears, the value each agent has for it is revealed and it must be allocated immediately and irrevocably to one of the agents. We show how to obtain worst case guarantees with respect to well-known fairness notions.
arXiv Detail & Related papers (2025-05-28T09:48:16Z)
- Defending against Indirect Prompt Injection by Instruction Detection [109.30156975159561]
InstructDetector is a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark.
arXiv Detail & Related papers (2025-05-08T13:04:45Z)
- Detection as Regression: Certified Object Detection by Median Smoothing [50.89591634725045]
This work is motivated by recent progress on certified classification by randomized smoothing.
We obtain the first model-agnostic, training-free, and certified defense for object detection against $\ell$-bounded attacks.
arXiv Detail & Related papers (2020-07-07T18:40:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.