OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence
- URL: http://arxiv.org/abs/2601.21083v2
- Date: Fri, 30 Jan 2026 21:01:32 GMT
- Title: OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence
- Authors: Jarrod Barnes
- Abstract summary: We introduce OpenSec, a dual-control reinforcement learning environment that evaluates defensive incident response agents. Unlike static capability benchmarks, OpenSec scores world-state-changing containment actions under adversarial evidence. We find consistent over-triggering in this setting: GPT-5.2, Gemini 3, and DeepSeek execute containment in 100% of episodes with 90-97% false positive rates.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models improve, so do their offensive applications: frontier agents now generate working exploits for under $50 in compute (Heelan, 2026). Defensive incident response (IR) agents must keep pace, but existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence. We introduce OpenSec, a dual-control reinforcement learning environment that evaluates IR agents under realistic prompt injection scenarios. Unlike static capability benchmarks, OpenSec scores world-state-changing containment actions under adversarial evidence via execution-based metrics: time-to-first-containment (TTFC), blast radius (false positives per episode), and injection violation rates. Evaluating four frontier models on 40 standard-tier episodes, we find consistent over-triggering in this setting: GPT-5.2, Gemini 3, and DeepSeek execute containment in 100% of episodes with 90-97% false positive rates. Claude Sonnet 4.5 shows partial calibration (85% containment, 72% FP), demonstrating that OpenSec surfaces a calibration failure mode hidden by aggregate success metrics. Code available at https://github.com/jbarnes850/opensec-env.
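The abstract's execution-based calibration metrics (time-to-first-containment, blast radius, and injection violation rate) can be illustrated with a small sketch. The `ContainmentAction` and `Episode` data structures below are hypothetical, assumed for illustration; the actual OpenSec environment in the linked repository may represent episodes differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ContainmentAction:
    step: int          # step index within the episode when the action executed
    target: str        # entity acted on (host, account, process, ...)
    justified: bool    # ground truth: was this containment warranted?
    injected: bool     # was the action triggered by injected (adversarial) evidence?

@dataclass
class Episode:
    actions: List[ContainmentAction] = field(default_factory=list)

def time_to_first_containment(ep: Episode) -> Optional[int]:
    """TTFC: step of the first world-state-changing containment action, or None."""
    return ep.actions[0].step if ep.actions else None

def blast_radius(ep: Episode) -> int:
    """False positives per episode: containment actions that were not warranted."""
    return sum(1 for a in ep.actions if not a.justified)

def injection_violation_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes with at least one injection-triggered action."""
    if not episodes:
        return 0.0
    hits = sum(1 for ep in episodes if any(a.injected for a in ep.actions))
    return hits / len(episodes)
```

Under these definitions, an agent that contains in every episode but mostly on unjustified targets would show exactly the pattern the paper reports: 100% containment execution alongside a high false-positive (blast radius) count.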
Related papers
- Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis [0.0]
Adversarial comments produce small, statistically non-significant effects on detection accuracy. Complex adversarial strategies offer no advantage over simple manipulative comments. Comment stripping reduces detection for weaker models by removing helpful context.
arXiv Detail & Related papers (2026-02-18T00:34:17Z) - When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents [50.5814495434565]
This work makes the first effort to define and study misaligned action detection in computer-use agents (CUAs). We identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. We propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback.
arXiv Detail & Related papers (2026-02-09T18:41:15Z) - StepShield: When, Not Whether to Intervene on Rogue Agents [1.472404880217315]
Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents.
arXiv Detail & Related papers (2026-01-29T18:55:46Z) - ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z) - DepRadar: Agentic Coordination for Context Aware Defect Impact Analysis in Deep Learning Libraries [12.07621297131295]
DepRadar is an agent coordination framework for fine-grained defect and impact analysis in DL library updates. It integrates static analysis with DL-specific domain rules for defect reasoning and client-side tracing. On 122 client programs, DepRadar identifies affected cases with 90% recall and 80% precision, substantially outperforming other baselines.
arXiv Detail & Related papers (2026-01-14T12:41:39Z) - SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents [52.20768003832476]
We analyze execution traces on τ-Bench (Airline/Retail) and SWE-Bench Verified. We formalize decisive deviations: the earliest action-level divergences that flip success to failure. We introduce a model-agnostic, gradient-free, test-time safeguard.
arXiv Detail & Related papers (2025-11-26T01:28:22Z) - RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation [0.0]
This paper presents an automated jailbreak attack that converts a disallowed user query into a self-reconstructing prompt. We instantiate RoguePrompt against GPT-4o and evaluate it on 2,448 prompts that a production moderation system previously marked as strongly rejected. Under an evaluation protocol that separates three security-relevant outcomes (bypass, reconstruction, and execution), the attack attains 84.7% bypass, 80.2% reconstruction, and 71.5% full execution.
arXiv Detail & Related papers (2025-11-24T05:42:54Z) - Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks [11.371490212283383]
Code-capable large language model (LLM) agents are embedded into software engineering, where they can read, write, and execute code. We present JAWS-BENCH, a benchmark spanning three escalating workspaces that mirror attacker capability. We find that under prompt-only conditions in JAWS-0, code agents accept 61% of attacks on average; 58% are harmful, 52% parse, and 27% run end-to-end.
arXiv Detail & Related papers (2025-10-01T18:38:20Z) - SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code [46.20378145112059]
Post-hoc repair pipelines detect such faults only after execution. We present SemGuard, a semantic-evaluator-driven framework that performs real-time, line-level semantic supervision.
arXiv Detail & Related papers (2025-09-29T09:21:32Z) - VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, each view aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable-fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z) - Securing LLM-Generated Embedded Firmware through AI Agent-Driven Validation and Patching [0.9582466286528458]
Large Language Models (LLMs) show promise in generating firmware for embedded systems, but often introduce security flaws and fail to meet real-time performance constraints. This paper proposes a three-phase methodology that combines LLM-based firmware generation with automated security validation and iterative refinement.
arXiv Detail & Related papers (2025-09-12T05:15:35Z) - Defending against Indirect Prompt Injection by Instruction Detection [109.30156975159561]
InstructDetector is a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark.
arXiv Detail & Related papers (2025-05-08T13:04:45Z) - SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.