Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation
- URL: http://arxiv.org/abs/2603.03116v1
- Date: Tue, 03 Mar 2026 15:47:41 GMT
- Title: Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation
- Authors: Hongliu Cao, Ilias Driouich, Eoin Thomas
- Abstract summary: Procedure-Aware Evaluation (PAE) is a framework that formalizes agent procedures as structured observations. We evaluate state-of-the-art Large Language Model (LLM)-based agents on tau-bench.
- Score: 2.102846336724103
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure-Aware Evaluation (PAE), a framework that formalizes agent procedures as structured observations and exposes consistency relationships between what agents observe, communicate, and execute. PAE evaluates agents along complementary axes (Utility, Efficiency, Interaction Quality, Procedural Integrity) and applies multi-dimensional gating that categorically disqualifies corrupt outcomes. Evaluating state-of-the-art LLM agents on tau-bench yields findings at the axis, compliance, and benchmark levels. At the axis level, the dimensions capture non-redundant failure modes: utility masks reliability gaps, speed does not imply precision, and conciseness does not predict intent adherence. At the procedural compliance level, 27-78% of benchmark-reported successes are corrupt successes that conceal violations across interaction and integrity. Furthermore, gating substantially collapses the Pass^4 rate and affects model rankings. The analysis of corrupt success cases reveals distinctive per-model failure signatures: GPT-5 spreads errors across policy, execution, and intent dimensions; Kimi-K2-Thinking concentrates 78% of violations in policy faithfulness and compliance; and Mistral-Large-3 is dominated by faithfulness failures. At the benchmark level, our analysis exposes structural flaws in the benchmark design, including task scope gaps, contradictory reward signals, and simulator artifacts that produce accidental successes.
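To make the gating idea concrete, here is a minimal sketch of how multi-dimensional gating and a gated Pass^k might be computed. The `Trial` fields, the gate names, and the subset-averaged Pass^k estimator are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Trial:
    task_completed: bool           # what completion-only benchmarks score
    policy_compliant: bool         # Procedural Integrity gate (assumed)
    faithful_communication: bool   # Interaction Quality gate (assumed)
    intent_adherent: bool          # did what the user actually asked


def gated_success(t: Trial) -> bool:
    """Success only if the task completed AND every procedural gate passes."""
    return (t.task_completed and t.policy_compliant
            and t.faithful_communication and t.intent_adherent)


def pass_hat_k(trials_per_task: list[list[Trial]], k: int) -> float:
    """Gated Pass^k: a task counts only when all k sampled trials succeed,
    averaged over every size-k subset of each task's recorded trials."""
    per_task = []
    for trials in trials_per_task:
        subsets = list(combinations(trials, k))
        wins = sum(all(gated_success(t) for t in sub) for sub in subsets)
        per_task.append(wins / len(subsets))
    return sum(per_task) / len(per_task)
```

Under such a gate, a run that completes the task through a policy violation is a corrupt success: it counts toward plain task completion but is categorically disqualified, which is how gating can collapse Pass^4 and reshuffle model rankings.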
Related papers
- RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models [5.733004743054914]
Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions. We present RFEval, a benchmark of 7,186 instances that probes faithfulness via controlled, output-level counterfactual interventions.
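A rough sketch of an output-level counterfactual intervention in this spirit: edit one load-bearing step of the stated rationale and check whether the answer conditioned on it moves. The `answer_given_rationale` callable and the single-probe check are assumptions for illustration, not RFEval's actual protocol.

```python
from typing import Callable


def counterfactual_probe(
    answer_given_rationale: Callable[[str, str], str],  # (question, rationale) -> answer
    question: str,
    rationale_steps: list[str],
    step_idx: int,
    counterfactual_step: str,  # edit chosen so a *used* rationale implies a new answer
) -> bool:
    """Return True if the model passes this faithfulness probe:
    intervening on a load-bearing step should change the answer."""
    original = answer_given_rationale(question, "\n".join(rationale_steps))
    edited = list(rationale_steps)
    edited[step_idx] = counterfactual_step
    intervened = answer_given_rationale(question, "\n".join(edited))
    return intervened != original  # unchanged answer => rationale likely decorative
```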
arXiv Detail & Related papers (2026-02-19T03:49:37Z)
- Towards a Science of AI Agent Reliability [9.570634569436535]
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. We propose twelve metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety.
arXiv Detail & Related papers (2026-02-18T18:05:44Z)
- FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight [21.731032636844237]
This paper proposes a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture. We validate it across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection.
arXiv Detail & Related papers (2026-02-11T18:48:11Z)
- Verified Critical Step Optimization for LLM Agents [67.05296684575445]
Critical Step Optimization (CSO) focuses preference learning on verified critical steps. The method starts from failed policy trajectories rather than expert demonstrations. Experiments on GAIA-Text-103 and XBench-DeepSearch show that CSO achieves 37% and 26% relative improvement over the SFT baseline.
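One plausible reading of "preference learning on verified critical steps", sketched below with invented helpers (`verify`, `propose_fix`): walk a failed trajectory to the first step a verifier rejects, repair just that step, and emit a (state, chosen, rejected) triple for DPO-style training.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    state: str   # observation / context at this step
    action: str  # what the failed policy actually did


def critical_step_pairs(
    failed_trajectory: list[Step],
    verify: Callable[[Step], bool],     # hypothetical step-level verifier
    propose_fix: Callable[[str], str],  # hypothetical repair model
) -> list[tuple[str, str, str]]:
    """(state, chosen, rejected) triples, taken only at the first
    verified-critical step rather than across the whole trajectory."""
    for step in failed_trajectory:
        if not verify(step):
            return [(step.state, propose_fix(step.state), step.action)]
    return []  # verifier found no critical step; no training signal
```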
arXiv Detail & Related papers (2026-02-03T11:41:02Z)
- AdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk Domains [3.721111684544962]
Hallucination in large language models (LLMs) contributes to the spread of misinformation and diminished public trust. We introduce AdversaRiskQA, the first verified and reliable benchmark systematically evaluating adversarial factuality. We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates.
arXiv Detail & Related papers (2026-01-21T22:47:59Z)
- Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
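The headline number here is a false positive rate over trajectories known to have failed: how often a judge, shown a manipulated chain-of-thought over the same environment evidence, declares success anyway. A trivial sketch of that measurement (the `judge` callable and verdict strings are assumptions):

```python
from typing import Callable


def judge_false_positive_rate(
    judge: Callable[[str], str],     # trajectory text -> "success" / "failure"
    failed_trajectories: list[str],  # ground-truth failures, CoT manipulated
) -> float:
    """Fraction of known failures the judge mislabels as successes."""
    verdicts = [judge(t) for t in failed_trajectories]
    return sum(v == "success" for v in verdicts) / len(verdicts)
```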
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
- VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension [51.76841625486355]
Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural-language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning. We introduce VIRO, a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps.
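Operator-level verification can be pictured as interleaving a cheap check after every symbolic step, with a repair path when a check fails. The pipeline below is a generic sketch under that assumption, not VIRO's actual operators:

```python
from typing import Any, Callable


def run_verified_program(
    operators: list[Callable[[Any], Any]],   # e.g. filter, relate, locate
    verifiers: list[Callable[[Any], bool]],  # lightweight per-operator checks
    state: Any,
    repair: Callable[[Any], Any],            # fallback when a check fails
) -> Any:
    """Execute a compositional program step by step, verifying each
    intermediate result instead of only checking the final answer."""
    for op, ok in zip(operators, verifiers):
        candidate = op(state)
        state = candidate if ok(candidate) else repair(state)
    return state
```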
arXiv Detail & Related papers (2026-01-19T07:21:19Z)
- Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with an Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7x compared to strong prompting and uncertainty-based baselines.
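The EVPI objective has a standard decision-theoretic form: the expected utility of acting after the uncertainty is resolved, minus the best expected utility of acting now; the agent asks a clarifying question only when that gap exceeds the question's cost. The belief and utility table below are invented purely to make the arithmetic concrete:

```python
def evpi(belief: dict[str, float],
         utility: dict[tuple[str, str], float],
         actions: list[str]) -> float:
    """EVPI = E_s[max_a U(a, s)] - max_a E_s[U(a, s)]."""
    v_perfect = sum(p * max(utility[(a, s)] for a in actions)
                    for s, p in belief.items())
    v_act_now = max(sum(p * utility[(a, s)] for s, p in belief.items())
                    for a in actions)
    return v_perfect - v_act_now


# Ambiguous tool argument: which date did the user mean?
belief = {"2026-03-01": 0.6, "2026-03-10": 0.4}
actions = ["book_2026-03-01", "book_2026-03-10"]
utility = {(a, s): 1.0 if a.endswith(s) else -2.0
           for a in actions for s in belief}
print(evpi(belief, utility, actions))  # 1.2 -> ask iff question cost < 1.2
```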
arXiv Detail & Related papers (2025-11-11T21:50:44Z)
- Making LLMs Reliable When It Matters Most: A Five-Layer Architecture for High-Stakes Decisions [51.56484100374058]
Current large language models (LLMs) excel in verifiable domains where outputs can be checked before action but prove less reliable for high-stakes strategic decisions with uncertain outcomes. This gap, driven by cognitive biases in both humans and artificial intelligence (AI) systems, threatens the defensibility of valuations and sustainability of investments in the sector. This report describes a framework emerging from systematic qualitative assessment across 7 frontier-grade LLMs and 3 market-facing venture vignettes under time pressure.
arXiv Detail & Related papers (2025-11-10T22:24:21Z)
- Preemptive Detection and Correction of Misaligned Actions in LLM Agents [58.39520480675366]
InferAct is a novel approach to detect misaligned actions before execution. It alerts users for timely correction, preventing adverse outcomes. InferAct achieves up to 20% improvements on Macro-F1 against baselines in misaligned action detection.
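The preemptive pattern is simple to sketch: before committing an irreversible tool call, have a critic infer what task the pending actions actually accomplish and compare that against the user's instruction, escalating on a mismatch. The `critic` callable and prompt wording are assumptions, not InferAct's actual method:

```python
from typing import Callable


def preemptive_alignment_check(
    critic: Callable[[str], str],  # any LLM call returning text
    instruction: str,
    pending_actions: list[str],
) -> bool:
    """True = proceed; False = pause and alert the user before execution."""
    prompt = (
        f"User request: {instruction}\n"
        f"The agent is about to execute: {'; '.join(pending_actions)}\n"
        "Infer the task these actions accomplish, then reply with exactly "
        "one word, ALIGNED or MISALIGNED, relative to the user request."
    )
    verdict = critic(prompt).strip().upper()
    return verdict.startswith("ALIGNED")
```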
arXiv Detail & Related papers (2024-07-16T15:24:44Z)