ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions
- URL: http://arxiv.org/abs/2601.06112v1
- Date: Sat, 03 Jan 2026 13:41:33 GMT
- Title: ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions
- Authors: Aayush Gupta
- Abstract summary: Existing benchmarks for tool-using LLM agents primarily report single-run success rates and miss reliability properties required in production. We introduce \textbf{ReliabilityBench}, a benchmark for evaluating agent reliability across three dimensions. We evaluate two models (Gemini 2.0 Flash, GPT-4o) and two agent architectures (ReAct, Reflexion) across four domains (scheduling, travel, customer support, e-commerce) over 1,280 episodes.
- Score: 0.32928123659012326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing benchmarks for tool-using LLM agents primarily report single-run success rates and miss reliability properties required in production. We introduce \textbf{ReliabilityBench}, a benchmark for evaluating agent reliability across three dimensions: (i) consistency under repeated execution using $\mathrm{pass}^k$, (ii) robustness to semantically equivalent task perturbations at intensity $ε$, and (iii) fault tolerance under controlled tool/API failures at intensity $λ$. ReliabilityBench contributes a unified reliability surface $R(k,ε,λ)$, \textit{action metamorphic relations} that define correctness via end-state equivalence rather than text similarity, and a chaos-engineering-style fault injection framework (timeouts, rate limits, partial responses, schema drift). We evaluate two models (Gemini 2.0 Flash, GPT-4o) and two agent architectures (ReAct, Reflexion) across four domains (scheduling, travel, customer support, e-commerce) over 1,280 episodes. Perturbations alone reduce success from 96.9% at $ε=0$ to 88.1% at $ε=0.2$. Rate limiting is the most damaging fault in ablations. ReAct is more robust than Reflexion under combined stress, and Gemini 2.0 Flash achieves comparable reliability to GPT-4o at much lower cost. ReliabilityBench provides a systematic framework for assessing production readiness of LLM agents.
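The three axes above are concrete enough to sketch in code. Below is a minimal, hypothetical Python sketch of the evaluation loop: an unbiased per-task $\mathrm{pass}^k$ estimator and a sweep over the $(k, ε, λ)$ stress grid. The agent stand-in, fault model, and all constants are illustrative assumptions, not the paper's implementation.

```python
"""Minimal sketch of a ReliabilityBench-style evaluation loop.

Hypothetical throughout: the agent stand-in, fault model, constants,
and helper names are illustrative assumptions, not the paper's code.
"""
import random
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass^k for one task: probability that k runs
    drawn without replacement from n recorded runs (c successes) all succeed."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)


def run_episode(eps: float, lam: float) -> bool:
    """Toy stand-in for one agent episode under perturbation intensity eps
    and fault-injection intensity lam; success probability degrades with
    both stressors (illustrative numbers only)."""
    p = 0.97 * (1 - 0.45 * eps) * (1 - 0.60 * lam)
    return random.random() < p


def reliability_surface(num_tasks=20, runs=16, ks=(1, 2, 4),
                        eps_grid=(0.0, 0.1, 0.2), lam_grid=(0.0, 0.1, 0.2)):
    """Estimate R(k, eps, lam) by repeated execution in each stress cell,
    averaging the per-task pass^k estimates."""
    surface = {}
    for eps in eps_grid:
        for lam in lam_grid:
            per_task = [sum(run_episode(eps, lam) for _ in range(runs))
                        for _ in range(num_tasks)]
            for k in ks:
                surface[(k, eps, lam)] = sum(
                    pass_hat_k(runs, c, k) for c in per_task) / num_tasks
    return surface


if __name__ == "__main__":
    random.seed(0)
    R = reliability_surface()
    print(f"R(1, 0.0, 0.0) ≈ {R[(1, 0.0, 0.0)]:.3f}")  # ~0.97 unstressed
    print(f"R(4, 0.2, 0.2) ≈ {R[(4, 0.2, 0.2)]:.3f}")  # degrades under stress
```

The $\binom{c}{k}/\binom{n}{k}$ form avoids the upward bias of plugging the empirical success rate into $p^k$ when the number of recorded runs per task is small.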
Related papers
- Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol [69.11739400975445]
We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents. We show that cumulative distortion exhibits linear growth and high-probability deviations bounded by $O(\sqrt{T})$. Key findings include: semantic weighting reduces distortion by 80%, and periodic re-grounding approximately every 9 steps suffices for error control.
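As a generic illustration of that $O(\sqrt{T})$ scaling (a toy random walk, not the paper's MCP model): a bounded-increment process with linear drift fluctuates around its trend at the $\sqrt{T}$ scale, the usual Azuma-Hoeffding behavior.

```python
# Generic illustration (not the paper's model): deviations of a
# bounded-increment drifting walk from its linear trend grow like sqrt(T),
# the Azuma-Hoeffding scale behind O(sqrt(T)) high-probability bounds.
import math
import random


def max_deviation(T: int, drift: float = 0.1, bound: float = 1.0) -> float:
    s, dev = 0.0, 0.0
    for t in range(1, T + 1):
        s += drift + random.uniform(-bound, bound)  # per-step distortion
        dev = max(dev, abs(s - drift * t))          # distance from linear trend
    return dev


random.seed(1)
for T in (100, 400, 1600):
    avg = sum(max_deviation(T) for _ in range(200)) / 200
    print(f"T={T:5d}  avg max deviation ≈ {avg:6.2f}  /sqrt(T) ≈ {avg / math.sqrt(T):.2f}")
```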
arXiv Detail & Related papers (2026-02-10T21:08:53Z)
- Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents [0.7699235580548228]
LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged-transaction decision with identical inputs, most deployments fail to return consistent results. This paper introduces the Determinism-Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism and evidence-conditioned faithfulness in tool-using agents deployed in financial services.
arXiv Detail & Related papers (2026-01-17T19:47:55Z)
- The 4/$δ$ Bound: Designing Predictable LLM-Verifier Systems for Formal Method Guarantee [5.345468714252351]
This work bridges the gap by developing an LLM-Verifier Convergence Theorem. We model the interaction between the LLM and the verifier as a discrete-time Markov Chain. We stress-tested this prediction in an extensive empirical campaign comprising more than 90,000 trials.
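A toy reading of the bound, under the assumption (ours, not necessarily the paper's) that each repair round passes verification independently with probability $δ$: rounds-to-acceptance are then geometric with mean $1/δ$, so a $4/δ$ iteration budget succeeds with probability $1-(1-δ)^{4/δ}$, roughly $1-e^{-4} \approx 0.98$.

```python
# Toy LLM-verifier loop (assumption: i.i.d. per-round pass probability delta;
# the paper's Markov chain may be richer). Checks the 4/delta budget empirically.
import random


def rounds_until_verified(delta: float) -> int:
    rounds = 1
    while random.random() >= delta:  # verifier rejects, LLM retries
        rounds += 1
    return rounds


random.seed(0)
delta = 0.2
trials = [rounds_until_verified(delta) for _ in range(100_000)]
mean_rounds = sum(trials) / len(trials)
within_budget = sum(t <= 4 / delta for t in trials) / len(trials)
print(f"mean rounds ≈ {mean_rounds:.2f} (1/delta = {1 / delta:.1f})")
print(f"P(rounds ≤ 4/delta) ≈ {within_budget:.3f}")  # ≈ 1 - (1-δ)^(4/δ) ≈ 0.988
```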
arXiv Detail & Related papers (2025-11-30T22:19:09Z)
- SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents [52.20768003832476]
We analyze execution traces on $\tau$-Bench (Airline/Retail) and SWE-Bench Verified. We formalize \emph{decisive deviations}: the earliest action-level divergences that flip success to failure. We introduce a model-agnostic, gradient-free, test-time safeguard.
arXiv Detail & Related papers (2025-11-26T01:28:22Z)
- Alpha Divergence Losses for Biometric Verification [19.758259380263528]
We show that $\alpha$-divergence loss functions offer a compelling alternative to margin-based softmax losses. We derive two novel margin-based $\alpha$-divergence losses: Q-Margin and A3M. Our models significantly outperform strong baselines at low false acceptance rates.
arXiv Detail & Related papers (2025-11-17T17:27:28Z)
- Beyond Prompt Engineering: Neuro-Symbolic-Causal Architecture for Robust Multi-Objective AI Agents [0.0]
Large language models show promise as autonomous decision-making agents, yet their deployment in high-stakes domains remains fraught with risk. We present Chimera, a neuro-symbolic-causal architecture that integrates an LLM strategist, a formally verified symbolic constraint engine, and a causal inference module for counterfactual reasoning.
arXiv Detail & Related papers (2025-10-27T15:25:35Z)
- AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI [5.165179548592513]
AgentChangeBench is a benchmark designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts. Our framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation.
arXiv Detail & Related papers (2025-10-20T23:48:07Z)
- MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics [72.00014675808228]
Instability in the evaluation process of Large Language Models obscures true learning dynamics. We introduce \textbf{MaP}, a framework that integrates \underline{M}erging \underline{a}nd the \underline{P}ass@k metric. Experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent rankings.
arXiv Detail & Related papers (2025-10-10T11:40:27Z)
- SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models [66.71948519280669]
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial prompts. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. We propose SaFeR-VLM, which integrates four components and supports dynamic and interpretable safety decisions beyond surface-level filtering.
arXiv Detail & Related papers (2025-10-08T10:39:12Z)
- MCTS-Refined CoT: High-Quality Fine-Tuning Data for LLM-Based Repository Issue Resolution [18.314436803012434]
The paper proposes MCTS-INE, an enhanced Monte Carlo Tree Search (MCTS)-based algorithm that dynamically validates and optimizes intermediate reasoning steps. Experiments on SWE-bench Lite and SWE-bench Verified demonstrate that LLMs fine-tuned with our CoT dataset achieve substantial improvements over baselines.
arXiv Detail & Related papers (2025-06-15T05:42:01Z)
- Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments [54.67512489842682]
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. We take a first step toward exploring the early-exit behavior of LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z)
- Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [71.7892165868749]
Commercial Large Language Model (LLM) APIs create a fundamental trust problem. Users pay for specific models but have no guarantee that providers deliver them faithfully. We formalize this model substitution problem and evaluate detection methods under realistic adversarial conditions. We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution.
arXiv Detail & Related papers (2025-04-07T03:57:41Z)
- On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents [58.79302663733703]
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents. Yet the impact of clumsy or even malicious agents (those that frequently make errors in their tasks) on the overall performance of the system remains underexplored. This paper investigates the resilience of various system structures under faulty agents on different downstream tasks.
arXiv Detail & Related papers (2024-08-02T03:25:20Z)