When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents
- URL: http://arxiv.org/abs/2602.11619v1
- Date: Thu, 12 Feb 2026 06:15:14 GMT
- Title: When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents
- Authors: Aman Mehta
- Abstract summary: ReAct-style agents produce 2.0--4.2 distinct action sequences per 10 runs on average, even with identical inputs. Tasks with consistent behavior achieve 80--92% accuracy, while highly inconsistent tasks achieve only 25--60%. Our results suggest that monitoring behavioral consistency during execution could enable early error detection and improve agent reliability.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Run the same LLM agent on the same task twice: do you get the same behavior? We find the answer is often no. In a study of 3,000 agent runs across three models (Llama 3.1 70B, GPT-4o, and Claude Sonnet 4.5) on HotpotQA, we observe that ReAct-style agents produce 2.0--4.2 distinct action sequences per 10 runs on average, even with identical inputs. More importantly, this variance predicts failure: tasks with consistent behavior ($\leq$2 unique paths) achieve 80--92% accuracy, while highly inconsistent tasks ($\geq$6 unique paths) achieve only 25--60%, a 32--55 percentage point gap depending on model. We trace variance to early decisions: 69% of divergence occurs at step 2, the first search query. Our results suggest that monitoring behavioral consistency during execution could enable early error detection and improve agent reliability.
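The abstract describes two simple measurements: how many distinct action sequences a task produces across repeated runs, and the step at which runs first diverge. The sketch below shows one way to compute both. It is not the authors' code; the `Run` type, function names, and example traces are illustrative assumptions.

```python
from collections import Counter
from typing import List, Optional, Tuple

# One "run" is the sequence of actions a ReAct-style agent took on a task,
# e.g. ("think", "search[...]", "finish[...]"). Action strings are illustrative.
Run = Tuple[str, ...]

def unique_paths(runs: List[Run]) -> int:
    """Number of distinct action sequences among repeated runs of one task."""
    return len(set(runs))

def first_divergence_step(runs: List[Run]) -> Optional[int]:
    """1-indexed step at which the runs first disagree; None if all identical."""
    for step in range(max(len(r) for r in runs)):
        actions_at_step = {r[step] if step < len(r) else None for r in runs}
        if len(actions_at_step) > 1:
            return step + 1
    return None

# Hypothetical example: 10 runs of one HotpotQA task yielding 3 distinct
# paths that diverge at step 2, the first search query -- the step the
# abstract identifies as the most common divergence point.
runs: List[Run] = (
    [("think", "search[director of film X]", "finish[A]")] * 6
    + [("think", "search[film X]", "lookup[director]", "finish[A]")] * 3
    + [("think", "search[film X cast]", "finish[B]")]
)

print(unique_paths(runs))                   # 3 distinct action sequences
print(first_divergence_step(runs))          # 2 (the first search query)
print(Counter(runs).most_common(1)[0][1])   # modal path covers 6 of 10 runs
```

Averaging `unique_paths` over many tasks would give a per-10-runs statistic like the abstract's 2.0--4.2 figure, and binning tasks by that count ($\leq$2 vs. $\geq$6 unique paths) reproduces the consistency-versus-accuracy comparison.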
Related papers
- AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows
AgentAssay is the first token-efficient framework for regression testing non-deterministic AI agents. It achieves 78-100% cost reduction while maintaining rigorous statistical guarantees.
arXiv Detail & Related papers (2026-03-03T04:59:25Z)
- Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
We argue that many reliability failures are caused by drift from a task's latent solution structure, not capability failures. We establish this causally using a natural experiment that holds model capability and task difficulty fixed by construction.
arXiv Detail & Related papers (2026-02-22T02:37:57Z)
- Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance
This paper compares five popular AI-powered coding assistants (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code). Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks). Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates.
arXiv Detail & Related papers (2026-02-09T17:14:46Z)
- On Randomness in Agentic Evals
Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected (a toy sketch after this list illustrates the effect).
arXiv Detail & Related papers (2026-02-06T19:49:13Z)
- Towards a Science of Scaling Agent Systems
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We derive a cross-validated predictive model from coordination metrics, enabling prediction on unseen task domains. We identify three effects, including (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
- SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents
We analyze execution traces on $\tau$-Bench (Airline/Retail) and SWE-Bench Verified. We formalize decisive deviations: the earliest action-level divergences that flip success to failure. We introduce SABER, a model-agnostic, gradient-free, test-time safeguard.
arXiv Detail & Related papers (2025-11-26T01:28:22Z)
- Multi-Agent Code Verification with Compound Vulnerability Detection
Existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs.
arXiv Detail & Related papers (2025-11-20T03:40:27Z)
- AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress
Large language models (LLMs) still encounter challenges in multi-turn decision-making tasks. We build process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
arXiv Detail & Related papers (2025-11-11T14:57:54Z)
- Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems
Failure attribution in multi-agent systems is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs. A2P Scaffolding transforms failure attribution from pattern recognition into a structured causal inference task.
arXiv Detail & Related papers (2025-09-12T16:51:15Z)
- Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. We take a first step toward exploring the early-exit behavior of LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z)
- Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive. We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z)
- MALT: Improving Reasoning with Multi-Agent LLM Training
MALT (Multi-Agent LLM Training) is a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps. On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with relative improvements of 15.66%, 7.42%, and 9.40% respectively.
arXiv Detail & Related papers (2024-12-02T19:30:36Z)
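As flagged in the "On Randomness in Agentic Evals" entry above, single-run pass@1 estimates can vary by 2.2 to 6.0 percentage points depending on which run is scored. The toy sketch below shows how to measure that spread; it runs on synthetic outcomes with an assumed 0.55 success rate, not on that paper's data.

```python
import random

random.seed(0)

# Synthetic outcome matrix: results[t][r] is True iff run r of task t succeeded.
# 100 tasks x 5 runs per task; the 0.55 success rate is an arbitrary assumption.
n_tasks, n_runs = 100, 5
results = [[random.random() < 0.55 for _ in range(n_runs)]
           for _ in range(n_tasks)]

def pass_at_1(results, run_index):
    """pass@1 when every task is scored from the same single run index."""
    return sum(task[run_index] for task in results) / len(results)

# The same benchmark yields a different pass@1 depending on which run is kept.
estimates = [pass_at_1(results, r) for r in range(n_runs)]
print([round(e, 2) for e in estimates])
print(f"spread: {(max(estimates) - min(estimates)) * 100:.1f} percentage points")
```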