When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents
- URL: http://arxiv.org/abs/2602.11619v1
- Date: Thu, 12 Feb 2026 06:15:14 GMT
- Title: When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents
- Authors: Aman Mehta
- Abstract summary: ReAct-style agents produce 2.0--4.2 distinct action sequences per 10 runs on average, even with identical inputs. Tasks with consistent behavior achieve 80--92% accuracy, while highly inconsistent tasks achieve only 25--60%. Our results suggest that monitoring behavioral consistency during execution could enable early error detection and improve agent reliability.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Run the same LLM agent on the same task twice: do you get the same behavior? We find the answer is often no. In a study of 3,000 agent runs across three models (Llama 3.1 70B, GPT-4o, and Claude Sonnet 4.5) on HotpotQA, we observe that ReAct-style agents produce 2.0--4.2 distinct action sequences per 10 runs on average, even with identical inputs. More importantly, this variance predicts failure: tasks with consistent behavior ($\leq$2 unique paths) achieve 80--92% accuracy, while highly inconsistent tasks ($\geq$6 unique paths) achieve only 25--60%, a 32--55 percentage point gap depending on model. We trace variance to early decisions: 69% of divergence occurs at step 2, the first search query. Our results suggest that monitoring behavioral consistency during execution could enable early error detection and improve agent reliability.
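The abstract describes two simple measurements: how many distinct action sequences a task produces across repeated runs, and the step at which runs first diverge. The sketch below shows one way to compute both. It is not the authors' code; the `Run` type, function names, and example traces are illustrative assumptions.

```python
from collections import Counter
from typing import List, Optional, Tuple

# One "run" is the sequence of actions a ReAct-style agent took on a task,
# e.g. ("think", "search[...]", "finish[...]"). Action strings are illustrative.
Run = Tuple[str, ...]

def unique_paths(runs: List[Run]) -> int:
    """Number of distinct action sequences among repeated runs of one task."""
    return len(set(runs))

def first_divergence_step(runs: List[Run]) -> Optional[int]:
    """1-indexed step at which the runs first disagree; None if all identical."""
    for step in range(max(len(r) for r in runs)):
        actions_at_step = {r[step] if step < len(r) else None for r in runs}
        if len(actions_at_step) > 1:
            return step + 1
    return None

# Hypothetical example: 10 runs of one HotpotQA task yielding 3 distinct
# paths that diverge at step 2, the first search query -- the step the
# abstract identifies as the most common divergence point.
runs: List[Run] = (
    [("think", "search[director of film X]", "finish[A]")] * 6
    + [("think", "search[film X]", "lookup[director]", "finish[A]")] * 3
    + [("think", "search[film X cast]", "finish[B]")]
)

print(unique_paths(runs))                   # 3 distinct action sequences
print(first_divergence_step(runs))          # 2 (the first search query)
print(Counter(runs).most_common(1)[0][1])   # modal path covers 6 of 10 runs
```

Averaging `unique_paths` over many tasks would give a per-10-runs statistic like the abstract's 2.0--4.2 figure, and binning tasks by that count ($\leq$2 vs. $\geq$6 unique paths) reproduces the consistency-versus-accuracy comparison.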
Related papers
- AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows
AgentAssay is the first token-efficient framework for regression testing non-deterministic AI agents. It achieves 78-100% cost reduction while maintaining rigorous statistical guarantees.
arXiv Detail & Related papers (2026-03-03T04:59:25Z)
- Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
We argue that many reliability failures are caused by drift from a task's latent solution structure, not capability failures. We establish this causally using a natural experiment that holds model capability and task difficulty fixed by construction.
arXiv Detail & Related papers (2026-02-22T02:37:57Z)
- Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance
This paper compares five popular AI-powered coding assistants (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code). Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks). Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates.
arXiv Detail & Related papers (2026-02-09T17:14:46Z)
- On Randomness in Agentic Evals
Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected (a toy sketch after this list illustrates the effect).
arXiv Detail & Related papers (2026-02-06T19:49:13Z)
- Towards a Science of Scaling Agent Systems
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We derive a cross-validated predictive model from coordination metrics, enabling prediction on unseen task domains. We identify three effects, including (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
- SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents
We analyze execution traces on $\tau$-Bench (Airline/Retail) and SWE-Bench Verified. We formalize decisive deviations: the earliest action-level divergences that flip success to failure. We introduce SABER, a model-agnostic, gradient-free, test-time safeguard.
arXiv Detail & Related papers (2025-11-26T01:28:22Z)
- Multi-Agent Code Verification with Compound Vulnerability Detection
Existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs.
arXiv Detail & Related papers (2025-11-20T03:40:27Z)
- AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress
Large language models (LLMs) still encounter challenges in multi-turn decision-making tasks. We build process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
arXiv Detail & Related papers (2025-11-11T14:57:54Z)
- Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems
Failure attribution in multi-agent systems is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs. A2P Scaffolding transforms failure attribution from pattern recognition into a structured causal inference task.
arXiv Detail & Related papers (2025-09-12T16:51:15Z)
- Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. We take a first step toward exploring the early-exit behavior of LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z)
- Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive. We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z)
- MALT: Improving Reasoning with Multi-Agent LLM Training
MALT (Multi-Agent LLM Training) is a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps. On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with relative improvements of 15.66%, 7.42%, and 9.40% respectively.
arXiv Detail & Related papers (2024-12-02T19:30:36Z)
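As flagged in the "On Randomness in Agentic Evals" entry above, single-run pass@1 estimates can vary by 2.2 to 6.0 percentage points depending on which run is scored. The toy sketch below shows how to measure that spread; it runs on synthetic outcomes with an assumed 0.55 success rate, not on that paper's data.

```python
import random

random.seed(0)

# Synthetic outcome matrix: results[t][r] is True iff run r of task t succeeded.
# 100 tasks x 5 runs per task; the 0.55 success rate is an arbitrary assumption.
n_tasks, n_runs = 100, 5
results = [[random.random() < 0.55 for _ in range(n_runs)]
           for _ in range(n_tasks)]

def pass_at_1(results, run_index):
    """pass@1 when every task is scored from the same single run index."""
    return sum(task[run_index] for task in results) / len(results)

# The same benchmark yields a different pass@1 depending on which run is kept.
estimates = [pass_at_1(results, r) for r in range(n_runs)]
print([round(e, 2) for e in estimates])
print(f"spread: {(max(estimates) - min(estimates)) * 100:.1f} percentage points")
```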