Related papers: Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories

URL: http://arxiv.org/abs/2511.00197v1
Date: Fri, 31 Oct 2025 18:58:13 GMT
Title: Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories
Authors: Oorja Majgaonkar, Zhiwei Fei, Xiang Li, Federica Sarro, He Ye
Abstract summary: This paper presents an empirical study of agent trajectories, namely the execution traces capturing the steps agents take when attempting to resolve software issues. We analyse trajectories from three state-of-the-art code agents (OpenHands, SWE-agent, and Prometheus) on the SWE-Bench benchmark, examining both successful and failed attempts.
Score: 10.751728274263536
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The increasing deployment of Large Language Model (LLM) agents for complex software engineering tasks has created a need to understand their problem-solving behaviours beyond simple success metrics. While these agents demonstrate impressive capabilities in automated issue resolution, their decision-making processes remain largely opaque. This paper presents an empirical study of agent trajectories, namely the execution traces capturing the steps agents take when attempting to resolve software issues. We analyse trajectories from three state-of-the-art code agents (OpenHands, SWE-agent, and Prometheus) on the SWE-Bench benchmark, examining both successful and failed attempts. Our investigation reveals several key insights into agent behaviour. First, we identify how distinct problem-solving strategies, such as defensive programming and context gathering, enable success in different scenarios. Second, we find that failed trajectories are consistently longer and exhibit higher variance than successful ones, with failure patterns differing significantly between agents. Third, our fault localisation analysis shows that while most trajectories correctly identify problematic files (72-81% even in failures), success depends more on achieving approximate rather than exact code modifications. These and other findings unveiled by our study provide a foundation for understanding agent behaviour through trajectory analysis, contributing to the development of more robust and interpretable autonomous software engineering systems.

Related papers

The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution [63.61358761489141]
Large Language Model (LLM)-based agents are widely used in real-world applications such as customer service, web navigation, and software engineering. We propose a novel framework for general agentic attribution, designed to identify the internal factors driving agent actions regardless of the task outcome. We validate our framework across a diverse suite of agentic scenarios, including standard tool use and subtle reliability risks like memory-induced bias.
arXiv Detail & Related papers (2026-01-21T15:22:21Z)
Process-Centric Analysis of Agentic Software Systems [10.976178600911263]
We introduce Graphectory to encode the temporal and semantic relations in software systems. We analyze 4000 trajectories of two dominant agentic programming models. Our fully automated analyses reveal that agents using richer prompts exhibit more complex Graphectory.
arXiv Detail & Related papers (2025-12-02T04:12:29Z)
AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress [71.02263260394261]
Large language models (LLMs) still encounter challenges in multi-turn decision-making tasks. We build process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
arXiv Detail & Related papers (2025-11-11T14:57:54Z)
AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering [51.07491603393163]
AgentRouter is a framework that formulates multi-agent QA as a knowledge-graph-guided routing problem supervised by empirical performance signals. By leveraging soft supervision and weighted aggregation of agent outputs, AgentRouter learns principled collaboration schemes that capture the complementary strengths of diverse agents.
arXiv Detail & Related papers (2025-10-06T23:20:49Z)
An Empirical Study on Failures in Automated Issue Solving [12.571536148821144]
We analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, in automated issue solving tasks of SWE-Bench-Verified. To move from high-level performance metrics to underlying cause analysis, we conducted a systematic manual analysis of 150 failed instances. The results reveal distinct failure fingerprints between the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks.
arXiv Detail & Related papers (2025-09-17T13:07:52Z)
SWE-Exp: Experience-Driven Software Issue Resolution [19.525080502900785]
We introduce SWE-Exp, an experience-enhanced approach that distills concise and actionable experience from prior agent trajectories. Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts. Experiments show that SWE-Exp achieves state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified.
arXiv Detail & Related papers (2025-07-31T09:13:42Z)
Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories [17.975121612118752]
Large Language Model (LLM)-based agents are increasingly employed to automate complex software engineering tasks. We present a large-scale empirical study of the thought-action-result trajectories of three state-of-the-art LLM-based agents. We identify key trajectory characteristics, such as counts and token consumption, recurring action sequences, and the semantic coherence of thoughts, actions, and their results.
arXiv Detail & Related papers (2025-06-23T16:34:52Z)
Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study [15.97770416681533]
Software engineering agents (SWE agents) operate autonomously by interpreting user input and responding to environmental feedback. We present the first systematic study of SWE agent behavior through the lens of execution traces.
arXiv Detail & Related papers (2025-06-10T00:41:54Z)
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive. We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z)
Why Do Multi-Agent LLM Systems Fail? [87.90075668488434]
We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. We build the first Multi-Agent System Failure taxonomy (MAST). We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent).
arXiv Detail & Related papers (2025-03-17T19:04:38Z)
Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation [49.27250832754313]
We present AgentCOT, an LLM-based autonomous agent framework. At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence. We introduce two new strategies to enhance the performance of AgentCOT.
arXiv Detail & Related papers (2024-09-19T02:20:06Z)
Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement [50.481380478458945]
The Iterative step-level Process Refinement (IPR) framework provides detailed step-by-step guidance to enhance agent training. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines.
arXiv Detail & Related papers (2024-06-17T03:29:13Z)