A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks
- URL: http://arxiv.org/abs/2511.00872v1
- Date: Sun, 02 Nov 2025 09:46:59 GMT
- Title: A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks
- Authors: Zhuowen Yin, Cuifeng Gao, Chunsong Fan, Wenzhang Yang, Yinxing Xue, Lijun Zhang,
- Abstract summary: We evaluate seven general-purpose agent frameworks across three representative code-centric tasks.<n>Our findings reveal distinct capability patterns and trade-offs among the evaluated frameworks.<n>For overhead, software development incurs the highest monetary cost, while GPTswarm remains the most cost-efficient.
- Score: 14.762911285395047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unlike traditional automation tools or static LLM-based systems, agents combine decision-making and tool utilization to accomplish complex tasks, showing great potential in software engineering. However, existing studies largely focus on specific tasks or isolated aspects, providing an incomplete picture of agents' practical capabilities. To address this, we conduct a comprehensive empirical study evaluating seven general-purpose agent frameworks across three representative code-centric tasks: software development, vulnerability detection, and program repair. Each task is assessed using standard, widely adopted benchmarks to ensure objective and comparable evaluation. Agent performance is systematically analyzed from three complementary perspectives: effectiveness (task success), efficiency (execution process), and overhead (token consumption). Our findings reveal distinct capability patterns and trade-offs among the evaluated frameworks. In terms of effectiveness, agents achieve moderate overall performance. Regarding efficiency, AgentOrchestra tends to exhibit the longest trajectories and the most correction attempts due to coordination overhead, whereas OpenHands demonstrate stronger reflective reasoning abilities. For overhead, software development incurs the highest monetary cost, while GPTswarm remains the most cost-efficient. Furthermore, we conduct an in-depth cross-analysis of the relationship between effectiveness and efficiency, exploring the underlying reasons behind their interplay. These findings guide both practical adoption and future research toward more efficient software engineering agents.
Related papers
- Toward Efficient Agents: Memory, Tool learning, and Planning [96.93533945696156]
This paper investigates efficiency from three core components of agents: memory, tool learning, and planning, considering costs such as latency, tokens, steps, etc.
arXiv Detail & Related papers (2026-01-20T17:51:56Z) - LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce textbfLoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language models (LLMs) agents in realistic, long-context software engineering.<n>Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations.<n>Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z) - WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking [60.35109192765302]
Information seeking is a core capability that enables autonomous reasoning and decision-making.<n>We propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories.<n>Our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.
arXiv Detail & Related papers (2025-10-28T17:51:42Z) - How can we assess human-agent interactions? Case studies in software agent design [52.953425368394306]
We make two major steps towards the rigorous assessment of human-agent interactions.<n>We propose PULSE, a framework for more efficient human-centric evaluation of agent designs.<n>We deploy the framework on a large-scale web platform built around the open-source software agent OpenHands.
arXiv Detail & Related papers (2025-10-10T19:04:28Z) - Benchmarking LLM-based Agents for Single-cell Omics Analysis [6.915378212190715]
AI agents offer a paradigm shift, enabling adaptive planning, executable code generation, traceable decisions, and real-time knowledge fusion.<n>We introduce a novel benchmarking evaluation system to rigorously assess agent capabilities in single-cell omics analysis.
arXiv Detail & Related papers (2025-08-16T04:26:18Z) - Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study [15.97770416681533]
Software engineering agents (SWE agents) operate autonomously by interpreting user input and responding to environmental feedback.<n>We present the first systematic study of SWE agent behavior through the lens of execution traces.
arXiv Detail & Related papers (2025-06-10T00:41:54Z) - Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research [32.92036657863354]
Language agents powered by large language models (LLMs) have demonstrated remarkable capabilities in understanding, reasoning, and executing complex tasks.<n>However, developing robust agents presents significant challenges: substantial engineering overhead, lack of standardized components, and insufficient evaluation frameworks for fair comparison.<n>We introduce Agent Graph-based Orchestration for Reasoning and Assessment (AGORA), a flexible and abstraction framework that addresses these challenges.
arXiv Detail & Related papers (2025-05-30T08:46:23Z) - Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration [63.90193684394165]
We introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation.<n>During the experiential learning phase, we quantify the quality for each step in the task-solving workflow and store the resulting rewards.<n>During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step.
arXiv Detail & Related papers (2025-05-29T07:24:37Z) - Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement [50.481380478458945]
Iterative step-level Process Refinement (IPR) framework provides detailed step-by-step guidance to enhance agent training.
Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines.
arXiv Detail & Related papers (2024-06-17T03:29:13Z) - Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models [95.96734086126469]
Large language models (LLMs) can serve as the assistant to help users accomplish their jobs, and also support the development of advanced applications.
For the wide application of LLMs, the inference efficiency is an essential concern, which has been widely studied in existing work.
We perform a detailed coarse-to-fine analysis of the inference performance of various code libraries.
arXiv Detail & Related papers (2024-04-17T15:57:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.