Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios
- URL: http://arxiv.org/abs/2410.12468v1
- Date: Wed, 16 Oct 2024 11:33:57 GMT
- Title: Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios
- Authors: Zhi Chen, Lingxiao Jiang
- Abstract summary: This study evaluated 4,892 patches from 10 top-ranked agents on 500 real-world GitHub issues.
No single agent dominated, with 170 issues unresolved, indicating room for improvement.
Most agents maintained code reliability and security, avoiding new bugs or vulnerabilities.
Some agents increased code complexity, while many reduced code duplication and minimized code smells.
- Score: 13.949319911378826
- License:
- Abstract: In recent years, AI-based software engineering has progressed from pre-trained models to advanced agentic workflows, with Software Development Agents representing the next major leap. These agents, capable of reasoning, planning, and interacting with external environments, offer promising solutions to complex software engineering tasks. However, while much research has evaluated code generated by large language models (LLMs), comprehensive studies on agent-generated patches, particularly in real-world settings, are lacking. This study addresses that gap by evaluating 4,892 patches from 10 top-ranked agents on 500 real-world GitHub issues from SWE-Bench Verified, focusing on their impact on code quality. Our analysis shows no single agent dominated, with 170 issues unresolved, indicating room for improvement. Even for patches that passed unit tests and resolved issues, agents made different file and function modifications compared to the gold patches from repository developers, revealing limitations in the benchmark's test case coverage. Most agents maintained code reliability and security, avoiding new bugs or vulnerabilities; while some agents increased code complexity, many reduced code duplication and minimized code smells. Finally, agents performed better on simpler codebases, suggesting that breaking complex tasks into smaller sub-tasks could improve effectiveness. This study provides the first comprehensive evaluation of agent-generated patches on real-world GitHub issues, offering insights to advance AI-driven software development.
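The file- and function-level comparison against gold patches described in the abstract can be pictured with a short script. The following is a minimal sketch, assuming the third-party `unidiff` package and two hypothetical patch files (`agent.patch`, `gold.patch`); it is not the paper's actual analysis tooling.

```python
# Minimal sketch (not the paper's analysis code): compare which files an
# agent-generated patch and the developers' gold patch modify.
from unidiff import PatchSet  # third-party: pip install unidiff

def touched_files(patch_path: str) -> set[str]:
    """Return the set of file paths modified by a unified-diff patch."""
    with open(patch_path, encoding="utf-8") as fh:
        return {patched_file.path for patched_file in PatchSet(fh.read())}

agent_files = touched_files("agent.patch")   # hypothetical agent patch
gold_files = touched_files("gold.patch")     # hypothetical developer (gold) patch

print("files modified by both:  ", sorted(agent_files & gold_files))
print("only in the agent patch: ", sorted(agent_files - gold_files))
print("only in the gold patch:  ", sorted(gold_files - agent_files))
```

A disjoint or partially overlapping file set is the kind of divergence the study reports even for patches that pass the benchmark's unit tests.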
Related papers
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
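The extraction of dynamic calling chains from existing unit tests can be illustrated with a toy tracer. The sketch below shows the general technique using the standard-library trace hook; it is an assumption for illustration, not Codev-Agent's implementation, and `helper`/`test_helper` are made-up examples.

```python
# Toy dynamic call-chain extraction: run a test function under a trace hook
# and record every function call it triggers.  Not Codev-Agent's code.
import sys

def record_call_chain(test_func):
    """Execute `test_func` and return the chain of function calls it makes."""
    chain = []

    def tracer(frame, event, arg):
        if event == "call":
            code = frame.f_code
            chain.append(f"{code.co_filename}:{code.co_name}")
        return tracer

    sys.settrace(tracer)
    try:
        test_func()
    finally:
        sys.settrace(None)
    return chain

# Hypothetical usage with a trivial unit under test:
def helper():
    return 42

def test_helper():
    assert helper() == 42

print(record_call_chain(test_helper))
```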
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- MarsCode Agent: AI-native Automated Bug Fixing [7.909344108948294]
We introduce MarsCode Agent, a novel framework that leverages large language models to automatically identify and repair bugs in software code.
Our approach follows a systematic process of planning, bug reproduction, fault localization, candidate patch generation, and validation to ensure high-quality bug fixes.
Our results show that MarsCode Agent achieves a high success rate in bug fixing compared to most of the existing automated approaches.
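The bug-reproduction and fault-localization steps of such a process can be sketched concretely. The snippet below is a simplified assumption about how those steps might look (run a reproduction test, mine its traceback for an in-repository frame), not MarsCode Agent's implementation; the pytest invocation and paths are illustrative.

```python
# Simplified sketch of reproduction-driven fault localization (not MarsCode
# Agent's code): run a reproduction test with native tracebacks and take the
# last in-repository stack frame as the suspected fault location.
import re
import subprocess

def localize_from_reproduction(repo_dir: str, repro_test: str):
    """Return (file, line) of a suspicious frame, or None if the bug did not reproduce."""
    result = subprocess.run(
        ["python", "-m", "pytest", repro_test, "-q", "--tb=native"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if result.returncode == 0:
        return None  # test passed, so the issue was not reproduced
    frames = re.findall(r'File "([^"]+)", line (\d+)', result.stdout + result.stderr)
    # Crude heuristic: ignore frames from installed dependencies.
    in_repo = [(path, int(line)) for path, line in frames if "site-packages" not in path]
    return in_repo[-1] if in_repo else None

# Hypothetical usage:
# print(localize_from_reproduction("./repo", "tests/test_issue_reproduction.py"))
```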
arXiv Detail & Related papers (2024-09-02T02:24:38Z)
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents [109.8507367518992]
We introduce OpenHands, a platform for the development of AI agents that interact with the world in similar ways to a human developer.
We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, and incorporation of evaluation benchmarks.
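Sandboxed code execution can be approximated, very loosely, with process isolation. The sketch below only illustrates the idea (OpenHands itself relies on proper containerized sandboxes); the helper name and timeout are assumptions.

```python
# Loose stand-in for sandboxed execution (not OpenHands' sandbox): run
# agent-written code in a separate Python process with a timeout.
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: float = 5.0):
    """Execute `code` in a child process; return (returncode, stdout, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as fh:
        fh.write(code)
        script_path = fh.name
    proc = subprocess.run(
        [sys.executable, script_path],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return proc.returncode, proc.stdout, proc.stderr

# Hypothetical usage:
rc, out, err = run_untrusted("print('hello from the child process')")
print(rc, out.strip())
```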
arXiv Detail & Related papers (2024-07-23T17:50:43Z)
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains: Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- Agentless: Demystifying LLM-based Software Engineering Agents [12.19683999553113]
We build Agentless -- an agentless approach to automatically solve software development problems.
Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation.
Our results on the popular SWE-bench Lite benchmark show that, surprisingly, the simplistic Agentless achieves both the highest performance and a low cost.
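Of the three phases named above, localization is the simplest to picture. The toy function below ranks repository files by lexical overlap with the issue text; it is a stand-in assumption for a file-level localization step, not the Agentless implementation.

```python
# Toy file-level localization (not the Agentless code): rank Python files by
# how many identifiers/words they share with the issue text.
import os
import re

def localize_files(issue_text: str, repo_dir: str, top_k: int = 5):
    """Return the top-k Python files whose contents overlap most with the issue."""
    issue_words = set(re.findall(r"[A-Za-z_]{3,}", issue_text.lower()))
    scored = []
    for root, _dirs, files in os.walk(repo_dir):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    file_words = set(re.findall(r"[A-Za-z_]{3,}", fh.read().lower()))
            except OSError:
                continue
            scored.append((len(issue_words & file_words), path))
    return [path for _score, path in sorted(scored, reverse=True)[:top_k]]

# Hypothetical usage:
# print(localize_files("TypeError in utils.parse_date when tz is None", "./repo"))
```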
arXiv Detail & Related papers (2024-07-01T17:24:45Z)
- Agent-Driven Automatic Software Improvement [55.2480439325792]
This research proposal aims to explore innovative solutions by focusing on the deployment of agents powered by Large Language Models (LLMs).
The iterative nature of agents, which allows for continuous learning and adaptation, can help surpass common challenges in code generation.
We aim to use the iterative feedback in these systems to further fine-tune the LLMs underlying the agents, so that they become better aligned with the task of automated software improvement.
arXiv Detail & Related papers (2024-06-24T15:45:22Z)
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
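One way to picture repository-level knowledge is a condensed map from each module to its top-level definitions. The sketch below is an assumption about what such a summary could contain, not RepoUnderstander's method.

```python
# Sketch of a condensed repository summary (not RepoUnderstander's code): map
# each Python module to its top-level classes (with methods) and functions.
import ast
import os

def summarize_repository(repo_dir: str) -> dict:
    summary = {}
    for root, _dirs, files in os.walk(repo_dir):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                source = fh.read()
            try:
                tree = ast.parse(source)
            except SyntaxError:
                continue
            entries = []
            for node in tree.body:
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    entries.append(f"def {node.name}")
                elif isinstance(node, ast.ClassDef):
                    methods = [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
                    entries.append(f"class {node.name} ({', '.join(methods)})")
            summary[os.path.relpath(path, repo_dir)] = entries
    return summary

# Hypothetical usage:
# for module, items in summarize_repository("./repo").items():
#     print(module, "->", items)
```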
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- A Unified Debugging Approach via LLM-Based Multi-Agent Synergy [39.11825182386288]
FixAgent is an end-to-end framework for unified debugging through multi-agent synergy.
It significantly outperforms state-of-the-art repair methods, fixing 1.25× to 2.56× as many bugs on the repo-level benchmark Defects4J.
arXiv Detail & Related papers (2024-04-26T04:55:35Z)
- AutoCodeRover: Autonomous Program Improvement [8.66280420062806]
We propose an automated approach for solving GitHub issues to autonomously achieve program improvement.
In our approach called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch.
Experiments on SWE-bench-lite (300 real-life GitHub issues) show an efficacy of 19% in solving GitHub issues, higher than that of the recently reported SWE-agent.
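The "sophisticated code search" mentioned above can be illustrated with one search primitive. The function below is a toy stand-in built on Python's `ast` module; its name and behavior are assumptions for this sketch, not AutoCodeRover's actual API.

```python
# Toy code-search primitive (an illustrative assumption, not AutoCodeRover's
# API): locate a method definition inside a named class anywhere in the repo,
# so the relevant snippet can be shown to an LLM before it drafts a patch.
import ast
import os

def search_method_in_class(repo_dir: str, class_name: str, method_name: str):
    """Yield (file, line, source_snippet) for each matching method definition."""
    for root, _dirs, files in os.walk(repo_dir):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                source = fh.read()
            try:
                tree = ast.parse(source)
            except SyntaxError:
                continue
            for node in ast.walk(tree):
                if isinstance(node, ast.ClassDef) and node.name == class_name:
                    for item in node.body:
                        if isinstance(item, ast.FunctionDef) and item.name == method_name:
                            yield path, item.lineno, ast.get_source_segment(source, item)

# Hypothetical usage:
# for path, line, snippet in search_method_in_class("./repo", "Model", "save"):
#     print(f"{path}:{line}\n{snippet}")
```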
arXiv Detail & Related papers (2024-04-08T11:55:09Z)
- CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges [41.038584732889895]
Large Language Models (LLMs) have shown promise in automated code generation but typically excel only in simpler tasks.
Our research pivots towards evaluating LLMs in a more realistic setting -- real-world repo-level code generation.
We present CodeAgent, a novel LLM-based agent framework that employs external tools for effective repo-level code generation.
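A tool-integrated agent needs some way to route model-issued actions to external tools. The dispatch loop below is a hypothetical sketch of that wiring, with two made-up tools (`grep`, `read`); it is not CodeAgent's framework.

```python
# Hypothetical tool dispatch (not CodeAgent's framework): the model emits a
# "tool: argument" action string and the agent runs the matching external tool.
import subprocess
from typing import Callable, Dict

def grep_repo(pattern: str) -> str:
    """Example external tool: search the working tree with `git grep`."""
    result = subprocess.run(["git", "grep", "-n", pattern],
                            capture_output=True, text=True)
    return result.stdout or "(no matches)"

def read_file(path: str) -> str:
    """Example external tool: return a file's contents."""
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return fh.read()

TOOLS: Dict[str, Callable[[str], str]] = {"grep": grep_repo, "read": read_file}

def dispatch(action: str) -> str:
    """Parse a model-produced 'tool: argument' string and execute the tool."""
    tool, _, arg = action.partition(":")
    handler = TOOLS.get(tool.strip())
    return handler(arg.strip()) if handler else f"unknown tool: {tool.strip()!r}"

# Hypothetical usage with a model-produced action:
print(dispatch("grep: def parse_config"))
```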
arXiv Detail & Related papers (2024-01-14T18:12:03Z)
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [80.52201658231895]
SWE-bench is an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories.
We show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues.
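At its core, resolving an SWE-bench instance means a generated patch applies cleanly and the repository's tests pass. The sketch below is a simplified assumption about that check; the official harness additionally pins per-instance environments and runs instance-specific fail-to-pass tests.

```python
# Simplified, unofficial sketch of an SWE-bench-style resolution check: apply
# a model-generated patch to a repository checkout and run the test suite.
import subprocess

def is_resolved(repo_dir: str, patch_text: str) -> bool:
    """Apply `patch_text` with git and run pytest; True if everything passes."""
    applied = subprocess.run(["git", "-C", repo_dir, "apply", "-"],
                             input=patch_text, text=True)
    if applied.returncode != 0:
        return False  # the patch did not apply cleanly
    tests = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir)
    return tests.returncode == 0

# Hypothetical usage:
# with open("model.patch", encoding="utf-8") as fh:
#     print(is_resolved("./repo_checkout", fh.read()))
```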
arXiv Detail & Related papers (2023-10-10T16:47:29Z)