UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench
- URL: http://arxiv.org/abs/2506.09289v1
- Date: Tue, 10 Jun 2025 22:56:49 GMT
- Title: UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench
- Authors: Boxi Yu, Yuxuan Zhu, Pinjia He, Daniel Kang,
- Abstract summary: We introduce UTGenerator, an LLM-driven test case generator to generate test cases for real-world Python projects.<n>Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation.<n>In our evaluation, we identified 36 task instances with insufficient test cases and uncovered 345 erroneous patches incorrectly labeled as passed in the original SWE Bench.
- Score: 8.00058513405915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of Large Language Models (LLMs) has spurred the development of coding agents for real-world code generation. As a widely used benchmark for evaluating the code generation capabilities of these agents, SWE-Bench uses real-world problems based on GitHub issues and their corresponding pull requests. However, the manually written test cases included in these pull requests are often insufficient, allowing generated patches to pass the tests without resolving the underlying issue. To address this challenge, we introduce UTGenerator, an LLM-driven test case generator that automatically analyzes codebases and dependencies to generate test cases for real-world Python projects. Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation. In our evaluation, we identified 36 task instances with insufficient test cases and uncovered 345 erroneous patches incorrectly labeled as passed in the original SWE Bench. These corrections, impacting 40.9% of SWE-Bench Lite and 24.4% of SWE-Bench Verified leaderboard entries, yield 18 and 11 ranking changes, respectively.
Related papers
- Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering [51.7496756448709]
Language models (LMs) perform well on coding benchmarks but struggle with real-world software engineering tasks.<n>Existing approaches rely on supervised fine-tuning with high-quality data, which is expensive to curate at scale.<n>We propose Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process.
arXiv Detail & Related papers (2025-05-29T16:15:36Z) - Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution [22.03052751722933]
Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads.<n>We have identified the most prevalent errors -- such as ModuleNotFoundError and TypeError -- and highlighted particularly challenging errors like OSError and database-related issues.
arXiv Detail & Related papers (2025-03-16T06:24:51Z) - Automated Benchmark Generation for Repository-Level Coding Tasks [7.305342793164905]
SetUpAgent is a fully automated system capable of historically accurate dependency setup, test execution, and result parsing.<n>We generate two new datasets: (i) SWEE-Bench an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) SWA-Bench a benchmark focusing on applications rather than libraries.
arXiv Detail & Related papers (2025-03-10T17:42:49Z) - Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.<n>However, improvement is plateauing due to the exhaustion of readily available high-quality data.<n>We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z) - CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.<n>The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z) - Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs)<n>We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs.<n>We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z) - Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases.
We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
arXiv Detail & Related papers (2024-08-21T15:35:34Z) - SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
We investigate the capability of LLM-based Code Agents to formalize user issues into test cases.<n>We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests.<n>We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed for test generation.
arXiv Detail & Related papers (2024-06-18T14:54:37Z) - Large Language Models as Test Case Generators: Performance Evaluation and Enhancement [3.5398126682962587]
We study how well Large Language Models can generate high-quality test cases.
We propose a multi-agent framework called emphTestChain that decouples the generation of test inputs and test outputs.
Our results indicate that TestChain outperforms the baseline by a large margin.
arXiv Detail & Related papers (2024-04-20T10:27:01Z) - CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.