Execution-Feedback Driven Test Generation from SWE Issues
- URL: http://arxiv.org/abs/2508.06365v1
- Date: Fri, 08 Aug 2025 14:49:36 GMT
- Title: Execution-Feedback Driven Test Generation from SWE Issues
- Authors: Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, Martin Hirzel
- Abstract summary: This paper introduces novel techniques for leveraging execution feedback to get around this problem, implemented in a new reproduction test generator called e-Otter++. Experiments show that e-Otter++ represents a leap ahead in the state-of-the-art for this problem, generating tests with an average fail-to-pass rate of 63% on the TDD-Bench Verified benchmark.
- Score: 8.685764659884367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A software engineering issue (SWE issue) is easier to resolve when accompanied by a reproduction test. Unfortunately, most issues do not come with functioning reproduction tests, so this paper explores how to generate them automatically. The primary challenge in this setting is that the code to be tested is either missing or wrong, as evidenced by the existence of the issue in the first place. This has held back test generation for this setting: without the correct code to execute, it is difficult to leverage execution feedback to generate good tests. This paper introduces novel techniques for leveraging execution feedback to get around this problem, implemented in a new reproduction test generator called e-Otter++. Experiments show that e-Otter++ represents a leap ahead in the state-of-the-art for this problem, generating tests with an average fail-to-pass rate of 63% on the TDD-Bench Verified benchmark.
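The fail-to-pass rate counts a generated reproduction test as successful only if it fails on the buggy pre-fix code and passes once the fix is applied. Below is a minimal sketch of how such a check might be scored; the pytest-based runner, git checkout workflow, and helper names are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Minimal sketch of scoring a "fail-to-pass" reproduction test.
# Repository paths, commit ids, and the pytest-based runner are assumptions.
import subprocess


def run_test(repo_dir: str, test_file: str) -> bool:
    """Return True if the test passes in the current checkout."""
    result = subprocess.run(
        ["python", "-m", "pytest", test_file, "-x", "-q"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def is_fail_to_pass(repo_dir: str, test_file: str, pre_commit: str, post_commit: str) -> bool:
    """A reproduction test is fail-to-pass if it fails before the fix and passes after it."""
    subprocess.run(["git", "checkout", pre_commit], cwd=repo_dir, check=True)
    failed_before = not run_test(repo_dir, test_file)

    subprocess.run(["git", "checkout", post_commit], cwd=repo_dir, check=True)
    passed_after = run_test(repo_dir, test_file)

    return failed_before and passed_after


def fail_to_pass_rate(results: list[bool]) -> float:
    """Fraction of issues whose generated test is fail-to-pass (e.g., 0.63 for 63%)."""
    return sum(results) / len(results) if results else 0.0
```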
Related papers
- ATGen: Adversarial Reinforcement Learning for Test Case Generation [78.48498301767079]
Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs. Existing test generation methods rely on static datasets. We introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning.
arXiv Detail & Related papers (2025-10-16T12:49:25Z)
- EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems [59.66823584073748]
A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time. We present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode.
arXiv Detail & Related papers (2025-10-15T07:16:28Z)
- AutoCode: LLMs as Problem Setters for Competitive Programming [94.71566758494787]
We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments.
arXiv Detail & Related papers (2025-09-29T17:59:03Z)
- AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage [62.049868205196425]
AutoReproduce is a framework capable of automatically reproducing experiments described in research papers in an end-to-end manner. Results show that AutoReproduce achieves an average performance gap of 22.1% on 89.74% of the executable experiment runs.
arXiv Detail & Related papers (2025-05-27T03:15:21Z)
- Issue2Test: Generating Reproducing Test Cases from Issue Reports [21.28421180698285]
A crucial step toward successfully solving an issue is creating a test case that accurately reproduces the issue. This paper presents Issue2Test, an LLM-based technique for automatically generating a reproducing test case for a given issue report. We evaluate Issue2Test on the SWT-bench-lite dataset, where it successfully reproduces 30.4% of the issues.
arXiv Detail & Related papers (2025-03-20T16:44:00Z)
- Otter: Generating Tests from Issues to Validate SWE Patches [12.353105297285802]
This paper introduces TDD-Bench-Verified, a benchmark for generating tests from issues, and Otter, an LLM-based solution for this task. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planner. A loose sketch of the check-and-repair idea follows below.
arXiv Detail & Related papers (2025-02-07T22:41:31Z)
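The Otter summary mentions rule-based checks used to validate and repair LLM output. The sketch below illustrates one way such a loop could look; the individual checks, the repair prompt, and the `generate` callable standing in for the LLM are hypothetical placeholders, not the paper's implementation.

```python
# Loose sketch of a check-and-repair loop in the spirit of rule-based output
# analysis. All check rules and prompts here are illustrative assumptions.
import ast
from typing import Callable


def static_checks(test_code: str) -> list[str]:
    """Return a list of human-readable problems found by simple rules."""
    problems = []
    try:
        tree = ast.parse(test_code)
    except SyntaxError as err:
        return [f"syntax error: {err}"]
    has_test = any(
        isinstance(node, ast.FunctionDef) and node.name.startswith("test_")
        for node in ast.walk(tree)
    )
    if not has_test:
        problems.append("no function named test_* was defined")
    if "assert" not in test_code:
        problems.append("the test contains no assertions")
    return problems


def generate_and_repair(issue_text: str, generate: Callable[[str], str], max_rounds: int = 3) -> str:
    """Ask the model for a test, then re-prompt with rule-based feedback until checks pass."""
    prompt = f"Write a pytest reproduction test for this issue:\n{issue_text}"
    test_code = generate(prompt)
    for _ in range(max_rounds):
        problems = static_checks(test_code)
        if not problems:
            break
        repair_prompt = (
            f"{prompt}\n\nYour previous test had problems:\n- "
            + "\n- ".join(problems)
            + "\nPlease fix them."
        )
        test_code = generate(repair_prompt)
    return test_code
```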
- Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs). We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs. We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z)
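The UTGen summary above describes a metric that rewards a generated unit test only when its input exposes the bug and its expected output is correct. A small sketch of that joint check is below, assuming access to both the faulty code and a gold implementation for evaluation purposes; the function names and exception handling are illustrative assumptions.

```python
# Sketch of the joint "error-revealing input + correct expected output" check
# suggested by the summary above; callables are assumed evaluation inputs.
from typing import Any, Callable


def is_useful_unit_test(
    buggy_fn: Callable[..., Any],
    gold_fn: Callable[..., Any],
    test_input: tuple,
    expected_output: Any,
) -> bool:
    """True if the expected output is correct and the buggy code fails on this input."""
    # The expected output must match the intended (gold) behaviour.
    if gold_fn(*test_input) != expected_output:
        return False
    # The input must reveal the error: the buggy code disagrees or crashes.
    try:
        return buggy_fn(*test_input) != expected_output
    except Exception:
        return True


# Example: a buggy absolute-value function that mishandles negatives.
def buggy_abs(x):
    return x  # deliberately wrong for negative inputs


print(is_useful_unit_test(buggy_abs, abs, (-3,), 3))  # True: input reveals the bug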
- TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved? [11.762669773233474]
Test-driven development (TDD) is the practice of writing tests first and coding later. This paper introduces TDD-Bench Verified, a high-quality benchmark suite of 449 issues mined from real-world GitHub code repositories.
arXiv Detail & Related papers (2024-12-03T22:38:05Z)
- TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. It covers initial test authoring, test suite completion, and code coverage improvements. We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z)
- TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.509927117191286]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases. We propose TestART, a novel unit test generation method. TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
- Test-Driven Development for Code Generation [0.850206009406913]
Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements.
This paper investigates if and how Test-Driven Development (TDD) can be incorporated into AI-assisted code-generation processes.
arXiv Detail & Related papers (2024-02-21T04:10:12Z)
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
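The Meta work above carves unit tests from serialized observations of objects seen during real app executions. The toy sketch below shows the carve-and-replay idea; the pickle-based serialization, file layout, and helper names are illustrative assumptions rather than TestGen's actual mechanism.

```python
# Toy sketch of carving a unit test from a runtime observation.
# Serialization format and helper names are illustrative assumptions.
import pickle
from typing import Any


def record_observation(receiver: Any, method: str, args: tuple, path: str) -> None:
    """During app execution, snapshot the receiver, invoke the method, and store the result."""
    snapshot = pickle.dumps(receiver)          # state before the call
    result = getattr(receiver, method)(*args)  # observed behaviour
    with open(path, "wb") as f:
        pickle.dump({"receiver": snapshot, "method": method, "args": args, "result": result}, f)


def emit_carved_test(path: str) -> str:
    """Emit pytest source that replays the recorded call and asserts the observed result."""
    return f'''
import pickle

def test_carved_observation():
    with open({path!r}, "rb") as f:
        obs = pickle.load(f)
    receiver = pickle.loads(obs["receiver"])
    replayed = getattr(receiver, obs["method"])(*obs["args"])
    assert replayed == obs["result"]
'''
```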
- Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles.
The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z)
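The oracle-amplification summary above monitors object state during test execution and compares it to the previous version of the system under test. A compact sketch of that state-diff step is shown below; the snapshot helper and the way state is captured are illustrative assumptions.

```python
# Compact sketch of amplifying a regression oracle by diffing object state
# observed on two versions of the system under test (assumed workflow).
from typing import Any


def snapshot(obj: Any) -> dict:
    """Capture the public attributes of an object as a plain dict."""
    return {k: v for k, v in vars(obj).items() if not k.startswith("_")}


def state_differences(old_state: dict, new_state: dict) -> dict:
    """Attributes whose values changed between versions; candidates for review or new assertions."""
    keys = old_state.keys() | new_state.keys()
    return {
        k: (old_state.get(k), new_state.get(k))
        for k in keys
        if old_state.get(k) != new_state.get(k)
    }


# Usage idea: run the same test on both versions, snapshot the objects it
# touches, turn unchanged attributes into assertions, and flag the changed
# ones for review against the intended behaviour.
```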
- CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
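CodeT, summarized above, executes every candidate solution against the generated tests and then picks the candidate with the strongest agreement: solutions that pass the same large set of tests, together with many peers, score highest. The sketch below shows a consensus-style scoring of that kind; the exact ranking function and the `passes` callable are assumptions loosely following the paper's dual execution agreement idea.

```python
# Sketch of consensus-style ranking over candidate solutions and generated
# tests, loosely following the dual execution agreement idea (assumed scoring).
from collections import defaultdict
from typing import Callable


def choose_best_solution(
    solutions: list[str],
    tests: list[str],
    passes: Callable[[str, str], bool],
) -> str:
    """Group solutions by the exact set of tests they pass; score = group size x tests passed."""
    if not solutions:
        raise ValueError("no candidate solutions")

    groups: dict[frozenset, list[str]] = defaultdict(list)
    for solution in solutions:
        passed = frozenset(t for t in tests if passes(solution, t))
        groups[passed].append(solution)

    best_score, best_solution = -1, solutions[0]
    for passed, members in groups.items():
        score = len(members) * len(passed)
        if score > best_score:
            best_score, best_solution = score, members[0]
    return best_solution
```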
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.