TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved?
- URL: http://arxiv.org/abs/2412.02883v1
- Date: Tue, 03 Dec 2024 22:38:05 GMT
- Title: TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved?
- Authors: Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, Saurabh Sinha
- Abstract summary: Test-driven development (TDD) is the practice of writing tests first and coding later. This paper introduces TDD-Bench Verified, a high-quality benchmark suite of 449 issues mined from real-world GitHub code repositories.
- Score: 11.762669773233474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-driven development (TDD) is the practice of writing tests first and coding later, and the proponents of TDD expound its numerous benefits. For instance, given an issue on a source code repository, tests can clarify the desired behavior among stakeholders before anyone writes code for the agreed-upon fix. Although there has been a lot of work on automated test generation for the practice "write code first, test later", there has been little such automation for TDD. Ideally, tests for TDD should be fail-to-pass (i.e., fail before the issue is resolved and pass after) and have good adequacy with respect to covering the code changed during issue resolution. This paper introduces TDD-Bench Verified, a high-quality benchmark suite of 449 issues mined from real-world GitHub code repositories. The benchmark's evaluation harness runs only relevant tests in isolation for simple yet accurate coverage measurements, and the benchmark's dataset is filtered both by human judges and by execution in the harness. This paper also presents Auto-TDD, an LLM-based solution that takes as input an issue description and a codebase (prior to issue resolution) and returns as output a test that can be used to validate the changes made for resolving the issue. Our evaluation shows that Auto-TDD yields a better fail-to-pass rate than the strongest prior work while also yielding high coverage adequacy. Overall, we hope that this work helps make developers more productive at resolving issues while simultaneously leading to more robust fixes.
 
      
Related papers
- CLEVER: A Curated Benchmark for Formally Verified Code Generation [57.476483009565044]
 CLEVER is a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification.
 arXiv  Detail & Related papers  (2025-05-20T05:15:47Z)
- Studying the Impact of Early Test Termination Due to Assertion Failure on Code Coverage and Spectrum-based Fault Localization [48.22524837906857]
 This study is the first empirical study on early test termination due to assertion failure.
We investigated 207 versions of 6 open-source projects.
Our findings indicate that early test termination harms both code coverage and the effectiveness of spectrum-based fault localization.
 arXiv  Detail & Related papers  (2025-04-06T17:14:09Z)
- KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
 KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data.
It comprises question-solution-test triplets that are systematically validated via a self-verification procedure.
This pipeline yields a large-scale, robust and diverse coding dataset.
 arXiv  Detail & Related papers  (2025-03-04T19:17:36Z)
- Otter: Generating Tests from Issues to Validate SWE Patches [12.353105297285802]
 This paper introduces Otter, an LLM-based solution for generating tests from issues.
Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planning stage.
 Experiments show Otter outperforming state-of-the-art systems for generating tests from issues.
 arXiv  Detail & Related papers  (2025-02-07T22:41:31Z)
- Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
 Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs).
We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs.
We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
 arXiv  Detail & Related papers  (2025-02-03T18:51:43Z)
- TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
 TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories.
It covers initial test authoring, test suite completion, and code coverage improvements.
We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
 arXiv  Detail & Related papers  (2024-10-01T14:47:05Z)
- SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
 We investigate the capability of LLM-based Code Agents to formalize user issues into test cases.
We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests.
We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed for test generation.
 arXiv  Detail & Related papers  (2024-06-18T14:54:37Z)
- Test-Driven Development for Code Generation [0.850206009406913]
 Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements.
This paper investigates if and how Test-Driven Development (TDD) can be incorporated into AI-assisted code-generation processes.
 arXiv  Detail & Related papers  (2024-02-21T04:10:12Z)
- PyTester: Deep Reinforcement Learning for Text-to-Testcase Generation [20.441921569948562]
 Test-driven development (TDD) mandates writing test cases based on requirements before writing the actual code.
While writing test cases is the centerpiece of TDD, it is time-consuming, expensive, and often shunned by developers.
We introduce PyTester, a Text-to-Testcase generation approach that can automatically generate correct, executable, complete, and effective test cases.
 arXiv  Detail & Related papers  (2024-01-15T10:21:58Z)
- AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [64.9230895853942]
 Domain generalization can be arbitrarily hard without exploiting target domain information.
Test-time adaptive (TTA) methods are proposed to address this issue.
In this work, we adopt a non-parametric classifier to perform test-time adaptation (AdaNPC).
 arXiv  Detail & Related papers  (2023-04-25T04:23:13Z)
- Learning Deep Semantics for Test Completion [46.842174440120196]
 We formalize the novel task of test completion to automatically complete the next statement in a test method based on the context of prior statements and the code under test.
We develop TeCo -- a deep learning model using code semantics for test completion.
 arXiv  Detail & Related papers  (2023-02-20T18:53:56Z)
- Sequential Kernelized Independence Testing [101.22966794822084]
 We design sequential kernelized independence tests inspired by kernelized dependence measures.
We demonstrate the power of our approaches on both simulated and real data.
 arXiv  Detail & Related papers  (2022-12-14T18:08:42Z)
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
 We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
 arXiv  Detail & Related papers  (2022-12-12T06:29:04Z)
- CodeT: Code Generation with Generated Tests [49.622590050797236]
 We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
 arXiv  Detail & Related papers  (2022-07-21T10:18:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     