UTFix: Change Aware Unit Test Repairing using LLM
- URL: http://arxiv.org/abs/2503.14924v1
- Date: Wed, 19 Mar 2025 06:10:03 GMT
- Title: UTFix: Change Aware Unit Test Repairing using LLM
- Authors: Shanto Rahman, Sachit Kuhar, Berk Cirisci, Pranav Garg, Shiqi Wang, Xiaofei Ma, Anoop Deoras, Baishakhi Ray,
- Abstract summary: We present UTFix, a novel approach for repairing unit tests when their corresponding focal methods undergo changes.<n>Our approach leverages language models to repair unit tests by providing contextual information such as static code slices, dynamic code slices, and failure messages.<n>To the best of our knowledge, this is the first comprehensive study focused on unit test in evolving Python projects.
- Score: 24.12850207529614
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Software updates, including bug repair and feature additions, are frequent in modern applications but they often leave test suites outdated, resulting in undetected bugs and increased chances of system failures. A recent study by Meta revealed that 14%-22% of software failures stem from outdated tests that fail to reflect changes in the codebase. This highlights the need to keep tests in sync with code changes to ensure software reliability. In this paper, we present UTFix, a novel approach for repairing unit tests when their corresponding focal methods undergo changes. UTFix addresses two critical issues: assertion failure and reduced code coverage caused by changes in the focal method. Our approach leverages language models to repair unit tests by providing contextual information such as static code slices, dynamic code slices, and failure messages. We evaluate UTFix on our generated synthetic benchmarks (Tool-Bench), and real-world benchmarks. Tool- Bench includes diverse changes from popular open-source Python GitHub projects, where UTFix successfully repaired 89.2% of assertion failures and achieved 100% code coverage for 96 tests out of 369 tests. On the real-world benchmarks, UTFix repairs 60% of assertion failures while achieving 100% code coverage for 19 out of 30 unit tests. To the best of our knowledge, this is the first comprehensive study focused on unit test in evolving Python projects. Our contributions include the development of UTFix, the creation of Tool-Bench and real-world benchmarks, and the demonstration of the effectiveness of LLM-based methods in addressing unit test failures due to software evolution.
Related papers
- Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs)<n>We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs.<n>We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z) - TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories.<n>It covers initial tests authoring, test suite completion, and code coverage improvements.<n>We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z) - TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.833381226332574]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases.<n>We propose TestART, a novel unit test generation method.<n>TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z) - Fix the Tests: Augmenting LLMs to Repair Test Cases with Static Collector and Neural Reranker [9.428021853841296]
We propose SYNTER, a novel approach to automatically repair obsolete test cases via precise and concise TROCtxs construction.
With the augmentation of constructed TROCtxs, hallucinations are reduced by 57.1%.
arXiv Detail & Related papers (2024-07-04T04:24:43Z) - ContrastRepair: Enhancing Conversation-Based Automated Program Repair
via Contrastive Test Case Pairs [23.419180504723546]
ContrastRepair is a novel APR approach that augments conversation-driven APR by providing contrastive test pairs.
We evaluate ContrastRepair on multiple benchmark datasets, including Defects4j, QuixBugs, and HumanEval-Java.
arXiv Detail & Related papers (2024-03-04T12:15:28Z) - Automated Unit Test Improvement using Large Language Models at Meta [44.87533111512982]
This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests.
We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms.
arXiv Detail & Related papers (2024-02-14T13:43:14Z) - A Novel Approach for Automatic Program Repair using Round-Trip
Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back.
Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair.
This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
arXiv Detail & Related papers (2024-01-15T22:36:31Z) - DS-1000: A Natural and Reliable Benchmark for Data Science Code
Generation [70.96868419971756]
DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries.
First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow.
Second, our automatic evaluation is highly specific (reliable) -- across all Codex-predicted solutions that our evaluation accept, only 1.8% of them are incorrect.
arXiv Detail & Related papers (2022-11-18T17:20:27Z) - SUPERNOVA: Automating Test Selection and Defect Prevention in AAA Video
Games Using Risk Based Testing and Machine Learning [62.997667081978825]
Testing video games is an increasingly difficult task as traditional methods fail to scale with growing software systems.
We present SUPERNOVA, a system responsible for test selection and defect prevention while also functioning as an automation hub.
The direct impact of this has been observed to be a reduction in 55% or more testing hours for an undisclosed sports game title.
arXiv Detail & Related papers (2022-03-10T00:47:46Z) - Break-It-Fix-It: Unsupervised Learning for Program Repair [90.55497679266442]
We propose a new training approach, Break-It-Fix-It (BIFI), which has two key ideas.
We use the critic to check a fixer's output on real bad inputs and add good (fixed) outputs to the training data.
Based on these ideas, we iteratively update the breaker and the fixer while using them in conjunction to generate more paired data.
BIFI outperforms existing methods, obtaining 90.5% repair accuracy on GitHub-Python and 71.7% on DeepFix.
arXiv Detail & Related papers (2021-06-11T20:31:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.