Automated Unit Test Improvement using Large Language Models at Meta
- URL: http://arxiv.org/abs/2402.09171v1
- Date: Wed, 14 Feb 2024 13:43:14 GMT
- Title: Automated Unit Test Improvement using Large Language Models at Meta
- Authors: Nadia Alshahwan, Jubin Chheda, Anastasia Finegenova, Beliz Gokkaya,
Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, Eddy Wang
- Abstract summary: This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests.
We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms.
- Score: 44.87533111512982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes Meta's TestGen-LLM tool, which uses LLMs to
automatically improve existing human-written tests. TestGen-LLM verifies that
its generated test classes successfully clear a set of filters that assure
measurable improvement over the original test suite, thereby eliminating
problems due to LLM hallucination. We describe the deployment of TestGen-LLM at
Meta test-a-thons for the Instagram and Facebook platforms. In an evaluation on
Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built
correctly, 57% passed reliably, and 25% increased coverage. During Meta's
Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which
it was applied, with 73% of its recommendations being accepted for production
deployment by Meta software engineers. We believe this is the first report on
industrial scale deployment of LLM-generated code backed by such assurances of
code improvement.
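The filter chain described in the abstract (builds correctly, passes reliably, increases coverage) can be illustrated with a minimal sketch. This is not Meta's implementation: the `Candidate` type, the `builds`, `passes`, and `covered_lines` callables, the default of five repeat runs, and the choice to accumulate coverage across accepted candidates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical LLM-proposed test case, identified only by name."""
    name: str


def filter_candidates(
    candidates: Iterable[Candidate],
    builds: Callable[[Candidate], bool],
    passes: Callable[[Candidate], bool],
    covered_lines: Callable[[Candidate], FrozenSet[int]],
    baseline_coverage: FrozenSet[int],
    repeat_runs: int = 5,  # assumed value, not taken from the abstract
) -> List[Candidate]:
    """Keep only candidates that clear all three filters in order."""
    kept: List[Candidate] = []
    covered = set(baseline_coverage)
    for cand in candidates:
        # Filter 1: the candidate must build.
        if not builds(cand):
            continue
        # Filter 2: the candidate must pass on every repeated run,
        # which screens out flaky tests.
        if not all(passes(cand) for _ in range(repeat_runs)):
            continue
        # Filter 3: the candidate must cover at least one line that is
        # not covered by the baseline or by an already accepted candidate.
        new_lines = covered_lines(cand) - covered
        if not new_lines:
            continue
        kept.append(cand)
        covered |= new_lines
    return kept
```

Because every surviving candidate has been built, run repeatedly, and measured against the existing suite, a hallucinated test that does not compile or does not add coverage is discarded before a human reviews it.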
Related papers
- ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms [48.43237545197775]
Unit test generation has become a promising and important use case of LLMs.
ProjectTest is a project-level benchmark for unit test generation covering Python, Java, and JavaScript.
arXiv Detail & Related papers (2025-02-10T15:24:30Z)
- Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to a large language model (LLM).
We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs.
We show that UTGen outperforms UT generation baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z)
- AugmenTest: Enhancing Tests with LLM-Driven Oracles [2.159639193866661]
AugmenTest is an approach leveraging Large Language Models to infer correct test oracles based on available documentation of the software under test.
AugmenTest includes four variants: Simple Prompt, Extended Prompt, RAG with a generic prompt (without the context of class or method under test), and RAG with Simple Prompt, each offering different levels of contextual information to the LLMs.
Results show that in the most conservative scenario, AugmenTest's Extended Prompt consistently outperformed the Simple Prompt, achieving a success rate of 30% for generating correct assertions.
arXiv Detail & Related papers (2025-01-29T07:45:41Z)
- LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom LLMs to generate realistic test inputs.
LlamaRestTest surpasses state-of-the-art tools in code coverage and error detection, even with RESTGPT-enhanced specifications.
arXiv Detail & Related papers (2025-01-15T05:51:20Z)
- TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories.
It covers initial tests authoring, test suite completion, and code coverage improvements.
We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z)
- TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.833381226332574]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases.
We propose TestART, a novel unit test generation method.
TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
- CoverUp: Coverage-Guided LLM-Based Test Generation [0.7673339435080445]
CoverUp is a novel approach to driving the generation of high-coverage Python regression tests.
We evaluate our prototype CoverUp implementation across a benchmark of challenging code derived from open-source Python projects.
arXiv Detail & Related papers (2024-03-24T16:18:27Z)
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
- An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation [3.9762912548964864]
This paper presents a large-scale empirical evaluation on the effectiveness of Large Language Models for automated unit test generation.
We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package.
We find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests.
arXiv Detail & Related papers (2023-02-13T17:13:41Z)
- CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.