Automated Unit Test Improvement using Large Language Models at Meta
- URL: http://arxiv.org/abs/2402.09171v1
- Date: Wed, 14 Feb 2024 13:43:14 GMT
- Title: Automated Unit Test Improvement using Large Language Models at Meta
- Authors: Nadia Alshahwan, Jubin Chheda, Anastasia Finegenova, Beliz Gokkaya,
Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, Eddy Wang
- Abstract summary: This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests.
We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms.
- Score: 44.87533111512982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes Meta's TestGen-LLM tool, which uses LLMs to
automatically improve existing human-written tests. TestGen-LLM verifies that
its generated test classes successfully clear a set of filters that assure
measurable improvement over the original test suite, thereby eliminating
problems due to LLM hallucination. We describe the deployment of TestGen-LLM at
Meta test-a-thons for the Instagram and Facebook platforms. In an evaluation on
Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built
correctly, 57% passed reliably, and 25% increased coverage. During Meta's
Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which
it was applied, with 73% of its recommendations being accepted for production
deployment by Meta software engineers. We believe this is the first report on
industrial scale deployment of LLM-generated code backed by such assurances of
code improvement.
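As a concrete illustration of the assured-improvement idea described above, here is a minimal sketch of such a filter pipeline. The three predicates are hypothetical placeholders standing in for Meta's build, reliability, and coverage checks, not the actual TestGen-LLM implementation.

```python
from typing import Callable, Iterable, List

def filter_candidates(
    candidates: Iterable[str],
    builds: Callable[[str], bool],          # filter 1: generated test class compiles
    passes_reliably: Callable[[str], bool], # filter 2: passes on repeated runs (no flakiness)
    adds_coverage: Callable[[str], bool],   # filter 3: covers lines the original suite misses
) -> List[str]:
    """Keep only candidate test classes that clear every assurance filter."""
    kept: List[str] = []
    for test_class in candidates:
        if builds(test_class) and passes_reliably(test_class) and adds_coverage(test_class):
            kept.append(test_class)
    return kept

# toy usage: only the candidate passing all three checks survives
print(filter_candidates(
    ["class A {}", "class B {}"],
    builds=lambda s: True,
    passes_reliably=lambda s: "A" in s,
    adds_coverage=lambda s: True,
))  # ['class A {}']
```

Running the cheapest check (build) first and short-circuiting means most hallucinated candidates are rejected before any expensive coverage measurement.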
Related papers
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an LLM-driven automated penetration testing agent based on the PSM principle.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings [77.20838441870151]
Commit message generation is a crucial task in software engineering that is challenging to evaluate correctly.
We use an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments.
Our results indicate that edit distance exhibits the highest correlation, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation.
arXiv Detail & Related papers (2024-10-15T20:32:07Z)
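To make the edit-distance metric from the entry above concrete, the sketch below computes a standard character-level Levenshtein distance between a generated commit message and the message the user finally committed; the paper's exact edit-counting procedure may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca with cb
            ))
        prev = curr
    return prev[-1]

# e.g. edits between a generated and a committed message
print(levenshtein("fix lint errors", "fix lint errors in parser"))  # 10
```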
- TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories.
It covers initial test authoring, test suite completion, and code coverage improvement.
We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z)
- Improving LLM-based Unit test generation via Template-based Repair [8.22619177301814]
Unit testing is crucial for detecting bugs in individual program units, but it consumes time and effort.
Large language models (LLMs) have demonstrated remarkable reasoning and generation capabilities.
In this paper, we propose TestART, a novel unit test generation method.
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
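A plausible minimal reading of the template-based repair idea named in the entry above: run a generated test, and if it fails, apply fixed repair templates until one makes it pass. The templates and the try_run hook below are illustrative placeholders, not TestART's actual implementation.

```python
# A minimal generate-then-repair sketch in the spirit of TestART.
from typing import Callable, Optional

REPAIR_TEMPLATES = [
    # e.g. a template that adds a missing import to the test source
    lambda src: src if "import unittest" in src else "import unittest\n" + src,
    # further templates would fix other recurring compile/runtime error patterns
]

def repair_until_passing(test_source: str,
                         try_run: Callable[[str], bool]) -> Optional[str]:
    """Apply fix templates one at a time until the test runs, or give up."""
    if try_run(test_source):
        return test_source
    for template in REPAIR_TEMPLATES:
        patched = template(test_source)
        if patched != test_source and try_run(patched):
            return patched
    return None  # unrepairable candidates are discarded
```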
- CoverUp: Coverage-Guided LLM-Based Test Generation [0.7673339435080445]
CoverUp is a novel approach to driving the generation of high-coverage Python regression tests.
We show that CoverUp's iterative, coverage-guided approach is crucial to its effectiveness.
arXiv Detail & Related papers (2024-03-24T16:18:27Z)
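A hedged sketch of the coverage-guided iteration the CoverUp entry describes: measure which lines remain uncovered, request a test targeting them, and keep it only if coverage rises. The three callables are placeholders; a real implementation would obtain coverage from a tool such as coverage.py.

```python
from typing import Callable, List, Set

def coverage_guided_loop(module: str,
                         uncovered: Callable[[str, List[str]], Set[int]],
                         ask_llm: Callable[[str, Set[int]], str],
                         coverage_of: Callable[[List[str]], float],
                         max_rounds: int = 10) -> List[str]:
    """Iteratively grow a suite with tests that target uncovered lines."""
    suite: List[str] = []
    for _ in range(max_rounds):
        missing = uncovered(module, suite)
        if not missing:
            break                         # nothing left to cover
        candidate = ask_llm(module, missing)
        if coverage_of(suite + [candidate]) > coverage_of(suite):
            suite.append(candidate)       # keep only coverage-improving tests
    return suite
```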
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests carved from serialized observations of complex objects captured during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
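The carving idea in the entry above can be illustrated with a toy sketch: record a function's inputs and outputs during an end-to-end run, then replay the recorded input and assert the recorded output as a unit test. This is only an illustration; Meta's TestGen carves from serialized observations of complex objects in mobile apps.

```python
import functools
import json

OBSERVATIONS = []

def carve(fn):
    """Record each call's arguments and result as a serialized observation."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        OBSERVATIONS.append({"fn": fn.__name__,
                             "args": json.dumps(args),
                             "expected": json.dumps(result)})
        return result
    return wrapper

@carve
def normalize(tag: str) -> str:
    return tag.strip().lower()

normalize("  Reels ")  # observed during "app execution"

# A carved observation becomes a replayable unit-test assertion:
obs = OBSERVATIONS[0]
assert normalize(*json.loads(obs["args"])) == json.loads(obs["expected"])
```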
- Assured LLM-Based Software Engineering [51.003878077888686]
This paper outlines the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z)
- An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation [3.9762912548964864]
This paper presents a large-scale empirical evaluation on the effectiveness of Large Language Models for automated unit test generation.
We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package.
We find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests.
arXiv Detail & Related papers (2023-02-13T17:13:41Z)
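The similarity figure in the entry above presupposes some string-similarity measure between generated and existing tests; a simple sketch using Python's difflib is shown below, though the paper's exact metric may differ.

```python
import difflib

def similarity(generated: str, existing: str) -> float:
    """Ratio in [0, 1]; 1.0 means identical test source."""
    return difflib.SequenceMatcher(None, generated, existing).ratio()

gen = "assert add(2, 3) == 5"
old = "assert add(1, 1) == 2"
print(f"{similarity(gen, old):.2f}")  # high ratio: this pair would not count as novel
```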
- CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
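A minimal sketch of the selection step the CodeT entry describes: execute every candidate solution against the model-generated tests and prefer the one passing the most. The full method's dual execution agreement, which also groups solutions passing identical test sets, is omitted here.

```python
from typing import Callable, List

def choose_best(solutions: List[Callable], tests: List[Callable]) -> Callable:
    """Return the candidate solution that passes the most generated tests."""
    def score(sol: Callable) -> int:
        passed = 0
        for test in tests:
            try:
                test(sol)          # a test raises AssertionError on failure
                passed += 1
            except Exception:
                pass
        return passed
    return max(solutions, key=score)

# toy usage: two candidate implementations of absolute value
def test_neg(f): assert f(-2) == 2
def test_pos(f): assert f(3) == 3

best = choose_best([lambda x: x if x >= 0 else -x, lambda x: x],
                   [test_neg, test_pos])
print(best(-5))  # 5: the correct implementation wins
```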
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.