Automated Unit Test Improvement using Large Language Models at Meta
- URL: http://arxiv.org/abs/2402.09171v1
- Date: Wed, 14 Feb 2024 13:43:14 GMT
- Title: Automated Unit Test Improvement using Large Language Models at Meta
- Authors: Nadia Alshahwan, Jubin Chheda, Anastasia Finegenova, Beliz Gokkaya,
Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, Eddy Wang
- Abstract summary: This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests.
We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms.
- Score: 44.87533111512982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes Meta's TestGen-LLM tool, which uses LLMs to
automatically improve existing human-written tests. TestGen-LLM verifies that
its generated test classes successfully clear a set of filters that assure
measurable improvement over the original test suite, thereby eliminating
problems due to LLM hallucination. We describe the deployment of TestGen-LLM at
Meta test-a-thons for the Instagram and Facebook platforms. In an evaluation on
Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built
correctly, 57% passed reliably, and 25% increased coverage. During Meta's
Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which
it was applied, with 73% of its recommendations being accepted for production
deployment by Meta software engineers. We believe this is the first report on
industrial scale deployment of LLM-generated code backed by such assurances of
code improvement.
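As a concrete illustration of the filtering approach described in the abstract, the sketch below shows one way such a pipeline could be structured in Python. It is a minimal sketch under stated assumptions: the callback names (builds, passes, coverage) and the repeated-run flakiness check are hypothetical and are not Meta's actual TestGen-LLM implementation.

```python
from typing import Callable, List

def clears_filters(
    candidate: str,                 # source of an LLM-generated test class (assumed input)
    original_suite: List[str],      # existing human-written test classes
    builds: Callable[[str], bool],           # hypothetical: does the class build correctly?
    passes: Callable[[str], bool],           # hypothetical: do all of its tests pass?
    coverage: Callable[[List[str]], float],  # hypothetical: coverage achieved by a suite
    runs: int = 5,                  # assumed number of repeat runs to screen out flaky tests
) -> bool:
    """Accept a generated test class only if it measurably improves the suite."""
    # Filter 1: the generated test class must build correctly.
    if not builds(candidate):
        return False
    # Filter 2: its tests must pass reliably across repeated runs.
    if not all(passes(candidate) for _ in range(runs)):
        return False
    # Filter 3: adding it must increase coverage over the original suite.
    return coverage(original_suite + [candidate]) > coverage(original_suite)
```

Only candidates that clear every filter are recommended to engineers for review, which is how non-building, flaky, or non-improving (hallucinated) tests are screened out.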
Related papers
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
- Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z)
- Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM [32.44432906540792]
We present SymPrompt, a code-aware prompting strategy for large language models in test generation.
SymPrompt improves the ratio of correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2.
Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
arXiv Detail & Related papers (2024-01-31T18:21:49Z)
- Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation [13.658632458850144]
Large Language Models (LLMs) have gained popularity for code generation, including the automated creation of test cases.
LLMs are often trained on vast amounts of publicly available code, which may include test cases that do not adhere to best practices.
We propose a novel technique called Reinforcement Learning from Static Quality Metrics (RLSQM).
arXiv Detail & Related papers (2023-10-03T18:48:31Z)
- Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases even in the absence of natural language descriptions of the Programs Under Test (PUTs).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation [3.9762912548964864]
This paper presents a large-scale empirical evaluation on the effectiveness of Large Language Models for automated unit test generation.
We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package.
We find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests.
arXiv Detail & Related papers (2023-02-13T17:13:41Z)
- CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution; a simplified sketch of this selection step appears after this list.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
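As with the earlier sketch, the following is a simplified, hypothetical illustration of CodeT-style selection: candidate solutions and LLM-generated tests are assumed to be given, and passes(solution, test) is an assumed callback that executes one generated test against one candidate solution. It approximates, rather than reproduces, the paper's dual-execution-agreement scoring.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List

def pick_best_solution(
    solutions: List[str],                 # candidate code solutions (assumed given)
    generated_tests: List[str],           # LLM-generated test cases (assumed given)
    passes: Callable[[str, str], bool],   # hypothetical: run one test against one solution
) -> str:
    """Group solutions by the exact set of generated tests they pass, then rank
    groups by (#solutions in group) x (#tests passed) and return one top candidate."""
    groups: Dict[FrozenSet[int], List[str]] = defaultdict(list)
    for solution in solutions:
        passed = frozenset(
            i for i, test in enumerate(generated_tests) if passes(solution, test)
        )
        groups[passed].append(solution)
    # The best consensus group is the one where many solutions agree on passing many tests.
    best = max(groups, key=lambda tests: len(groups[tests]) * len(tests))
    return groups[best][0]  # any member of the highest-scoring group
```

The intuition is that a solution is chosen when many independently generated solutions agree with it on many generated tests, which is the sense in which CodeT "chooses the best solution" above.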
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.