Using Large Language Models to Generate JUnit Tests: An Empirical Study
- URL: http://arxiv.org/abs/2305.00418v4
- Date: Sat, 9 Mar 2024 00:59:18 GMT
- Title: Using Large Language Models to Generate JUnit Tests: An Empirical Study
- Authors: Mohammed Latif Siddiq, Joanna C. S. Santos, Ridwanul Hasan Tanvir,
Noshin Ulfat, Fahmid Al Rifat, Vinicius Carvalho Lopes
- Abstract summary: A code generation model generates code by taking a prompt from a code comment, existing code, or a combination of both.
We investigated how well three models (Codex, GPT-3.5-Turbo, and StarCoder) can generate unit tests.
We found that the Codex model achieved above 80% coverage for the HumanEval dataset, but no model had more than 2% coverage for the EvoSuite SF110 benchmark.
- Score: 0.4788487793976782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A code generation model generates code by taking a prompt from a code
comment, existing code, or a combination of both. Although code generation
models (e.g., GitHub Copilot) are increasingly being adopted in practice, it is
unclear whether they can successfully be used for unit test generation without
fine-tuning for a strongly typed language like Java. To fill this gap, we
investigated how well three models (Codex, GPT-3.5-Turbo, and StarCoder) can
generate unit tests. We used two benchmarks (HumanEval and EvoSuite SF110) to
investigate the effect of context generation on the unit test generation
process. We evaluated the models based on compilation rates, test correctness,
test coverage, and test smells. We found that the Codex model achieved above
80% coverage for the HumanEval dataset, but no model had more than 2% coverage
for the EvoSuite SF110 benchmark. The generated tests also suffered from test
smells, such as Duplicated Asserts and Empty Tests.
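To make the two reported smells concrete, here is a minimal JUnit 5 sketch; the Calculator class under test is hypothetical and serves only to illustrate the patterns the study observed.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Hypothetical class under test, used only to illustrate the smells.
class Calculator {
    int add(int a, int b) { return a + b; }
}

class CalculatorTest {

    // Duplicated Assert smell: the same assertion is repeated verbatim,
    // adding no extra checking power.
    @Test
    void testAddDuplicatedAssert() {
        Calculator calc = new Calculator();
        assertEquals(4, calc.add(2, 2));
        assertEquals(4, calc.add(2, 2)); // duplicated assertion
    }

    // Empty Test smell: the test body contains no statements, so it
    // always passes and verifies nothing.
    @Test
    void testAddEmpty() {
    }
}
```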
Related papers
- TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories.
It covers initial test authoring, test suite completion, and code coverage improvements.
We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z)
- CasModaTest: A Cascaded and Model-agnostic Self-directed Framework for Unit Test Generation [5.450831103980871]
CasModaTest is a cascaded, model-agnostic, and end-to-end unit test generation framework.
It generates test prefixes and test oracles and compiles or executes them to check their effectiveness.
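A small JUnit sketch of what "test prefix" versus "test oracle" means in this context; the Greeter class is an illustrative assumption, not taken from the paper.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class GreeterTest {

    // Hypothetical class under test.
    static class Greeter {
        String greet(String name) { return "Hello, " + name + "!"; }
    }

    @Test
    void testGreet() {
        // Test prefix: set up inputs and invoke the focal method.
        Greeter greeter = new Greeter();
        String actual = greeter.greet("Ada");

        // Test oracle: the assertion that decides whether the test passes.
        assertEquals("Hello, Ada!", actual);
    }
}
```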
arXiv Detail & Related papers (2024-06-22T05:52:39Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
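A rough sketch of the carving idea: state observed during app execution is replayed in a generated unit test whose assertion encodes the observed output. The ShoppingCart class and the captured values below are illustrative assumptions, not Meta's implementation.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.ArrayList;
import java.util.List;
import org.junit.jupiter.api.Test;

class CarvedShoppingCartTest {

    // Hypothetical class whose state was observed during app execution.
    static class ShoppingCart {
        private final List<Double> prices = new ArrayList<>();
        void addItem(double price) { prices.add(price); }
        double total() { return prices.stream().mapToDouble(Double::doubleValue).sum(); }
    }

    @Test
    void totalMatchesObservedValue() {
        // Replay the serialized observation: the same items that were added
        // in the observed execution.
        ShoppingCart cart = new ShoppingCart();
        cart.addItem(19.99);
        cart.addItem(5.01);

        // Assert the output value that was recorded at observation time.
        assertEquals(25.00, cart.total(), 1e-9);
    }
}
```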
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
- PyTester: Deep Reinforcement Learning for Text-to-Testcase Generation [20.441921569948562]
Test-driven development (TDD) mandates writing test cases based on requirements before writing the actual code.
While writing test cases is the centerpiece of TDD, it is time-consuming, expensive, and often shunned by developers.
We introduce PyTester, a Text-to-Testcase generation approach that can automatically generate correct, executable, complete, and effective test cases.
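As a reminder of what requirement-to-test-case generation looks like under TDD (PyTester itself targets Python; the Java sketch and the leap-year requirement below are purely illustrative):

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

// Requirement (text): "isLeapYear(y) is true for years divisible by 4,
// except century years, which must also be divisible by 400."
// Under TDD this test is written first, from the requirement alone.
class LeapYearTest {

    @Test
    void followsLeapYearRules() {
        assertFalse(LeapYear.isLeapYear(1900)); // century year, not divisible by 400
        assertTrue(LeapYear.isLeapYear(2000));  // century year, divisible by 400
        assertTrue(LeapYear.isLeapYear(2024));  // divisible by 4, not a century
        assertFalse(LeapYear.isLeapYear(2023)); // not divisible by 4
    }
}

// Minimal implementation, included only so the sketch compiles and runs;
// in TDD it would be written after the test above.
class LeapYear {
    static boolean isLeapYear(int y) {
        return y % 4 == 0 && (y % 100 != 0 || y % 400 == 0);
    }
}
```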
arXiv Detail & Related papers (2024-01-15T10:21:58Z)
- TestSpark: IntelliJ IDEA's Ultimate Test Generation Companion [15.13443954421825]
This paper introduces TestSpark, a plugin for IntelliJ IDEA that enables users to generate unit tests with only a few clicks.
TestSpark also allows users to easily modify and run each generated test and integrate them into the project workflow.
arXiv Detail & Related papers (2024-01-12T13:53:57Z)
- REST: Retrieval-Based Speculative Decoding [69.06115086237207]
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation.
Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens.
When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on code or text generation.
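A toy sketch of the retrieve-then-verify idea: draft tokens come from a datastore lookup rather than a draft model, and only the prefix the target model agrees with is accepted. Everything below (datastore, tokens, stand-in model) is an illustrative assumption, not REST's actual implementation, which verifies all draft tokens in a single batched forward pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy illustration of retrieval-based speculative decoding.
public class RestSketch {

    public static void main(String[] args) {
        // Hypothetical datastore mapping the last context token to a cached continuation.
        Map<String, List<String>> datastore = Map.of(
                "unit", List.of("tests", "are", "generated"),
                "code", List.of("coverage", "is", "measured"));

        // Stand-in for the target model's greedy next-token choice.
        Function<List<String>, String> targetModel = ctx -> {
            String last = ctx.get(ctx.size() - 1);
            if (last.equals("unit")) return "tests";
            if (last.equals("tests")) return "were"; // disagrees with the retrieved draft here
            return "<eos>";
        };

        List<String> context = new ArrayList<>(List.of("the", "unit"));

        // 1. Retrieve a draft continuation for the current context.
        List<String> draft = datastore.getOrDefault(context.get(context.size() - 1), List.of());

        // 2. Verify: accept draft tokens only while the target model produces the
        //    same token (sequential here for clarity only).
        for (String drafted : draft) {
            if (!targetModel.apply(context).equals(drafted)) break;
            context.add(drafted);
        }
        System.out.println(context); // [the, unit, tests]
    }
}
```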
arXiv Detail & Related papers (2023-11-14T15:43:47Z)
- CAT-LM: Training Language Models on Aligned Code And Tests [19.526181671936243]
Testing is an integral part of the software development process. Yet, writing tests is time-consuming and therefore often neglected.
We propose the Aligned Code And Tests Language Model (CAT-LM), a GPT-style language model with 2.7 billion parameters, trained on a corpus of Python and Java projects.
arXiv Detail & Related papers (2023-10-02T19:52:22Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
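A simplified sketch of the selection step: run every candidate solution against the generated tests and keep the best-scoring one. CodeT's full ranking also weighs agreement among candidates; the candidates and tests below are stand-ins.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.IntUnaryOperator;
import java.util.function.Predicate;

// Simplified selection: execute each candidate solution against the generated
// tests and keep the candidate that passes the most of them.
public class CodeTSelectionSketch {

    record Candidate(String name, IntUnaryOperator impl) {}

    public static void main(String[] args) {
        // Hypothetical candidate solutions for "return the absolute value of x".
        List<Candidate> candidates = List.of(
                new Candidate("buggy", x -> x),                  // wrong for negative x
                new Candidate("plausible", x -> x < 0 ? -x : x));

        // Hypothetical generated tests: each checks one input/output pair.
        List<Predicate<IntUnaryOperator>> generatedTests = List.of(
                f -> f.applyAsInt(3) == 3,
                f -> f.applyAsInt(-3) == 3,
                f -> f.applyAsInt(0) == 0);

        // Score each candidate by the number of generated tests it passes.
        Candidate best = candidates.stream()
                .max(Comparator.comparingLong((Candidate c) ->
                        generatedTests.stream().filter(t -> t.test(c.impl())).count()))
                .orElseThrow();

        System.out.println("Selected: " + best.name()); // Selected: plausible
    }
}
```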
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
- InCoder: A Generative Model for Code Infilling and Synthesis [88.46061996766348]
We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) and editing (via infilling).
InCoder is trained to generate code files from a large corpus of permissively licensed code.
Our model is the first generative model that is able to directly perform zero-shot code infilling.
arXiv Detail & Related papers (2022-04-12T16:25:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.