Related papers: PyTester: Deep Reinforcement Learning for Text-to-Testcase Generation

URL: http://arxiv.org/abs/2401.07576v2
Date: Fri, 22 Nov 2024 06:42:56 GMT
Title: PyTester: Deep Reinforcement Learning for Text-to-Testcase Generation
Authors: Wannita Takerngsaksiri, Rujikorn Charakorn, Chakkrit Tantithamthavorn, Yuan-Fang Li
Abstract summary: Test-driven development (TDD) mandates writing test cases based on requirements before writing the actual code. While writing test cases is the centerpiece of TDD, it is time-consuming, expensive, and often shunned by developers. We introduce PyTester, a Text-to-Testcase generation approach that can automatically generate correct, executable, complete, and effective test cases.
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Test-driven development (TDD) is a widely-employed software development practice that mandates writing test cases based on requirements before writing the actual code. While writing test cases is the centerpiece of TDD, it is time-consuming, expensive, and often shunned by developers. To address these issues associated with TDD, automated test case generation approaches have recently been investigated. Such approaches take source code as input, but not the requirements. Therefore, existing work does not fully support true TDD, as actual code is required to generate test cases. In addition, current deep learning-based test case generation approaches are trained with one learning objective, i.e., to generate test cases that are exactly matched with the ground-truth test cases. However, such approaches may limit the model's ability to generate different yet correct test cases. In this paper, we introduce PyTester, a Text-to-Testcase generation approach that can automatically generate syntactically correct, executable, complete, and effective test cases while being aligned with a given natural language requirement. We evaluate PyTester on the public APPS benchmark dataset, and the results show that our Deep RL approach enables PyTester, a small language model, to outperform much larger language models like GPT3.5, StarCoder, and InCoder. Our findings suggest that future research could consider improving small over large LMs for better resource efficiency by integrating the SE domain knowledge into the design of reinforcement learning architecture.
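
As a rough illustration of the properties named in the abstract, the sketch below checks a generated test case for syntactic validity and executability against a reference solution. It is not the paper's reward function or training setup. The function name `check_test_case`, the flag names, and the assumption that a test case is a string of Python `assert` statements are all hypothetical choices made for this example.

```python
import ast


def check_test_case(test_src: str, solution_src: str) -> dict:
    """Return coarse quality flags for one generated test case.

    Illustrative only: the flag names and checks are assumptions,
    not the criteria defined in the PyTester paper.
    """
    flags = {"syntax_ok": False, "has_assert": False, "passes": False}

    # Syntactic validity: does the test parse as Python at all?
    try:
        tree = ast.parse(test_src)
    except SyntaxError:
        return flags
    flags["syntax_ok"] = True

    # Crude completeness proxy: the test contains at least one assert statement.
    flags["has_assert"] = any(isinstance(node, ast.Assert) for node in ast.walk(tree))

    # Executability: run the reference solution, then the test, in one namespace.
    namespace: dict = {}
    try:
        exec(solution_src, namespace)
        exec(compile(tree, "<generated-test>", "exec"), namespace)
        flags["passes"] = True
    except Exception:
        flags["passes"] = False
    return flags


solution = "def add(a, b):\n    return a + b\n"
good = check_test_case("assert add(1, 2) == 3", solution)
broken = check_test_case("assert add(1, 2) ==", solution)
wrong = check_test_case("assert add(1, 2) == 4", solution)
```

Running untrusted generated code with `exec` is unsafe outside a sandbox; a real pipeline would isolate execution and enforce timeouts.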

Related papers

LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework [2.501198441875755]
AgoneTest is an evaluation framework for Large Language Model-generated unit tests in Java. For the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection.
arXiv Detail & Related papers (2025-11-25T15:33:00Z)
ATGen: Adversarial Reinforcement Learning for Test Case Generation [78.48498301767079]
Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs. Existing test generation methods rely on static datasets. We introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning.
arXiv Detail & Related papers (2025-10-16T12:49:25Z)
CodeChemist: Functional Knowledge Transfer for Low-Resource Code Generation via Test-Time Scaling [63.08126845138046]
We present CodeChemist, a framework for test-time scaling that enables functional knowledge transfer from high-resource to low-resource PLs. Our experiments show that CodeChemist outperforms existing test-time scaling approaches.
arXiv Detail & Related papers (2025-10-01T04:33:53Z)
Alignment with Fill-In-the-Middle for Enhancing Code Generation [56.791415642365415]
We propose a novel approach that splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Our approach demonstrates significant improvements in code generation tasks, as validated by experiments on benchmark datasets such as HumanEval (+), MBPP (+), APPS, LiveCodeBench, and BigCodeBench.
arXiv Detail & Related papers (2025-08-27T03:15:53Z)
Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z)
LLM-based Unit Test Generation for Dynamically-Typed Programs [16.38145000434927]
TypeTest is a novel framework that enhances type correctness in test generation through a vector-based Retrieval-Augmented Generation system. In an evaluation on 125 real-world Python modules, TypeTest achieved an average statement coverage of 86.6% and branch coverage of 76.8%, outperforming state-of-the-art tools by 5.4% and 9.3%, respectively.
arXiv Detail & Related papers (2025-03-18T08:07:17Z)
CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom Large Language Models (LLMs) to generate realistic test inputs. We evaluate it against several state-of-the-art REST API testing tools, including RESTGPT, a GPT-powered specification-enhancement tool. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing.
arXiv Detail & Related papers (2025-01-15T05:51:20Z)
TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved? [11.762669773233474]
Test-driven development (TDD) is the practice of writing tests first and coding later. This paper introduces TDD-Bench Verified, a high-quality benchmark suite of 449 issues mined from real-world GitHub code repositories.
arXiv Detail & Related papers (2024-12-03T22:38:05Z)
Multi-language Unit Test Generation using LLMs [6.259245181881262]
We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We show how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved.
arXiv Detail & Related papers (2024-09-04T21:46:18Z)
KAT: Dependency-aware Automated API Testing with Large Language Models [1.7264233311359707]
KAT (Katalon API Testing) is a novel AI-driven approach that autonomously generates test cases to validate APIs. Our evaluation of KAT using 12 real-world services shows that it can improve validation coverage, detect more undocumented status codes, and reduce false positives in these services.
arXiv Detail & Related papers (2024-07-14T14:48:18Z)
Test-Driven Development for Code Generation [0.850206009406913]
Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements. This paper investigates if and how Test-Driven Development (TDD) can be incorporated into AI-assisted code-generation processes.
arXiv Detail & Related papers (2024-02-21T04:10:12Z)
Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
CAT-LM: Training Language Models on Aligned Code And Tests [19.526181671936243]
Testing is an integral part of the software development process. Yet, writing tests is time-consuming and therefore often neglected. We propose the Aligned Code And Tests Language Model (CAT-LM), a GPT-style language model with 2.7 Billion parameters, trained on a corpus of Python and Java projects.
arXiv Detail & Related papers (2023-10-02T19:52:22Z)
AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [64.9230895853942]
Domain generalization can be arbitrarily hard without exploiting target domain information. Test-time adaptive (TTA) methods are proposed to address this issue. In this work, we adopt a non-parametric classifier to perform test-time adaptation (AdaNPC).
arXiv Detail & Related papers (2023-04-25T04:23:13Z)
Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation. We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
Learning Deep Semantics for Test Completion [46.842174440120196]
We formalize the novel task of test completion to automatically complete the next statement in a test method based on the context of prior statements and the code under test. We develop TeCo -- a deep learning model using code semantics for test completion.
arXiv Detail & Related papers (2023-02-20T18:53:56Z)
TeST: Test-time Self-Training under Distribution Shift [99.68465267994783]
Test-Time Self-Training (TeST) is a technique that takes as input a model trained on some source data and a novel data distribution at test time. We find that models adapted using TeST significantly improve over baseline test-time adaptation algorithms.
arXiv Detail & Related papers (2022-09-23T07:47:33Z)
CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases. CodeT executes the code solutions using the generated test cases, and then chooses the best solution. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers [10.846226514357866]
Unit testing represents the foundational basis of the software testing pyramid. We present an approach to support developers in writing unit test cases by generating accurate and useful assert statements.
arXiv Detail & Related papers (2020-09-11T19:35:09Z)