Related papers: Prompting Code Interpreter to Write Better Unit Tests on Quixbugs Functions

Prompting Code Interpreter to Write Better Unit Tests on Quixbugs Functions

URL: http://arxiv.org/abs/2310.00483v1
Date: Sat, 30 Sep 2023 20:36:23 GMT
Title: Prompting Code Interpreter to Write Better Unit Tests on Quixbugs Functions
Authors: Vincent Li, Nick Doiron
Abstract summary: Unit testing is a commonly-used approach in software engineering to test the correctness and robustness of written code. In this study, we explore the effect of different prompts on the quality of unit tests generated by Code Interpreter. We find that the quality of the generated unit tests is not sensitive to changes in minor details in the prompts provided.
Score: 0.05657375260432172
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Unit testing is a commonly-used approach in software engineering to test the correctness and robustness of written code. Unit tests are tests designed to test small components of a codebase in isolation, such as an individual function or method. Although unit tests have historically been written by human programmers, recent advancements in AI, particularly LLMs, have shown corresponding advances in automatic unit test generation. In this study, we explore the effect of different prompts on the quality of unit tests generated by Code Interpreter, a GPT-4-based LLM, on Python functions provided by the Quixbugs dataset, and we focus on prompting due to the ease with which users can make use of our findings and observations. We find that the quality of the generated unit tests is not sensitive to changes in minor details in the prompts provided. However, we observe that Code Interpreter is often able to effectively identify and correct mistakes in code that it writes, suggesting that providing it runnable code to check the correctness of its outputs would be beneficial, even though we find that it is already often able to generate correctly-formatted unit tests. Our findings suggest that, when prompting models similar to Code Interpreter, it is important to include the basic information necessary to generate unit tests, but minor details are not as important.

Related papers

Automated Test Generation from Program Documentation Encoded in Code Comments [4.696083734269232]
This paper introduces a novel test generation technique that exploits the code-comment documentation constructively. We deliver test cases with names and oracles properly contextualized on the target behaviors.
arXiv Detail & Related papers (2025-04-29T20:23:56Z)
Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs) We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs. We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z)
ViUniT: Visual Unit Tests for More Robust Visual Programming [104.55763189099125]
When models answer correctly, they produce incorrect programs 33% of the time. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests.
arXiv Detail & Related papers (2024-12-12T01:36:18Z)
Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
Multi-language Unit Test Generation using LLMs [6.259245181881262]
We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We show how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved.
arXiv Detail & Related papers (2024-09-04T21:46:18Z)
Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases. We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
arXiv Detail & Related papers (2024-08-21T15:35:34Z)
Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training [55.321010757641524]
A major public concern regarding the training of large language models (LLMs) is whether they abusing copyrighted online text. Previous membership inference methods may be misled by similar examples in vast amounts of training data. We propose an alternative textitinsert-and-detection methodology, advocating that web users and content platforms employ textbftextitunique identifiers.
arXiv Detail & Related papers (2024-03-23T06:36:32Z)
GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data. We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch. Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
Unit Test Generation using Generative AI : A Comparative Performance Analysis of Autogeneration Tools [2.0686733932673604]
This research aims to experimentally investigate the effectiveness of Large Language Models (LLMs) for generating unit test scripts for Python programs. For experiments, we consider three types of code units: 1) Procedural scripts, 2) Function-based modular code, and 3) Class-based code. Our results show that ChatGPT's performance is comparable with Pynguin in terms of coverage, though for some cases its performance is superior to Pynguin.
arXiv Detail & Related papers (2023-12-17T06:38:11Z)
CAT-LM: Training Language Models on Aligned Code And Tests [19.526181671936243]
Testing is an integral part of the software development process. Yet, writing tests is time-consuming and therefore often neglected. We propose the Aligned Code And Tests Language Model (CAT-LM), a GPT-style language model with 2.7 Billion parameters, trained on a corpus of Python and Java projects.
arXiv Detail & Related papers (2023-10-02T19:52:22Z)
Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation. We propose Self- Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases. CodeT executes the code solutions using the generated test cases, and then chooses the best solution. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
Automated Support for Unit Test Generation: A Tutorial Book Chapter [21.716667622896193]
Unit testing is a stage of testing where the smallest segment of code that can be tested in isolation from the rest of the system is tested. Unit tests are typically written as executable code, often in a format provided by a unit testing framework such as pytest for Python. This chapter introduces the concept of search-based unit test generation.
arXiv Detail & Related papers (2021-10-26T11:13:40Z)
Auditing AI models for Verified Deployment under Semantic Specifications [65.12401653917838]
AuditAI bridges the gap between interpretable formal verification and scalability. We show how AuditAI allows us to obtain controlled variations for verification and certified training while addressing the limitations of verifying using only pixel-space perturbations.
arXiv Detail & Related papers (2021-09-25T22:53:24Z)
Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation. In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.