Generating executable oracles to check conformance of client code to requirements of JDK Javadocs using LLMs
- URL: http://arxiv.org/abs/2411.01789v2
- Date: Sat, 14 Dec 2024 17:19:41 GMT
- Title: Generating executable oracles to check conformance of client code to requirements of JDK Javadocs using LLMs
- Authors: Shan Jiang, Chenguang Zhu, Sarfraz Khurshid
- Abstract summary: This paper focuses on automation of test oracles for clients of widely used Java libraries, e.g., java.lang and java.util packages.
We use large language models as an enabling technology to embody our insight into a framework for test oracle automation.
- Score: 21.06722050714324
- License:
- Abstract: Software testing remains the most widely used methodology for validating the quality of code. However, the effectiveness of testing critically depends on the quality of the test suites used. Test cases in a test suite consist of two fundamental parts: (1) input values for the code under test, and (2) correct checks for the outputs it produces. These checks are commonly written as assertions, and termed test oracles. The last couple of decades have seen much progress in automated test input generation, e.g., using fuzzing and symbolic execution. However, automating test oracles remains a relatively less explored problem area. Indeed, a test oracle by its nature requires knowledge of expected behavior, which may only be known to the developer and may not exist in a formal language that supports automated reasoning. Our focus in this paper is automation of test oracles for clients of widely used Java libraries, e.g., the java.lang and java.util packages. Our key insight is that Javadocs, which provide a rich source of information, can enable automated generation of test oracles. Javadocs of the core Java libraries are fairly detailed documents that contain natural language descriptions of not only how the libraries behave but also how the clients must (not) use them. We use large language models as an enabling technology to embody our insight into a framework for test oracle automation, and evaluate it experimentally. Our experiments demonstrate that LLMs can generate oracles for checking normal and exceptional behaviors from Javadocs, with 98.8% of these oracles being compilable and 96.4% accurately reflecting intended properties. Even for the few incorrect oracles, errors are minor and can be easily corrected with the help of additional comment information generated by the LLMs.
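To make the idea concrete, here is a minimal, hedged sketch (not code from the paper) of what a Javadoc-derived oracle for client code could look like. The client method lastElement and the test class are invented for illustration; the contract being checked is the documented behavior of java.util.List#get(int), whose Javadoc states that an IndexOutOfBoundsException is thrown when the index is out of range (index < 0 || index >= size()).

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import java.util.ArrayList;
import java.util.List;
import org.junit.jupiter.api.Test;

class ListClientOracleTest {

    // A toy "client" method under test (hypothetical, not from the paper).
    static String lastElement(List<String> items) {
        return items.get(items.size() - 1);
    }

    @Test
    void normalBehaviorOracle() {
        List<String> items = new ArrayList<>(List.of("a", "b", "c"));
        // Oracle for normal behavior: get(size()-1) returns the last element.
        assertEquals("c", lastElement(items));
    }

    @Test
    void exceptionalBehaviorOracle() {
        // Oracle for exceptional behavior, taken from the Javadoc contract:
        // calling get on an empty list violates 0 <= index < size().
        assertThrows(IndexOutOfBoundsException.class,
                () -> lastElement(new ArrayList<>()));
    }
}
```

The first test encodes the normal behavior described in the Javadoc and the second the exceptional behavior; the paper's framework aims to generate such assertions automatically from the documentation.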
Related papers
- Doc2Oracle: Investigating the Impact of Javadoc Comments on Test Oracle Generation [0.716879432974126]
In Java, Javadoc comments provide structured, natural language documentation embedded directly in the source code.
We dive deep into investigating the impact of Javadoc comments on test oracle generation (TOG)
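As a reminder of what such embedded documentation looks like, below is a small, hypothetical Java class (not taken from the paper); the @param, @return, and @throws tags are the kind of structured hints a test-oracle generator can exploit.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative class showing the Javadoc structure that TOG tools consume. */
public class KeyValueStore {

    private final Map<String, String> store = new HashMap<>();

    /**
     * Returns the value stored under the given key.
     *
     * @param key the lookup key; must not be {@code null}
     * @return the stored value, or {@code null} if no mapping exists
     * @throws NullPointerException if {@code key} is {@code null}
     */
    public String lookup(String key) {
        if (key == null) {
            throw new NullPointerException("key must not be null");
        }
        return store.get(key);
    }

    /** Stores a value under the given key. */
    public void put(String key, String value) {
        store.put(key, value);
    }
}
```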
arXiv Detail & Related papers (2024-12-12T15:27:47Z)
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.
Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.
Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
- Do LLMs generate test oracles that capture the actual or the expected program behaviour? [7.772338538073763]
Large Language Models (LLMs) are trained on an enormous amount of data to generate developer-like code and test cases.
This study includes developer-written and automatically generated test cases and oracles for 24 open-source Java repositories.
LLMs are better at generating test oracles than at classifying correct ones, and they generate better oracles when the code contains meaningful test or variable names.
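To illustrate the distinction studied here, the following hypothetical sketch (not from the paper) contrasts the two kinds of oracles: a regression-style assertion that mirrors the code's actual behavior, and a specification-based assertion that encodes the expected behavior and therefore exposes the bug.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class OracleBehaviorExample {

    // Hypothetical method under test: the documented intent is to return
    // the absolute value, but the implementation has a bug.
    static int absoluteValue(int x) {
        // Bug: negative inputs should be negated, but are returned as-is.
        return x;
    }

    @Test
    void oracleCapturingActualBehavior() {
        // A regression-style oracle mirrors what the code currently does,
        // so it passes even though the implementation is wrong.
        assertEquals(-5, absoluteValue(-5));
    }

    @Test
    void oracleCapturingExpectedBehavior() {
        // A specification-based oracle encodes the documented intent and
        // therefore fails on the buggy implementation, exposing the defect.
        assertEquals(5, absoluteValue(-5));
    }
}
```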
arXiv Detail & Related papers (2024-10-28T15:37:06Z)
- ASTER: Natural and Multi-language Unit Test Generation with LLMs [6.259245181881262]
We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases.
We conduct an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness.
arXiv Detail & Related papers (2024-09-04T21:46:18Z)
- Test Oracle Automation in the era of LLMs [52.69509240442899]
Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks.
This paper aims to enable discussions on the potential of using LLMs for test oracle automation, along with the challenges that may emerge during the generation of various types of oracles.
arXiv Detail & Related papers (2024-05-21T13:19:10Z)
- Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training [55.321010757641524]
A major public concern regarding the training of large language models (LLMs) is whether they abuse copyrighted online text.
Previous membership inference methods may be misled by similar examples in vast amounts of training data.
We propose an alternative insert-and-detection methodology, advocating that web users and content platforms employ unique identifiers.
arXiv Detail & Related papers (2024-03-23T06:36:32Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Prompting Code Interpreter to Write Better Unit Tests on Quixbugs Functions [0.05657375260432172]
Unit testing is a commonly used approach in software engineering to test the correctness and robustness of written code.
In this study, we explore the effect of different prompts on the quality of unit tests generated by Code Interpreter.
We find that the quality of the generated unit tests is not sensitive to changes in minor details in the prompts provided.
arXiv Detail & Related papers (2023-09-30T20:36:23Z)
- ALGO: Synthesizing Algorithmic Programs with LLM-Generated Oracle Verifiers [60.6418431624873]
Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems.
We propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the generation and verify their correctness.
Experiments show that when equipped with ALGO, we achieve an 8x better one-submission pass rate over the Codex model and a 2.6x better one-submission pass rate over CodeT.
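The underlying mechanism can be pictured with a hedged sketch (the candidate/oracle pair below is an invented example, not ALGO's implementation): a slow but obviously correct reference oracle, of the kind an LLM might generate, cross-checks a candidate solution on random inputs before the candidate is accepted.

```java
import java.util.Random;

/** Minimal sketch of oracle-guided verification with a brute-force reference. */
public class OracleGuidedCheck {

    // Candidate solution, e.g. produced by an LLM: max subarray sum (Kadane).
    static long candidate(int[] a) {
        long best = a[0], cur = a[0];
        for (int i = 1; i < a.length; i++) {
            cur = Math.max(a[i], cur + a[i]);
            best = Math.max(best, cur);
        }
        return best;
    }

    // Reference oracle: exhaustive O(n^2) enumeration of all subarrays.
    static long oracle(int[] a) {
        long best = Long.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            long sum = 0;
            for (int j = i; j < a.length; j++) {
                sum += a[j];
                best = Math.max(best, sum);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int trial = 0; trial < 1000; trial++) {
            int[] a = rnd.ints(1 + rnd.nextInt(20), -50, 50).toArray();
            if (candidate(a) != oracle(a)) {
                System.out.println("Counterexample found; reject the candidate.");
                return;
            }
        }
        System.out.println("Candidate agrees with the oracle on all random tests.");
    }
}
```

A mismatch on any input yields a concrete counterexample that can be fed back to guide regeneration of the candidate.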
arXiv Detail & Related papers (2023-05-24T00:10:15Z)
- Perfect is the enemy of test oracle [1.457696018869121]
Test oracles rely on a ground-truth that can distinguish between the correct and buggy behavior to determine whether a test fails (detects a bug) or passes.
This paper presents SEER, a learning-based approach that, in the absence of test assertions, can determine whether a unit test passes or fails on a given method under test (MUT).
Our experiments on applying SEER to more than 5K unit tests from a diverse set of open-source Java projects show that the produced oracle is effective in predicting the fail or pass labels.
arXiv Detail & Related papers (2023-02-03T01:49:33Z)
- Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z)