DeCon: Detecting Incorrect Assertions via Postconditions Generated by a Large Language Model
- URL: http://arxiv.org/abs/2501.02901v1
- Date: Mon, 06 Jan 2025 10:25:28 GMT
- Title: DeCon: Detecting Incorrect Assertions via Postconditions Generated by a Large Language Model
- Authors: Hao Yu, Tianyu Chen, Jiaming Huang, Zongyang Li, Dezhi Ran, Xinyu Wang, Ying Li, Assaf Marron, David Harel, Yuan Xie, Tao Xie
- Abstract summary: We propose a new approach named DeCon to effectively detect incorrect assertions via LLM-generated postconditions for the target problem. DeCon can detect, on average, more than 64% (63% and 65.5% detected by GPT-3.5 and GPT-4, respectively) of the incorrect assertions generated by four state-of-the-art LLMs.
- Score: 22.38753408614465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, given the docstring for the target problem and the target function signature, large language models (LLMs) have been used not only to generate source code, but also to generate test cases, consisting of test inputs and assertions (e.g., in the form of checking an actual output against the expected output). However, as shown by our empirical study on assertions generated by four LLMs for the HumanEval benchmark, over 62% of the generated assertions are incorrect (i.e., they fail on the ground-truth problem solution). To detect incorrect assertions (given the docstring and the target function signature along with a sample of example inputs and outputs), in this paper, we propose a new approach named DeCon to effectively detect incorrect assertions via LLM-generated postconditions for the target problem (a postcondition is a predicate that must always be true just after the execution of the ground-truth problem solution). Our approach requires a small set of I/O examples (i.e., a sample of example inputs and outputs) for the target problem (e.g., the I/O examples included in the docstring for a target problem in HumanEval). We use the given I/O examples to filter out those LLM-generated postconditions that are violated by at least one given I/O example. We then use the remaining postconditions to detect incorrect assertions as those assertions that violate at least one remaining postcondition. Experimental results show that DeCon can detect, on average, more than 64% (63% and 65.5% when using GPT-3.5 and GPT-4, respectively) of the incorrect assertions generated by four state-of-the-art LLMs, and DeCon can also improve the effectiveness of these LLMs in code generation by 4% in terms of Pass@1. In addition, although DeCon might filter out correct assertions, the fault-finding ability of the remaining correct assertions decreases only slightly.
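The two-step procedure described in the abstract (filter postconditions against the given I/O examples, then flag assertions that violate a surviving postcondition) can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' implementation; the names (`holds`, `filter_postconditions`, `detect_incorrect_assertions`) and the sorting example are hypothetical.

```python
def holds(postcondition, inputs, output):
    """True if the postcondition predicate accepts this (inputs, output) pair."""
    try:
        return bool(postcondition(inputs, output))
    except Exception:
        # A postcondition that raises on a valid example is treated as violated.
        return False


def filter_postconditions(postconditions, io_examples):
    """Step 1: discard LLM-generated postconditions violated by any given I/O example."""
    return [p for p in postconditions
            if all(holds(p, inp, out) for inp, out in io_examples)]


def detect_incorrect_assertions(assertions, surviving_postconditions):
    """Step 2: flag an assertion (input, expected_output) that violates any surviving postcondition."""
    return [(inp, expected) for inp, expected in assertions
            if any(not holds(p, inp, expected) for p in surviving_postconditions)]


# Hypothetical usage for a "sort a list in ascending order" problem:
postconditions = [
    lambda inp, out: list(out) == sorted(out),    # output must be sorted
    lambda inp, out: sorted(inp) == sorted(out),  # output is a permutation of the input
]
io_examples = [([3, 1, 2], [1, 2, 3])]
kept = filter_postconditions(postconditions, io_examples)
print(detect_incorrect_assertions([([3, 1, 2], [3, 2, 1])], kept))  # -> [([3, 1, 2], [3, 2, 1])]
```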
Related papers
- Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs).
We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs.
We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z)
- VALTEST: Automated Validation of Language Model Generated Test Cases [0.7059472280274008]
Large Language Models (LLMs) have demonstrated significant potential in automating software testing, specifically in generating unit test cases.
This paper introduces VALTEST, a novel framework designed to automatically validate test cases generated by LLMs by leveraging token probabilities.
arXiv Detail & Related papers (2024-11-13T00:07:32Z)
- Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE)
RISE injects predefined subtle errors into pivotal tokens in reasoning or steps to construct hard pairs for error mitigation.
Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
- Rethinking the Influence of Source Code on Test Case Generation [22.168699378889148]
Large language models (LLMs) have been widely applied to assist test generation with the source code under test provided as the context.
This paper aims to answer the question: If the source code under test is incorrect, will LLMs be misguided when generating tests?
Our evaluation results demonstrate that incorrect code can significantly mislead LLMs in generating correct, high-coverage, and bug-revealing tests.
arXiv Detail & Related papers (2024-09-14T15:17:34Z)
- Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs).
This work explores whether small (≤ 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- See, Say, and Segment: Teaching LMMs to Overcome False Premises [67.36381001664635]
We propose a cascading and joint training approach for LMMs to solve this task.
Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, and finally "segment" by outputting the mask of the desired objects if they exist.
arXiv Detail & Related papers (2023-12-13T18:58:04Z)
- Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Programs Under Test (PUTs).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- ReAssert: Deep Learning for Assert Generation [3.8174671362014956]
We present RE-ASSERT, an approach for the automated generation of JUnit test asserts.
This is achieved by targeting projects individually, using precise code-to-test traceability for learning.
We also utilise Reformer, a state-of-the-art deep learning model, along with two models from previous work to evaluate ReAssert and an existing approach, known as ATLAS.
arXiv Detail & Related papers (2020-11-19T11:55:59Z)
- Model Assertions for Monitoring and Improving ML Models [26.90089824436192]
We propose a new abstraction, model assertions, that adapts the classical use of program assertions as a way to monitor and improve ML models.
Model assertions are arbitrary functions over a model's input and output that indicate when errors may be occurring.
We propose methods of using model assertions at all stages of ML system deployment, including runtime monitoring, validating labels, and continuously improving ML models.
arXiv Detail & Related papers (2020-03-03T17:49:49Z)
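Reading the "Model Assertions" summary above literally, a model assertion is a boolean predicate over a model's input and output that fires when an error may be occurring. The sketch below is an illustrative rendering of that idea under assumed names (`ModelAssertion`, `empty_output`, `fired_assertions`); it is not the API from that paper.

```python
from typing import Any, Callable, List

# Hypothetical illustration: a model assertion is any boolean predicate over
# (model_input, model_output) that returns True when a likely error is detected.
ModelAssertion = Callable[[Any, Any], bool]


def empty_output(model_input: Any, model_output: Any) -> bool:
    """Example assertion: flag runs where the model produced no output at all."""
    return model_output is None or model_output == []


def fired_assertions(assertions: List[ModelAssertion],
                     model_input: Any, model_output: Any) -> List[str]:
    """Return the names of all assertions that fire on a single (input, output) pair."""
    return [a.__name__ for a in assertions if a(model_input, model_output)]


# e.g. fired_assertions([empty_output], {"frame": 17}, []) -> ["empty_output"]
```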