METAMON: Finding Inconsistencies between Program Documentation and Behavior using Metamorphic LLM Queries
- URL: http://arxiv.org/abs/2502.02794v1
- Date: Wed, 05 Feb 2025 00:42:50 GMT
- Title: METAMON: Finding Inconsistencies between Program Documentation and Behavior using Metamorphic LLM Queries
- Authors: Hyeonseok Lee, Gabin An, Shin Yoo
- Abstract summary: This paper proposes METAMON, which uses an existing search-based test generation technique to capture the current program behavior in the form of test cases.
METAMON is supported in this task by metamorphic testing and self-consistency.
An empirical evaluation against 9,482 pairs of code documentation and code snippets, generated using five open-source projects from Defects4J v2.0.1, shows that METAMON can classify the code-and-documentation inconsistencies with a precision of 0.72 and a recall of 0.48.
- Score: 10.9334354663311
- License:
- Abstract: Code documentation can, if written precisely, help developers better understand the code it accompanies. However, unlike code, code documentation cannot be automatically verified via execution, potentially leading to inconsistencies between the documentation and the actual behavior. While such inconsistencies can be harmful to the developer's understanding of the code, checking and finding them remains a costly task due to the involvement of human engineers. This paper proposes METAMON, which uses an existing search-based test generation technique to capture the current program behavior in the form of test cases, and subsequently uses LLM-based code reasoning to identify the generated regression test oracles that are not consistent with the program specifications in the documentation. METAMON is supported in this task by metamorphic testing and self-consistency. An empirical evaluation against 9,482 pairs of code documentation and code snippets, generated using five open-source projects from Defects4J v2.0.1, shows that METAMON can classify the code-and-documentation inconsistencies with a precision of 0.72 and a recall of 0.48.
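To make the described pipeline more concrete, the sketch below shows one way the metamorphic-query and self-consistency steps could be combined to classify a documentation/oracle pair. It is an illustration under stated assumptions only: the `ask_llm` callable, the prompt wording, and the direct-versus-negated metamorphic relation are placeholders introduced here, not METAMON's actual prompts or relations, and the search-based generation of the regression tests (e.g., with a tool such as EvoSuite) is taken as given.
```python
# Illustrative sketch of metamorphic LLM querying with self-consistency voting,
# loosely following the pipeline described in the abstract. `ask_llm`, the
# prompts, and the metamorphic relation are placeholders, not METAMON's
# actual implementation.
from collections import Counter
from typing import Callable

def majority_vote(ask_llm: Callable[[str], str], prompt: str, n_samples: int = 5) -> str:
    """Self-consistency: sample several answers and keep the most common one."""
    answers = [ask_llm(prompt).strip().lower() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

def classify_inconsistency(ask_llm: Callable[[str], str],
                           documentation: str, test_oracle: str) -> bool:
    """Return True when the documented behavior and a regression-test oracle
    (which captures the current program behavior) appear inconsistent."""
    context = (f"Documentation:\n{documentation}\n\n"
               f"Regression test assertion:\n{test_oracle}\n\n")
    direct = context + "Is the asserted behavior consistent with the documentation? Answer yes or no."
    negated = context + "Does the asserted behavior contradict the documentation? Answer yes or no."

    a_direct = majority_vote(ask_llm, direct)    # 'yes' expected if consistent
    a_negated = majority_vote(ask_llm, negated)  # metamorphic variant: answer should flip

    # Metamorphic relation: the two answers should be opposites; an inconsistency
    # is reported only when both phrasings agree that the oracle violates the docs.
    return a_direct == "no" and a_negated == "yes"
```
In the full pipeline described in the abstract, each regression-test oracle produced by the search-based generator would be checked in this manner against the documentation of the code it exercises.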
Related papers
- CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation [10.048098631259876]
Code generation aims to automatically produce code that fulfills requirements written in natural language.
Large Language Models (LLMs) like ChatGPT fail to ensure the syntactic and semantic correctness of the generated code.
We propose CodeCoR, a self-reflective multi-agent framework that evaluates the effectiveness of each agent and their collaborations.
arXiv Detail & Related papers (2025-01-14T03:21:10Z) - Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.
Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.
Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z) - A Deep Dive Into Large Language Model Code Generation Mistakes: What and Why? [9.246899995643918]
Large Language Models can still generate defective code that deviates from the specification.
Seven categories of non-syntactic mistakes were identified through extensive manual analyses.
Our evaluation demonstrated that GPT-4 with the ReAct prompting technique can achieve an F1 score of up to 0.65 when identifying the reasons for LLMs' mistakes.
arXiv Detail & Related papers (2024-11-03T02:47:03Z) - Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs [10.510325069289324]
We propose a self-refinement method aimed at improving the reliability of code generated by LLMs.
Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code.
Our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code (a minimal sketch of such a loop appears after this list).
arXiv Detail & Related papers (2024-05-22T19:02:50Z) - Test-Driven Development for Code Generation [0.850206009406913]
Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements.
This paper investigates if and how Test-Driven Development (TDD) can be incorporated into AI-assisted code-generation processes.
arXiv Detail & Related papers (2024-02-21T04:10:12Z) - GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence [64.95492752484171]
We present GenAudit, a tool intended to assist in fact-checking LLM responses for document-grounded tasks.
GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support.
Comprehensive evaluation by human raters shows that GenAudit can detect errors in 8 different LLM outputs when summarizing documents from diverse domains.
arXiv Detail & Related papers (2024-02-19T21:45:55Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - FLAG: Finding Line Anomalies (in code) with Generative AI [18.612900041820875]
FLAG is based on the lexical capabilities of generative AI, specifically Large Language Models (LLMs).
We use 121 benchmarks across C, Python, and Verilog, with each benchmark containing a known security or functional weakness.
FLAG can identify 101 of the defects and helps reduce the search space to 12-17% of source code.
arXiv Detail & Related papers (2023-06-22T03:04:56Z) - Execution-based Evaluation for Data Science Code Generation Models [97.96608263010913]
We introduce ExeDS, an evaluation dataset for execution-based evaluation of data science code generation tasks.
ExeDS contains a set of 534 problems from Jupyter Notebooks, each consisting of code context, task description, reference program, and desired execution output.
We evaluate the execution performance of five state-of-the-art code generation models that have achieved high surface-form evaluation scores.
arXiv Detail & Related papers (2022-11-17T07:04:11Z) - Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation, TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and, via retrieval, references to code with similar semantics.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.