Related papers: Identifying Inaccurate Descriptions in LLM-generated Code Comments via Test Execution

URL: http://arxiv.org/abs/2406.14836v1
Date: Fri, 21 Jun 2024 02:40:34 GMT
Title: Identifying Inaccurate Descriptions in LLM-generated Code Comments via Test Execution
Authors: Sungmin Kang, Louis Milliken, Shin Yoo
Abstract summary: We evaluate comments generated by three Large Language Models (LLMs). We propose the concept of document testing, in which a document is verified by using an LLM to generate tests based on the document, running those tests, and observing whether they pass or fail.
Score: 11.418182511485032
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Software comments are critical for human understanding of software, and as such many comment generation techniques have been proposed. However, we find that a systematic evaluation of the factual accuracy of generated comments is rare; only subjective accuracy labels have been given. Evaluating comments generated by three Large Language Models (LLMs), we find that even for the best-performing LLM, roughly a fifth of its comments contained demonstrably inaccurate statements. While it seems code-comment consistency detection techniques should be able to detect inaccurate comments, we perform experiments demonstrating they have no statistically significant relationship with comment accuracy, underscoring the substantial difficulty of this problem. To tackle this, we propose the concept of document testing, in which a document is verified by using an LLM to generate tests based on the document, running those tests, and observing whether they pass or fail. Furthermore, we implement our concept to verify Java comments. Experiments demonstrate that our approach has a robust statistical relationship with comment accuracy, making headway into a problem where prior techniques failed. Qualitative evaluation also reveals the promise of our approach in gaining developer trust, while highlighting the limitations of our current implementation.
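The document-testing loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper targets Java comments and uses an LLM to write the tests, whereas the sketch below uses Python and a hard-coded stand-in (`generate_tests`) in place of the LLM call. The function `clamp`, the two comments, and all helper names are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List


def clamp(value: int, low: int, high: int) -> int:
    """Code whose comments are being checked (hypothetical example)."""
    return max(low, min(value, high))


# Two candidate comments for clamp; the second is deliberately inaccurate.
COMMENTS: Dict[str, str] = {
    "accurate": "Returns value limited to the range [low, high].",
    "inaccurate": "Returns low whenever value is outside [low, high].",
}


def generate_tests(comment_id: str) -> List[Callable[[], bool]]:
    """Stand-in for the LLM step (assumption): a real system would prompt
    a model with the comment text and parse the tests it returns."""
    if comment_id == "accurate":
        return [
            lambda: clamp(5, 0, 10) == 5,
            lambda: clamp(15, 0, 10) == 10,
            lambda: clamp(-3, 0, 10) == 0,
        ]
    if comment_id == "inaccurate":
        return [
            lambda: clamp(-3, 0, 10) == 0,
            lambda: clamp(15, 0, 10) == 0,  # follows the comment, not the code
        ]
    raise KeyError(comment_id)


def document_test(comment_id: str) -> Dict[str, int]:
    """Run the tests derived from a comment and count passes and failures."""
    results = {"passed": 0, "failed": 0}
    for test in generate_tests(comment_id):
        try:
            ok = test()
        except Exception:
            ok = False  # a crashing test counts as a failure
        results["passed" if ok else "failed"] += 1
    return results


def verdict(results: Dict[str, int]) -> str:
    """Flag a comment as suspect if any derived test fails."""
    return "suspect" if results["failed"] else "supported"


if __name__ == "__main__":
    for comment_id, text in COMMENTS.items():
        outcome = document_test(comment_id)
        print(f"{verdict(outcome):9s} {outcome} :: {text}")
```

In this sketch a single failing test marks the comment as suspect. That is a simplifying assumption; the abstract reports only a statistical relationship between test outcomes and comment accuracy, so a real system would likely need a less strict decision rule to tolerate incorrectly generated tests.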

Related papers

Detecting LLM-Generated Peer Reviews [37.51215252353345]
The rise of large language models (LLMs) has introduced concerns that some reviewers may rely on these tools to generate reviews rather than writing them independently. We consider the approach of performing indirect prompt injection via the paper's PDF, prompting the LLM to embed a covert watermark in the generated review. We introduce watermarking schemes and hypothesis tests that control the family-wise error rate across multiple reviews, achieving higher statistical power than standard corrections.
arXiv Detail & Related papers (2025-03-20T01:11:35Z)
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models [97.18215355266143]
We introduce a holistic code critique benchmark for Large Language Models (LLMs) called CodeCriticBench. Specifically, our CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) with different difficulties. Besides, the evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics.
arXiv Detail & Related papers (2025-02-23T15:36:43Z)
Harnessing Large Language Models for Curated Code Reviews [2.5944208050492183]
In code review, generating structured and relevant comments is crucial for identifying code issues and facilitating accurate code changes. Existing code review datasets are often noisy and unrefined, posing limitations to the learning potential of AI models. We propose a curation pipeline designed to enhance the quality of the largest publicly available code review dataset.
arXiv Detail & Related papers (2025-02-05T18:15:09Z)
Deep Assessment of Code Review Generation Approaches: Beyond Lexical Similarity [27.92468098611616]
We propose two novel semantic-based approaches for assessing code reviews. The first approach involves converting both the generated review and its reference into digital vectors using a deep learning model. The second approach generates a prompt based on the generated review and its reference, submits this prompt to ChatGPT, and requests ChatGPT to rate the generated review.
arXiv Detail & Related papers (2025-01-09T11:52:32Z)
DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation [11.010557279355885]
This study empirically analyzes benchmark comments using a novel set of criteria informed by prior research and developer interviews. Our evaluation framework, DeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a comprehensive reassessment of current techniques.
arXiv Detail & Related papers (2024-12-24T08:53:54Z)
A Framework for Evaluating LLMs Under Task Indeterminacy [49.298107503257036]
Large language model (LLM) evaluations often assume there is a single correct response -- a gold label -- for each item in the evaluation corpus. We develop a framework for evaluating LLMs under task indeterminacy.
arXiv Detail & Related papers (2024-11-21T00:15:44Z)
On the Rationale and Use of Assertion Messages in Test Code: Insights from Software Practitioners [10.264620067797798]
Unit testing is an important practice that helps ensure the quality of a software system by validating its behavior through a series of test cases. Core to these test cases are assertion statements, which enable software practitioners to validate the correctness of the system's behavior. To aid with understanding and troubleshooting test case failures, practitioners can include a message (i.e., assertion message) within the assertion statement.
arXiv Detail & Related papers (2024-08-03T11:13:36Z)
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification (UQ) is a critical component of machine learning (ML) applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across nine tasks, and identify the most promising approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations. First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics. Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations. Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
Software Vulnerability and Functionality Assessment using LLMs [0.8057006406834466]
We investigate whether Large Language Models (LLMs) can aid with code reviews. Our investigation focuses on two tasks that we argue are fundamental to good reviews.
arXiv Detail & Related papers (2024-03-13T11:29:13Z)
Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of responses generated by large language models (LLMs). We construct an open-domain document-level factuality benchmark at three levels of granularity: claim, sentence and document. Preliminary experiments show that FacTool, FactScore and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. We collect responses generated from large language models and annotate factuality labels in a fine-grained manner. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory at faithfully detecting factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z)
ChatGPT as a Factual Inconsistency Evaluator for Text Summarization [17.166794984161964]
We show that ChatGPT can evaluate factual inconsistency under a zero-shot setting. It generally outperforms previous evaluation metrics on binary entailment inference, summary ranking, and consistency rating. However, a closer inspection of ChatGPT's output reveals certain limitations including its preference for more lexically similar candidates, false reasoning, and inadequate understanding of instructions.
arXiv Detail & Related papers (2023-03-27T22:30:39Z)
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics. We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores. Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
Deep Just-In-Time Inconsistency Detection Between Comments and Source Code [51.00904399653609]
In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code. We develop a deep-learning approach that learns to correlate a comment with code changes. We show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system.
arXiv Detail & Related papers (2020-10-04T16:49:28Z)