Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics
- URL: http://arxiv.org/abs/2508.08661v1
- Date: Tue, 12 Aug 2025 05:59:33 GMT
- Title: Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics
- Authors: Chunhua Liu, Hong Yi Lin, Patanamon Thongtanunam
- Abstract summary: Hallucinations have been studied independently in natural language and code generation. This paper studies hallucinations in two critical tasks involving code change to natural language generation: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them.
- Score: 2.990411348977783
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language models have shown strong capabilities across a wide range of tasks in software engineering, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes, which have a structurally complex and context-dependent format, remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical tasks involving code change to natural language generation: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations. Whilst commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics effectively contribute to hallucination detection, showing promise for inference-time detection. (All code and data will be released upon acceptance.)
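The abstract's headline finding, that model confidence and feature attribution signals help detect hallucinations at inference time, can be illustrated with a short sketch. The code below is not the authors' withheld implementation; it is a minimal stand-in that scores a generation by the mean log-probability the model assigns to its own tokens, adds a crude source-overlap signal, and leaves room to combine several such signals in a supervised classifier. The gpt2 checkpoint and all feature choices are placeholder assumptions.

    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder checkpoint; the paper's evaluated models are not assumed here.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def mean_token_logprob(prompt: str, generation: str) -> float:
        # Average log-probability the model assigns to its own generation,
        # a common proxy for model confidence. Tokenizing the prompt and
        # prompt+generation separately is a token-boundary approximation.
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + generation, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
        targets = full_ids[0, 1:]
        scores = log_probs[torch.arange(targets.shape[0]), targets]
        return scores[prompt_len - 1:].mean().item()  # score only the generated suffix

    def source_overlap(source: str, generation: str) -> float:
        # Fraction of generated tokens that also appear in the source
        # (e.g., the code diff): a crude grounding signal.
        src = set(source.lower().split())
        gen = generation.lower().split()
        return sum(tok in src for tok in gen) / max(len(gen), 1)

    def fit_detector(features, labels):
        # Combine per-sample metric vectors with human labels
        # (1 = hallucinated) into a simple supervised detector.
        return LogisticRegression().fit(features, labels)

Stacking weak signals in one classifier mirrors the paper's observation that individual metrics are weak detectors while their combination is substantially stronger.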
Related papers
- A Systematic Literature Review of Code Hallucinations in LLMs: Characterization, Mitigation Methods, Challenges, and Future Directions for Reliable AI [54.34738767990601]
As Large Language Models become increasingly integrated into software engineering tasks, understanding and mitigating hallucination in code becomes essential. We provide a systematic review of hallucination phenomena in code-oriented LLMs from four key perspectives.
arXiv Detail & Related papers (2025-11-02T02:58:41Z)
- When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA [46.50540400870401]
PsiloQA is a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
arXiv Detail & Related papers (2025-10-06T14:36:30Z)
- (Im)possibility of Automated Hallucination Detection in Large Language Models [40.13262095901877]
We introduce a theoretical framework to analyze the feasibility of automatically detecting hallucinations produced by large language models (LLMs). We investigate whether an algorithm, trained on examples drawn from an unknown target language, can reliably determine whether the LLM's outputs are correct or constitute hallucinations. We show that the use of expert-labeled feedback, i.e., training the detector with both positive examples (correct statements) and negative examples (explicitly labeled incorrect statements), dramatically changes this conclusion.
arXiv Detail & Related papers (2025-04-23T18:00:07Z)
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [78.78822033285938]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z)
- Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training [58.696660064190475]
We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching.
arXiv Detail & Related papers (2025-04-02T15:09:58Z)
- ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries [29.561699707926056]
Large language models (LLMs) are prone to hallucination: outputs that stray from intended meanings. We introduce a first-of-its-kind dataset with ~10K samples, curated specifically for hallucination detection in code summarization.
arXiv Detail & Related papers (2024-10-17T19:38:55Z)
- CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [73.66920648926161]
We introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification. We present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations.
arXiv Detail & Related papers (2024-04-30T23:56:38Z)
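CodeHalu's execution-based framing lends itself to a compact illustration. The sketch below is a generic stand-in rather than the paper's algorithm: run a generated function together with reference tests in a fresh interpreter and treat any failure (assertion error, crash, timeout) as evidence of hallucination. The example function and test are hypothetical.

    import subprocess
    import sys
    import tempfile

    def passes_tests(generated_code: str, test_code: str, timeout: float = 5.0) -> bool:
        # Execute the generated code plus its tests in a subprocess;
        # a non-zero exit status flags a likely hallucination.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code + "\n" + test_code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False

    generated = "def add(a, b):\n    return a - b  # wrong operator\n"
    tests = "assert add(2, 3) == 5\n"
    print(passes_tests(generated, tests))  # False: the generation fails verification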
- Comparing Hallucination Detection Metrics for Multilingual Generation [62.97224994631494]
This paper assesses how well various factual hallucination detection metrics identify hallucinations in generated biographical summaries across languages.
We compare how well automatic metrics correlate to each other and whether they agree with human judgments of factuality.
Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models.
arXiv Detail & Related papers (2024-02-16T08:10:34Z)
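A minimal sketch of the NLI-based scoring this entry finds effective: use an off-the-shelf NLI model to test whether the source text entails a generated sentence, and flag low-entailment outputs. The roberta-large-mnli checkpoint and its label order (0=contradiction, 1=neutral, 2=entailment) are assumptions to verify against the model card.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    NLI_MODEL = "roberta-large-mnli"  # assumed label order: contradiction, neutral, entailment
    nli_tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
    nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL).eval()

    def entailment_score(premise: str, hypothesis: str) -> float:
        # P(entailment) of a generated sentence given the source text;
        # low scores flag likely hallucinations.
        inputs = nli_tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = torch.softmax(nli_model(**inputs).logits, dim=-1)
        return probs[0, 2].item()

    # Hypothetical example: a claim unsupported by the source should score low.
    source = "Ada Lovelace wrote the first published computer algorithm."
    claim = "Ada Lovelace invented the telephone."
    print(entailment_score(source, claim))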
- AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces AutoHall, a method for automatically constructing model-specific hallucination datasets from existing fact-checking datasets.
We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
arXiv Detail & Related papers (2023-09-30T05:20:02Z)
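The zero-resource self-contradiction idea above admits a very small sketch (a simplification, not AutoHall's actual procedure): sample several stochastic answers to the same prompt and flag the output when the samples disagree.

    from collections import Counter

    def self_consistency(samples: list[str]) -> float:
        # Share of sampled answers agreeing with the majority answer;
        # low agreement suggests the model is hallucinating.
        normalized = [s.strip().lower() for s in samples]
        _, count = Counter(normalized).most_common(1)[0]
        return count / len(normalized)

    # Hypothetical answers from four stochastic decodes of one question.
    print(self_consistency(["Paris", "Paris", "Lyon", "Paris"]))  # 0.75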