Related papers: Eliminating Hallucination-Induced Errors in LLM Code Generation with Functional Clustering

Eliminating Hallucination-Induced Errors in LLM Code Generation with Functional Clustering

URL: http://arxiv.org/abs/2506.11021v1
Date: Fri, 16 May 2025 18:19:38 GMT
Title: Eliminating Hallucination-Induced Errors in LLM Code Generation with Functional Clustering
Authors: Chaitanya Ravuri, Saman Amarasinghe,
Abstract summary: We present functional clustering, a black-box wrapper that eliminates nearly all hallucination-induced errors while providing a tunable confidence score.<n>Our verifier preserves baseline pass@1 on solvable tasks yet slashes the error rate of returned answers from 65% to 2%.<n>Because the method requires only sampling and sandbox execution, it applies unchanged to closed-source APIs and future models.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Modern code-generation LLMs can already solve a large fraction of programming problems, yet they still hallucinate subtle bugs that make their outputs unsafe for autonomous deployment. We present functional clustering, a black-box wrapper that eliminates nearly all hallucination-induced errors while providing a tunable confidence score. The wrapper samples many candidate programs, executes each on a self-generated test suite, and clusters candidates whose I/O behavior is identical; the empirical mass of the largest cluster serves as an exact confidence estimate. A single scalar threshold on this estimate lets users trade coverage for reliability with exponential guarantees. On LiveCodeBench our verifier preserves baseline pass@1 on solvable tasks yet slashes the error rate of returned answers from ~65% to 2%, and drives it to 0% at a conservative threshold while still answering 15.6% of prompts. Manual audits show that the few residual mistakes stem from prompt misinterpretation, not random generation noise, narrowing future work to specification clarity. Because the method requires only sampling and sandbox execution, it applies unchanged to closed-source APIs and future models, offering a practical path toward dependable, autonomous code generation. Our code is available on Github (https://github.com/20ChaituR/functional-clustering).

Related papers

Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs [5.10123605644148]
Automated Vulnerability Repair (AVR) is a fast-growing branch of program repair.<n>Recent studies show that large language models (LLMs) outperform traditional techniques.
arXiv Detail & Related papers (2025-07-28T16:39:16Z)
LLM-Based Repair of Static Nullability Errors [14.857404348789201]
We present NullRepair, a system that integrates LLMs into a structured workflow for resolving nullability errors from a nullability checker.<n>NullRepair resolves an average of 72% of the errors that remain after applying a state-of-the-art annotation inference technique.<n>Unlike a naively-prompted LLM, NullRepair also largely preserves program semantics.
arXiv Detail & Related papers (2025-07-28T09:55:04Z)
MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools [54.63478102768333]
Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions.<n>We propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools.
arXiv Detail & Related papers (2025-04-28T18:06:38Z)
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive.<n> LCD can distort the global distribution over strings, sampling tokens based only on local information.<n>We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation [31.363781211927947]
Large language models (LLMs) have achieved impressive performance in code generation.<n>LLMs are susceptible to error accumulation during code generation.<n>We propose ROCODE, which integrates the backtracking mechanism and program analysis into LLMs for code generation.
arXiv Detail & Related papers (2024-11-11T16:39:13Z)
$\mathbb{USCD}$: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding [64.00025564372095]
Large language models (LLMs) have shown remarkable capabilities in code generation. The effects of hallucinations (e.g., output noise) make it challenging for LLMs to generate high-quality code in one pass. We propose a simple and effective textbfuncertainty-aware textbfselective textbfcontrastive textbfdecoding.
arXiv Detail & Related papers (2024-09-09T02:07:41Z)
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. Our research investigates the fragility of uncertainty estimation and explores potential attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification [116.77055746066375]
Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. We propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification.
arXiv Detail & Related papers (2024-03-07T17:44:17Z)
SURE: A Visualized Failure Indexing Approach using Program Memory Spectrum [2.4151044161696587]
We propose SURE, a viSUalized failuRe indExing approach using the program memory spectrum. We first collect the run-time memory information at preset breakpoints during the execution of failed test cases. Any pair of PMS images that serve as proxies for two failures is fed to a trained Siamese convolutional neural network.
arXiv Detail & Related papers (2023-10-19T02:04:35Z)
Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it. Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.