Eliminating Hallucination-Induced Errors in LLM Code Generation with Functional Clustering
- URL: http://arxiv.org/abs/2506.11021v1
- Date: Fri, 16 May 2025 18:19:38 GMT
- Title: Eliminating Hallucination-Induced Errors in LLM Code Generation with Functional Clustering
- Authors: Chaitanya Ravuri, Saman Amarasinghe
- Abstract summary: We present functional clustering, a black-box wrapper that eliminates nearly all hallucination-induced errors while providing a tunable confidence score. Our verifier preserves baseline pass@1 on solvable tasks yet slashes the error rate of returned answers from 65% to 2%. Because the method requires only sampling and sandbox execution, it applies unchanged to closed-source APIs and future models.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Modern code-generation LLMs can already solve a large fraction of programming problems, yet they still hallucinate subtle bugs that make their outputs unsafe for autonomous deployment. We present functional clustering, a black-box wrapper that eliminates nearly all hallucination-induced errors while providing a tunable confidence score. The wrapper samples many candidate programs, executes each on a self-generated test suite, and clusters candidates whose I/O behavior is identical; the empirical mass of the largest cluster serves as an exact confidence estimate. A single scalar threshold on this estimate lets users trade coverage for reliability with exponential guarantees. On LiveCodeBench our verifier preserves baseline pass@1 on solvable tasks yet slashes the error rate of returned answers from ~65% to 2%, and drives it to 0% at a conservative threshold while still answering 15.6% of prompts. Manual audits show that the few residual mistakes stem from prompt misinterpretation, not random generation noise, narrowing future work to specification clarity. Because the method requires only sampling and sandbox execution, it applies unchanged to closed-source APIs and future models, offering a practical path toward dependable, autonomous code generation. Our code is available on GitHub (https://github.com/20ChaituR/functional-clustering).
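The sample, execute, cluster, and threshold loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `behavior_signature` and `functional_clustering` are hypothetical, candidates are plain Python callables run in-process rather than LLM samples run in a sandbox, and the test inputs are supplied by the caller rather than self-generated. The abstract does not say how a representative is chosen from the winning cluster, so the sketch simply returns its first member.

```python
from collections import defaultdict
from typing import Any, Callable, Hashable, Optional, Sequence, Tuple

Program = Callable[[Any], Any]


def behavior_signature(program: Program, test_inputs: Sequence[Any]) -> Hashable:
    """Run one candidate on every test input and record its observable behavior."""
    observed = []
    for x in test_inputs:
        try:
            observed.append(("ok", repr(program(x))))
        except Exception as exc:  # a crash counts as behavior too
            observed.append(("error", type(exc).__name__))
    return tuple(observed)


def functional_clustering(
    candidates: Sequence[Program],
    test_inputs: Sequence[Any],
    threshold: float = 0.5,
) -> Tuple[Optional[Program], float]:
    """Group candidates with identical I/O behavior and score the largest group.

    Returns (answer, confidence). The answer is None (abstain) when the
    largest cluster's share of the samples falls below the threshold.
    """
    if not candidates:
        return None, 0.0

    # Candidates with the same signature land in the same cluster.
    clusters = defaultdict(list)
    for program in candidates:
        clusters[behavior_signature(program, test_inputs)].append(program)

    # Confidence is the empirical mass of the largest cluster.
    largest = max(clusters.values(), key=len)
    confidence = len(largest) / len(candidates)

    # Assumption: any member of the winning cluster is an acceptable answer.
    answer = largest[0] if confidence >= threshold else None
    return answer, confidence


# Three candidates agree on every input; one (x * 2) behaves differently.
candidates = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: abs(x) * abs(x),
    lambda x: x * 2,
]
answer, confidence = functional_clustering(candidates, [-2, 0, 3], threshold=0.7)
print(confidence, answer is not None)
```

Raising `threshold` makes the wrapper abstain more often, which is the coverage-for-reliability trade the abstract describes; with `threshold=0.9` the same four candidates would produce no answer.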
Related papers
- Detecting and Correcting Hallucinations in LLM-Generated Code via Deterministic AST Analysis [11.687400527666476]
This paper investigates whether a deterministic, static-analysis framework can reliably detect and auto-correct KCHs. We propose a post-processing framework that parses generated code into an Abstract Syntax Tree (AST) and validates it against a dynamically generated Knowledge Base (KB). This non-executing approach uses deterministic rules to find and fix both API- and identifier-level conflicts.
arXiv Detail & Related papers (2026-01-27T02:16:37Z)
- Probabilistic Guarantees for Reducing Contextual Hallucinations in LLMs [0.0]
Large language models (LLMs) frequently produce contextual hallucinations, where generated content contradicts or ignores information explicitly stated in the prompt. We introduce a model-agnostic framework that provides explicit probabilistic guarantees for reducing hallucinations in this setting. We prove that the probability that the judged pipeline fails decays at a rate determined by the judge's true- and false-positive probabilities.
arXiv Detail & Related papers (2026-01-02T10:52:33Z)
- Localized Calibrated Uncertainty in Code Language Models [1.2733370160280995]
We offer techniques to localize where generations might be misaligned from user intent. We measure how well various techniques can assign a well-calibrated probability to indicate which parts of code will be edited in a minimal patch. We find that probes with a small supervisor model can achieve low calibration error and a Brier Skill Score of approximately 0.2 when estimating edited lines on code generated by models many orders of magnitude larger.
arXiv Detail & Related papers (2025-12-31T02:00:17Z)
- DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems [48.971606069204825]
DoVer is an intervention-driven debugging framework for large language model (LLM)-based multi-agent systems. It augments hypothesis generation with active verification through targeted interventions. DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
arXiv Detail & Related papers (2025-12-07T09:23:48Z)
- Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity [2.7389338551082605]
We develop a benchmark to test Large Language Models (LLMs) for anticipating performance bottlenecks. FLOPBench predicts single- and double-precision FLOP counts for 577 kernels. Our results position FLOPBench as a focused testbed for developing LLM tooling.
arXiv Detail & Related papers (2025-12-04T01:03:20Z)
- Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking [54.43083499412643]
Test-time algorithms that combine the generative power of language models with process verifiers offer a promising lever for eliciting new reasoning capabilities. We introduce a new process-guided test-time sampling algorithm, VGB, which uses theoretically grounded backtracking to achieve provably better robustness to verifier errors.
arXiv Detail & Related papers (2025-10-03T16:21:14Z)
- Probing Pre-trained Language Models on Code Changes: Insights from ReDef, a High-Confidence Just-in-Time Defect Prediction Dataset [0.0]
We present ReDef, a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks. This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than existing resources.
arXiv Detail & Related papers (2025-09-11T07:07:11Z)
- Diffusion Language Models Know the Answer Before Decoding [56.96815863705218]
Diffusion language models (DLMs) have emerged as an alternative to autoregressive approaches. Our work highlights and leverages an overlooked property of DLMs: early answer convergence. We introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding.
arXiv Detail & Related papers (2025-08-27T15:40:25Z)
- Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs [5.10123605644148]
Automated Vulnerability Repair (AVR) is a fast-growing branch of program repair. Recent studies show that large language models (LLMs) outperform traditional techniques.
arXiv Detail & Related papers (2025-07-28T16:39:16Z)
- LLM-Based Repair of Static Nullability Errors [14.857404348789201]
We present NullRepair, a system that integrates LLMs into a structured workflow for resolving nullability errors from a nullability checker. NullRepair resolves an average of 72% of the errors that remain after applying a state-of-the-art annotation inference technique. Unlike a naively prompted LLM, NullRepair also largely preserves program semantics.
arXiv Detail & Related papers (2025-07-28T09:55:04Z)
- MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools [54.63478102768333]
Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions. We propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools.
arXiv Detail & Related papers (2025-04-28T18:06:38Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating a constraint on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Beyond Next Token Probabilities: Learnable, Fast Detection of Hallucinations and Data Contamination on LLM Output Distributions [60.43398881149664]
We introduce LOS-Net, a lightweight attention-based architecture trained on an efficient encoding of the LLM Output Signature. It achieves superior performance across diverse benchmarks and LLMs, while maintaining extremely low detection latency.
arXiv Detail & Related papers (2025-03-18T09:04:37Z)
- ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation [31.363781211927947]
Large language models (LLMs) have achieved impressive performance in code generation. LLMs are susceptible to error accumulation during code generation. We propose ROCODE, which integrates the backtracking mechanism and program analysis into LLMs for code generation.
arXiv Detail & Related papers (2024-11-11T16:39:13Z)
- $\mathbb{USCD}$: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding [64.00025564372095]
Large language models (LLMs) have shown remarkable capabilities in code generation.
The effects of hallucinations (e.g., output noise) make it challenging for LLMs to generate high-quality code in one pass.
We propose a simple and effective uncertainty-aware selective contrastive decoding (USCD) mechanism.
arXiv Detail & Related papers (2024-09-09T02:07:41Z)
- Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
- Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification [116.77055746066375]
Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output.
We propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification.
arXiv Detail & Related papers (2024-03-07T17:44:17Z)
- SURE: A Visualized Failure Indexing Approach using Program Memory Spectrum [2.4151044161696587]
We propose SURE, a viSUalized failuRe indExing approach using the program memory spectrum.
We first collect the run-time memory information at preset breakpoints during the execution of failed test cases.
Any pair of PMS images that serve as proxies for two failures is fed to a trained Siamese convolutional neural network.
arXiv Detail & Related papers (2023-10-19T02:04:35Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)