Related papers: Reducing Hallucinations in LLM-Generated Code via Semantic Triangulation

Reducing Hallucinations in LLM-Generated Code via Semantic Triangulation

URL: http://arxiv.org/abs/2511.12288v2
Date: Sat, 22 Nov 2025 03:27:30 GMT
Title: Reducing Hallucinations in LLM-Generated Code via Semantic Triangulation
Authors: Yihan Dai, Sijie Liang, Haotian Xu, Peichu Xie, Sergey Mechtaev,
Abstract summary: We introduce semantic triangulation, which transforms a programming problem in a way that preserves an exact, verifiable mapping between solutions.<n>On the LiveCodeBench and CodeElo benchmarks, semantic triangulation increases reliability of generated code by 21%.<n>It is also the only approach to consistently form true consensus on tasks with multiple valid but non-equivalent solutions.
Score: 2.8646222242803643
License: http://creativecommons.org/licenses/by/4.0/
Abstract: When generating code from natural language prompts, an LLM samples programs from a probability distribution, many of which might be incorrect. Sample consensus techniques - such as majority voting or validation against generated tests or specifications - aim to identify a correct program in the sample or abstain if none is valid. However, existing methods often fail to select a correct solution when its sampling probability is low, or when the problem permits multiple valid but non-equivalent solutions. Additionally, they often fail to abstain when no correct solution is present in the sample. To overcome these limitations, we introduce semantic triangulation, which transforms a programming problem in a way that non-trivially alters its semantics while preserving an exact, verifiable mapping between solutions before and after transformation. We theoretically establish that verifying consistency across such problem transformations increases confidence that generated programs reflect accurate generalization rather than spurious statistical correlations, enabling more reliable sample consensus and abstention. On the LiveCodeBench and CodeElo benchmarks, using GPT-4o and DeepSeek-V3 models, semantic triangulation increases reliability of generated code by 21% compared to the method that selects only high-confidence solutions with the probability threshold 0.5, while being able to pinpoint correct solutions at sampling probabilities as low as 0.14. Apart from that, it is also the only approach to consistently form true consensus on tasks with multiple valid but non-equivalent solutions.

Related papers

BEAVER: An Efficient Deterministic LLM Verifier [11.949243456810263]
We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on large language models.<n>We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on correctness verification, privacy verification and secure code generation tasks.
arXiv Detail & Related papers (2025-12-05T05:34:06Z)
Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking [54.43083499412643]
Test-time algorithms that combine the generative power of language models with process verifiers offer a promising lever for eliciting new reasoning capabilities.<n>We introduce a new process-guided test-time sampling algorithm, VGB, which uses theoretically grounded backtracking to achieve provably better robustness to verifier errors.
arXiv Detail & Related papers (2025-10-03T16:21:14Z)
Constrained Adaptive Rejection Sampling [27.579645342312674]
Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints.<n>Existing approaches to constrained generation fall along a spectrum: greedy constrained decoding methods enforce validity during decoding but distort the LM's distribution.<n>We present Constrained Adaptive Rejection Sampling (CARS), an approach that strictly improves the sample-efficiency of RS without distributional distortion.
arXiv Detail & Related papers (2025-10-02T11:17:26Z)
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive.<n> LCD can distort the global distribution over strings, sampling tokens based only on local information.<n>We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
Towards Optimal Multi-draft Speculative Decoding [102.67837141152232]
Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts.<n>This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate.
arXiv Detail & Related papers (2025-02-26T03:22:44Z)
Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach.<n>Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
arXiv Detail & Related papers (2025-01-15T04:09:21Z)
Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers [13.823743787003787]
Recent research has generated hope that inference scaling could allow weaker language models to match or exceed the accuracy of stronger models.<n>We show that no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model.<n>We also show that beyond accuracy, false positives may have other undesirable qualities, such as poor adherence to coding style conventions.
arXiv Detail & Related papers (2024-11-26T15:13:06Z)
Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [60.58434523646137]
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency. We introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question. Our experiments show that Adaptive-Consistency reduces sample budget by up to 7.9 times with an average accuracy drop of less than 0.1%.
arXiv Detail & Related papers (2023-05-19T17:49:25Z)
Certifying Neural Network Robustness to Random Input Noise from Samples [14.191310794366075]
Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. We propose a novel robustness certification method that upper bounds the probability of misclassification when the input noise follows an arbitrary probability distribution.
arXiv Detail & Related papers (2020-10-15T05:27:21Z)
Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model. The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.