Related papers: Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

URL: http://arxiv.org/abs/2512.09538v1
Date: Wed, 10 Dec 2025 11:24:29 GMT
Title: Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search
Authors: Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov, Preslav Nakov, Mrinmaya Sachan, Maxim Panov,
Abstract summary: We introduce a new family of methods that employ beam search to generate candidates for consistency-based uncertainty estimates.<n>We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
Score: 111.6996614063716
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.

Related papers

Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency [66.96286531087549]
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches.<n>We propose a novel approach to integrating model confidence with output consistency, resulting in a family of efficient and robust UQ methods.<n>We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation.
arXiv Detail & Related papers (2025-02-07T14:30:12Z)
Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach.<n>Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
arXiv Detail & Related papers (2025-01-15T04:09:21Z)
The Polynomial Stein Discrepancy for Assessing Moment Convergence [1.0835264351334324]
We propose a novel method for measuring the discrepancy between a set of samples and a desired posterior distribution for Bayesian inference.<n>We show that the test has higher power than its competitors in several examples, and at a lower computational cost.
arXiv Detail & Related papers (2024-12-06T15:51:04Z)
Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods [59.779795063072655]
Chain-of-Thought (CoT) prompting and its variants have gained popularity as effective methods for solving multi-step reasoning problems. We analyze CoT prompting from a statistical estimation perspective, providing a comprehensive characterization of its sample complexity.
arXiv Detail & Related papers (2024-08-25T04:07:18Z)
SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models [9.817185255633758]
Large language models (LLMs) have become increasingly prevalent, offering remarkable text generation capabilities. A pressing challenge is their tendency to make confidently wrong predictions. We introduce a novel UQ method, sampling with perturbation for UQ (SPUQ), designed to tackle both aleatoric and epistemic uncertainties. Our findings show a substantial improvement in model calibration, with a reduction in Expected Error (ECE) by 50% on average.
arXiv Detail & Related papers (2024-03-04T21:55:22Z)
Combining Confidence Elicitation and Sample-based Methods for Uncertainty Quantification in Misinformation Mitigation [6.929834518749884]
Large Language Models have emerged as prime candidates to tackle misinformation mitigation. Existing approaches struggle with hallucinations and overconfident predictions. We propose an uncertainty quantification framework that leverages both direct confidence elicitation and sampled-based consistency methods.
arXiv Detail & Related papers (2024-01-13T16:36:58Z)
ProBoost: a Boosting Method for Probabilistic Classifiers [55.970609838687864]
ProBoost is a new boosting algorithm for probabilistic classifiers. It uses the uncertainty of each training sample to determine the most challenging/uncertain ones. It produces a sequence that progressively focuses on the samples found to have the highest uncertainty.
arXiv Detail & Related papers (2022-09-04T12:49:20Z)
Learning generative models for valid knockoffs using novel multivariate-rank based statistics [12.528602250193206]
Rank energy (RE) is derived using theoretical results characterizing the optimal maps in the Monge's Optimal Transport (OT) problem. We propose a variant of the RE, dubbed as soft rank energy (sRE), and its kernel variant called as soft rank maximum mean discrepancy (sRMMD) We then use sRMMD to generate deep knockoffs and show via extensive evaluation that it is a novel and effective method to produce valid knockoffs.
arXiv Detail & Related papers (2021-10-29T18:51:19Z)
Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation. When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.