Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification
- URL: http://arxiv.org/abs/2507.07236v2
- Date: Fri, 05 Sep 2025 17:54:18 GMT
- Title: Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification
- Authors: Maya Kruse, Majid Afshar, Saksham Khatwani, Anoop Mayampurath, Guanhua Chen, Yanjun Gao
- Abstract summary: MUSE is a simple information-theoretic method to identify and aggregate well-calibrated subsets of large language models. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines.
- Score: 9.397157329808254
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines. In addition, we explore using MUSE as guided signals with chain-of-thought distillation to fine-tune LLMs for calibration. MUSE is available at: https://github.com/LARK-NLP-Lab/MUSE.
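To make the idea in the abstract concrete, here is a minimal sketch of how Jensen-Shannon divergence could be used to pick a subset of models on a binary task and average their predictions. It is an illustration under stated assumptions, not the MUSE algorithm: the function names (`bernoulli_jsd`, `aggregate_subset`), the `max_jsd` threshold, and the rule "keep models close to the all-model mean" are all hypothetical, and the rule selects for agreement rather than for calibration as the paper does. See the MUSE repository for the authors' actual procedure.

```python
import math


def bernoulli_jsd(p, q):
    """Jensen-Shannon divergence (log base 2) between Bernoulli(p) and Bernoulli(q)."""

    def kl(a, b):
        # KL divergence between two Bernoulli distributions; terms with zero mass are skipped.
        total = 0.0
        for x, y in ((a, b), (1.0 - a, 1.0 - b)):
            if x > 0.0:
                total += x * math.log2(x / y)
        return total

    m = 0.5 * (p + q)  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def aggregate_subset(probs, max_jsd=0.1):
    """Hypothetical selection rule: keep models whose prediction is within
    max_jsd of the all-model mean, then average the kept predictions."""
    mean_all = sum(probs) / len(probs)
    kept = [p for p in probs if bernoulli_jsd(p, mean_all) <= max_jsd]
    if not kept:  # fall back to the full ensemble if everything was filtered out
        kept = list(probs)
    return sum(kept) / len(kept), kept


# Positive-class probabilities from four hypothetical models for one input.
estimate, kept = aggregate_subset([0.8, 0.75, 0.85, 0.1])
print(estimate, kept)
```

With base-2 logarithms the divergence is bounded between 0 and 1, which makes a fixed threshold such as `max_jsd` easier to interpret; the value 0.1 is arbitrary and would need tuning on held-out data.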
Related papers
- Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume [45.38219855706969]
We introduce UMPIRE, a training-free uncertainty quantification framework for Multimodal Large Language Models (MLLMs).
UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance.
We show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks.
arXiv Detail & Related papers (2026-02-27T17:18:42Z) - Making Foundation Models Probabilistic via Singular Value Ensembles [56.4174499669573]
Foundation models have become a dominant paradigm in machine learning, achieving remarkable performance across diverse tasks through large-scale pretraining.
The standard approach to quantifying uncertainty, training an ensemble of independent models, incurs prohibitive computational costs that scale linearly with ensemble size.
We propose Singular Value Ensemble (SVE), a parameter-efficient implicit ensemble method that builds on a simple but powerful core assumption.
We show that SVE achieves uncertainty quantification comparable to explicit deep ensembles while increasing the parameter count of the base model by less than 1%.
arXiv Detail & Related papers (2026-01-29T18:07:18Z) - Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models.
We identify the presence of activation outliers, characterized by abnormally large activation values.
We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
arXiv Detail & Related papers (2025-08-20T17:59:51Z) - Token-Level Uncertainty Estimation for Large Language Model Reasoning [24.56760223952017]
Large Language Models (LLMs) have demonstrated impressive capabilities, but their output quality remains inconsistent across various application scenarios.
We propose a token-level uncertainty estimation framework to enable LLMs to self-assess and self-improve their generation quality in mathematical reasoning.
arXiv Detail & Related papers (2025-05-16T22:47:32Z) - Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding [48.92310906093414]
We introduce a novel approach for calibrating uncertainty quantification (UQ) tailored for multi-modal large language models (LLMs).
We leverage cross-modal consistency in addition to self-consistency to improve the calibration of the multi-modal models.
We evaluate the proposed approach across multiple multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), considering multi-modal models such as LLaVA-Med and LLaVA.
arXiv Detail & Related papers (2025-04-30T19:19:21Z) - Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models [36.81503322875839]
Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering.
This paper investigates representative MLLMs, focusing on their calibration across various scenarios.
We observed miscalibration in their performance, and at the same time, no significant differences in calibration across these scenarios.
arXiv Detail & Related papers (2024-12-19T09:10:07Z) - CLUE: Concept-Level Uncertainty Estimation for Large Language Models [49.92690111618016]
We propose a novel framework for Concept-Level Uncertainty Estimation for Large Language Models (LLMs).
We leverage LLMs to convert output sequences into concept-level representations, breaking down sequences into individual concepts and measuring the uncertainty of each concept separately.
We conduct experiments to demonstrate that CLUE can provide more interpretable uncertainty estimation results compared with sentence-level uncertainty.
arXiv Detail & Related papers (2024-09-04T18:27:12Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities [79.9629927171974]
Uncertainty in Large Language Models (LLMs) is crucial for applications where safety and reliability are important.
We propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs.
arXiv Detail & Related papers (2024-05-30T12:42:05Z) - SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models [9.817185255633758]
Large language models (LLMs) have become increasingly prevalent, offering remarkable text generation capabilities.
A pressing challenge is their tendency to make confidently wrong predictions.
We introduce a novel UQ method, sampling with perturbation for UQ (SPUQ), designed to tackle both aleatoric and epistemic uncertainties.
Our findings show a substantial improvement in model calibration, with a reduction in Expected Calibration Error (ECE) by 50% on average.
arXiv Detail & Related papers (2024-03-04T21:55:22Z) - Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z) - Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z) - Measuring and Modeling Uncertainty Degree for Monocular Depth Estimation [50.920911532133154]
The intrinsic ill-posedness and ordinal-sensitive nature of monocular depth estimation (MDE) models pose major challenges to the estimation of uncertainty degree.
We propose to model the uncertainty of MDE models from the perspective of the inherent probability distributions.
By simply introducing additional training regularization terms, our model, with surprisingly simple formations and without requiring extra modules or multiple inferences, can provide uncertainty estimations with state-of-the-art reliability.
arXiv Detail & Related papers (2023-07-19T12:11:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.