UQLM: A Python Package for Uncertainty Quantification in Large Language Models
- URL: http://arxiv.org/abs/2507.06196v1
- Date: Tue, 08 Jul 2025 17:22:32 GMT
- Title: UQLM: A Python Package for Uncertainty Quantification in Large Language Models
- Authors: Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, Zeya Ahmad
- Abstract summary: We introduce UQLM, a Python package for hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.
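As a concrete illustration of what a response-level confidence score in [0, 1] can look like, the sketch below implements a simple black-box consistency scorer: sample several responses to the same prompt and treat mutual agreement as confidence. This is an illustrative sketch only, not the UQLM API; `generate_response` is a hypothetical stand-in for any LLM call, and stdlib string similarity stands in for the semantic-similarity scorers a full toolkit would provide.

```python
# Minimal sketch of a black-box, consistency-based confidence scorer.
# NOTE: this is NOT the UQLM API. `generate_response` is a hypothetical
# placeholder for any LLM call, and difflib similarity stands in for the
# semantic scorers a real toolkit would use.
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import Callable, List


def consistency_confidence(
    prompt: str,
    generate_response: Callable[[str], str],  # hypothetical LLM call
    num_samples: int = 5,
) -> float:
    """Return a response-level confidence score in [0, 1].

    The prompt is answered `num_samples` times; higher mutual agreement
    among the sampled responses is treated as higher confidence, and low
    agreement as a hallucination warning signal.
    """
    responses: List[str] = [generate_response(prompt) for _ in range(num_samples)]
    if len(responses) < 2:
        return 1.0  # a single sample carries no disagreement signal
    pairwise = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(responses, 2)
    ]
    return mean(pairwise)  # already bounded in [0, 1]
```

Scores near 1 indicate the sampled responses agree with each other; scores near 0 flag likely hallucination and can be used to gate or flag an answer before it reaches a downstream application.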
Related papers
- Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs [114.46334319795785]
Large language models (LLMs) exhibit impressive fluency, but often produce critical errors known as "hallucinations". We propose RAUQ (Recurrent Attention-based Uncertainty Quantification), an unsupervised approach that leverages intrinsic attention patterns in transformers to detect hallucinations efficiently. Experiments across 4 LLMs and 12 question answering, summarization, and translation tasks demonstrate that RAUQ yields excellent results.
arXiv Detail & Related papers (2025-05-26T14:28:37Z)
- A Head to Predict and a Head to Question: Pre-trained Uncertainty Quantification Heads for Hallucination Detection in LLM Outputs [71.97006967209539]
Large Language Models (LLMs) have the tendency to hallucinate, i.e., to sporadically generate false or fabricated information. Uncertainty quantification (UQ) provides a framework for assessing the reliability of model outputs. We pre-train a collection of UQ heads for popular LLM series, including Mistral, Llama, and Gemma 2.
arXiv Detail & Related papers (2025-05-13T03:30:26Z)
- CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought
Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise. We propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought.
arXiv Detail & Related papers (2025-02-24T14:48:06Z)
- Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models
Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs). In this work, we adapt Mahalanobis Distance (MD) - a well-established UQ technique in classification tasks - for text generation. Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores. (A hedged sketch of this token-level pipeline appears after this list.)
arXiv Detail & Related papers (2025-02-20T10:25:13Z)
- Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches. We propose a novel approach to integrating model confidence with output consistency, resulting in a family of efficient and robust UQ methods. We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation.
arXiv Detail & Related papers (2025-02-07T14:30:12Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
- LUQ: Long-text Uncertainty Quantification for LLMs
Large Language Models (LLMs) are prone to generate nonfactual content.
Uncertainty Quantification (UQ) is pivotal in enhancing our understanding of a model's confidence on its generation.
We propose LUQ-Ensemble, a method that ensembles responses from multiple models and selects the response with the lowest uncertainty.
arXiv Detail & Related papers (2024-03-29T16:49:24Z)
- Assessing the Reliability of Large Language Model Knowledge
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models
Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities.
We propose and compare several confidence/uncertainty measures, applying them to *selective NLG* where unreliable results could either be ignored or yielded for further assessment.
Results reveal that a simple measure for the semantic dispersion can be a reliable predictor of the quality of LLM responses.
arXiv Detail & Related papers (2023-05-30T16:31:26Z)
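The token-level Mahalanobis-distance entry above outlines a concrete pipeline: extract token embeddings from several LLM layers, score each token against a reference Gaussian, and map the resulting features to an uncertainty score with a linear regression. The sketch below illustrates that pipeline under stated assumptions; it uses randomly generated placeholder embeddings and labels where real usage would use actual hidden states and held-out supervision, and it is not the authors' released code.

```python
# Hedged sketch of token-level Mahalanobis-distance (MD) uncertainty scoring.
# Embeddings and labels are random placeholders; in practice they would come
# from LLM hidden states and annotated training data.
import numpy as np
from sklearn.linear_model import LinearRegression


def layer_md_scores(token_embs: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each token embedding to a reference Gaussian."""
    diff = token_embs - mean                                  # (num_tokens, dim)
    return np.sqrt(np.einsum("td,dk,tk->t", diff, cov_inv, diff))


# Toy data standing in for real multi-layer LLM hidden states.
rng = np.random.default_rng(0)
num_layers, num_tokens, dim = 3, 20, 16
reference = rng.normal(size=(num_layers, 500, dim))           # "in-distribution" embeddings
response = rng.normal(size=(num_layers, num_tokens, dim))     # embeddings of one response

# Fit a Gaussian per layer on the reference embeddings, then score the response.
features = []
for layer in range(num_layers):
    mean = reference[layer].mean(axis=0)
    cov = np.cov(reference[layer], rowvar=False) + 1e-6 * np.eye(dim)  # regularized covariance
    md = layer_md_scores(response[layer], mean, np.linalg.inv(cov))
    features.append(md.mean())                                # aggregate per-token MD per layer

# A linear regression maps the per-layer MD features to a single uncertainty
# score; the placeholder labels stand in for supervision on held-out data.
X_train = rng.normal(size=(100, num_layers))
y_train = rng.uniform(size=100)
reg = LinearRegression().fit(X_train, y_train)
uncertainty = reg.predict(np.array(features)[None, :])[0]
print(f"estimated uncertainty: {uncertainty:.3f}")
```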