Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
- URL: http://arxiv.org/abs/2504.19254v2
- Date: Wed, 30 Apr 2025 16:49:15 GMT
- Title: Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
- Authors: Dylan Bouchard, Mohit Singh Chauhan, et al.
- Abstract summary: Hallucinations are a persistent problem with Large Language Models (LLMs). We propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
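The tunable ensemble described in the abstract can be illustrated with a short sketch. The code below is not the UQLM API; it is a minimal illustration of the idea under stated assumptions: each scorer emits a response-level confidence in [0, 1], and the ensemble weights are tuned by grid search on a small labeled set. All function and parameter names (`ensemble_score`, `tune_weights`, `threshold`) are illustrative.

```python
# Minimal sketch (not the UQLM API): combine standardized response-level
# confidence scores in [0, 1] with weights tuned on a small labeled set.
# All names here are illustrative assumptions.
from itertools import product
from typing import Dict, List, Sequence

def ensemble_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the individual confidence scores."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total

def tune_weights(
    scored_responses: List[Dict[str, float]],  # per-response outputs of each scorer
    labels: List[int],                         # 1 = correct response, 0 = hallucination
    scorer_names: List[str],
    grid: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Grid-search the ensemble weights to maximize detection accuracy."""
    best_weights, best_acc = None, -1.0
    for combo in product(grid, repeat=len(scorer_names)):
        if sum(combo) == 0:
            continue  # skip the degenerate all-zero weighting
        weights = dict(zip(scorer_names, combo))
        preds = [int(ensemble_score(s, weights) >= threshold) for s in scored_responses]
        acc = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights
```

In practice the weight grid and the tuning objective (accuracy here) would be chosen per use case, which is the kind of customization the abstract emphasizes.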
Related papers
- SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs [2.805517909463769]
Large language models (LLMs) are increasingly deployed across diverse domains, yet they are prone to generating factually incorrect outputs. We introduce a novel and scalable uncertainty-based semantic clustering framework for automated hallucination detection.
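As a rough illustration of the general uncertainty-via-semantic-clustering idea (not the SINdex method itself), one can cluster sampled responses by embedding similarity and score uncertainty by the entropy of the cluster sizes. The `embed` argument below is an assumed, caller-supplied sentence-embedding function.

```python
# Rough sketch of uncertainty scoring via semantic clustering of sampled
# responses (an illustration of the general idea, not the SINdex method).
import numpy as np
from typing import Callable, List

def semantic_uncertainty(
    responses: List[str],
    embed: Callable[[List[str]], np.ndarray],  # assumed sentence-embedding function
    sim_threshold: float = 0.85,
) -> float:
    """Greedily cluster responses by cosine similarity; return the entropy
    of the cluster-size distribution, normalized to [0, 1]."""
    vecs = embed(responses)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    clusters: List[List[int]] = []
    for i, v in enumerate(vecs):
        for cluster in clusters:
            if float(v @ vecs[cluster[0]]) >= sim_threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    sizes = np.array([len(c) for c in clusters], dtype=float)
    probs = sizes / sizes.sum()
    entropy = -(probs * np.log(probs)).sum()
    max_entropy = np.log(len(responses)) if len(responses) > 1 else 1.0
    return float(entropy / max_entropy)  # 0 = consistent, 1 = maximally inconsistent
```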
arXiv Detail & Related papers (2025-03-07T23:25:19Z)
- SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and 'AND'/'OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
- Uncertainty-Aware Fusion: An Ensemble Framework for Mitigating Hallucinations in Large Language Models [2.98260857963929]
Large Language Models (LLMs) are known to hallucinate and generate non-factual outputs, which can undermine user trust. Traditional methods to directly mitigate hallucinations, such as representation editing and contrastive decoding, often require additional training data and involve high implementation complexity. We propose Uncertainty-Aware Fusion (UAF), an ensemble framework that reduces hallucinations by strategically combining multiple LLMs based on their accuracy and self-assessment abilities.
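A minimal sketch of the general pattern, not the UAF implementation: candidate answers from several models are ranked by a blend of each model's historical accuracy and its self-assessed confidence. The `alpha` trade-off and all names are illustrative assumptions.

```python
# Illustrative sketch of the general pattern (not the UAF implementation):
# pick among candidate answers from several models by combining each
# model's historical accuracy with its self-assessed confidence.
from typing import Dict, Tuple

def select_answer(
    candidates: Dict[str, Tuple[str, float]],  # model -> (answer, self_confidence in [0, 1])
    model_accuracy: Dict[str, float],          # model -> benchmark accuracy in [0, 1]
    alpha: float = 0.5,                        # trade-off between accuracy and self-assessment
) -> str:
    """Return the answer whose model maximizes a blended reliability score."""
    def score(model: str) -> float:
        _, confidence = candidates[model]
        return alpha * model_accuracy[model] + (1 - alpha) * confidence
    best_model = max(candidates, key=score)
    return candidates[best_model][0]
```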
arXiv Detail & Related papers (2025-02-22T10:48:18Z)
- Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach.
This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets.
We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
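For context, here is a sketch of the sliding-window listwise reranking pattern referenced above (the baseline this paper improves on, not its self-calibrated method); `rerank_window` stands in for an assumed LLM-backed listwise reranker.

```python
# Illustrative sketch of sliding-window listwise reranking: an LLM reranks
# one window of candidates at a time, sliding from the bottom of the list
# to the top so strong candidates bubble upward.
from typing import Callable, List

def sliding_window_rerank(
    candidates: List[str],
    rerank_window: Callable[[List[str]], List[str]],  # assumed LLM-backed reranker
    window_size: int = 4,
    stride: int = 2,
) -> List[str]:
    ranked = list(candidates)
    start = max(len(ranked) - window_size, 0)
    while True:
        window = ranked[start:start + window_size]
        ranked[start:start + window_size] = rerank_window(window)
        if start == 0:
            break
        start = max(start - stride, 0)
    return ranked
```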
arXiv Detail & Related papers (2024-11-07T10:31:31Z)
- Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite the excellence of LLM judges in many domains, potential issues remain under-explored, undermining their reliability and the scope of their utility.
We identify 12 key potential biases and propose CALM, a new automated bias quantification framework that quantifies and analyzes each type of bias in LLM-as-a-Judge.
Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
- THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models [0.0]
Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models.
This paper introduces THaMES, an integrated framework and library addressing this gap.
THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs.
arXiv Detail & Related papers (2024-09-17T16:55:25Z)
- Cost-Effective Hallucination Detection for LLMs [11.58436181159839]
Large language models (LLMs) can be prone to hallucinations: generating unreliable outputs that are unfaithful to their inputs, contradict external facts, or are internally inconsistent.
Our pipeline for hallucination detection entails: first, producing a confidence score representing the likelihood that a generated answer is a hallucination; second, calibrating the score conditional on attributes of the inputs and candidate response; third, performing detection by thresholding the score.
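The three-step pipeline (score, calibrate, threshold) can be sketched as follows, assuming a Platt-style logistic calibrator from scikit-learn; the feature layout and the 0.5 threshold are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the score -> calibrate -> threshold pipeline described
# above, using a logistic calibrator over the raw score plus input/response
# attributes. Names and defaults are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibrator(raw_scores, attributes, labels):
    """Step 2: calibrate raw confidence scores conditional on attributes.
    raw_scores: (n,) array; attributes: (n, k) array; labels: 1 = hallucination."""
    features = np.column_stack([raw_scores, attributes])
    return LogisticRegression().fit(features, labels)

def detect_hallucination(calibrator, raw_score, attribute_vec, threshold=0.5):
    """Step 3: threshold the calibrated probability of hallucination."""
    features = np.concatenate([[raw_score], attribute_vec]).reshape(1, -1)
    prob = calibrator.predict_proba(features)[0, 1]
    return prob >= threshold, prob
```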
arXiv Detail & Related papers (2024-07-31T08:19:06Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Fine-Grained Self-Endorsement Improves Factuality and Reasoning [72.83651220132495]
This work studies improving large language model (LLM) generations at inference time by mitigating fact-conflicting hallucinations.
We propose a self-endorsement framework that leverages the fine-grained fact-level comparisons across multiple sampled responses.
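A rough sketch of the fact-level endorsement idea (an illustration, not the paper's exact algorithm): facts extracted from one response are kept only when enough of the other sampled responses entail them; `extract_facts` and `entails` are assumed helper functions.

```python
# Rough sketch of fact-level self-endorsement across sampled responses:
# keep only facts from the primary response that a sufficient fraction of
# the other samples support. Helper functions are assumed, not provided.
from typing import Callable, List

def endorsed_facts(
    responses: List[str],
    extract_facts: Callable[[str], List[str]],  # assumed fact extractor
    entails: Callable[[str, str], bool],        # does `response` support `fact`?
    min_endorsement: float = 0.5,
) -> List[str]:
    primary, others = responses[0], responses[1:]
    kept = []
    for fact in extract_facts(primary):
        votes = sum(entails(other, fact) for other in others)
        if others and votes / len(others) >= min_endorsement:
            kept.append(fact)
    return kept
```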
arXiv Detail & Related papers (2024-02-23T22:24:40Z)
- Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
However, LLMs are prone to hallucinating untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z)