Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models
- URL: http://arxiv.org/abs/2502.14427v1
- Date: Thu, 20 Feb 2025 10:25:13 GMT
- Title: Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models
- Authors: Artem Vazhentsev, Lyudmila Rvanova, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Artem Shelmanov
- Abstract summary: Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs).
In this work, we adapt Mahalanobis Distance (MD) - a well-established UQ technique in classification tasks - for text generation.
Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores.
- Score: 76.17975723711886
- Abstract: Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs). To date, information-based and consistency-based UQ have been the dominant UQ methods for text generation via LLMs. Density-based methods, despite being very effective for UQ in text classification with encoder-based models, have not been very successful with generative LLMs. In this work, we adapt Mahalanobis Distance (MD) - a well-established UQ technique in classification tasks - for text generation and introduce a new supervised UQ method. Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores. Through extensive experiments on eleven datasets, we demonstrate that our approach substantially improves over existing UQ methods, providing accurate and computationally efficient uncertainty scores for both sequence-level selective generation and claim-level fact-checking tasks. Our method also exhibits strong generalization to out-of-domain data, making it suitable for a wide range of LLM-based applications.
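The abstract describes a concrete pipeline: per-token embeddings from several layers, a Mahalanobis Distance (MD) score per token, and a supervised linear model on top of those scores. Below is a minimal Python sketch of that idea, assuming per-layer token embeddings are already extracted from the LLM; the helper names, the mean-pooling of token-level MD scores into sequence features, and the synthetic demo data are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch: layer-wise Mahalanobis-distance features for generation UQ.
# Assumes per-layer token embeddings are already available as NumPy arrays.
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_md_stats(train_embs: np.ndarray, reg: float = 1e-3):
    """Centroid and regularised inverse covariance of in-domain token embeddings."""
    mean = train_embs.mean(axis=0)
    cov = np.cov(train_embs, rowvar=False) + reg * np.eye(train_embs.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis_scores(token_embs: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray):
    """Squared Mahalanobis distance of every token embedding to the centroid."""
    diff = token_embs - mean                              # (n_tokens, d)
    return np.einsum("nd,dk,nk->n", diff, inv_cov, diff)  # (n_tokens,)


def sequence_features(per_layer_token_embs, stats):
    """One pooled MD feature per layer for a single generated sequence (assumed pooling: mean)."""
    return np.array([
        mahalanobis_scores(embs, mean, inv_cov).mean()
        for embs, (mean, inv_cov) in zip(per_layer_token_embs, stats)
    ])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_layers, d = 4, 16
    # Stand-ins for per-layer embeddings of in-domain training tokens.
    stats = [fit_md_stats(rng.normal(size=(500, d))) for _ in range(n_layers)]
    # Toy feature matrix for 32 generations (20 tokens each) with toy labels.
    X = np.stack([
        sequence_features([rng.normal(size=(20, d)) for _ in range(n_layers)], stats)
        for _ in range(32)
    ])
    y = rng.uniform(size=32)             # e.g. correctness / quality supervision
    head = LinearRegression().fit(X, y)  # supervised linear head over MD features
    print("predicted score for first generation:", head.predict(X[:1])[0])
```

In this reading, the linear head's prediction is turned into an uncertainty score (e.g., lower predicted correctness means higher uncertainty) and used for sequence-level selective generation or claim-level fact-checking.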
Related papers
- CoCoA: A Generalized Approach to Uncertainty Quantification by Integrating Confidence and Consistency of LLM Outputs [35.74755307680801]
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches.
We propose a new way of synthesizing model confidence and output consistency that leads to a family of efficient and robust UQ methods.
arXiv Detail & Related papers (2025-02-07T14:30:12Z)
- LLM Confidence Evaluation Measures in Zero-Shot CSS Classification [1.6410524749379551]
We propose an uncertainty quantification (UQ) performance measure tailored for data annotation tasks.
We introduce a novel UQ aggregation strategy that effectively identifies low-confidence LLM annotations and disproportionately uncovers data incorrectly labeled by the LLMs.
Our results demonstrate that our proposed UQ aggregation strategy improves upon existing methods and can be used to significantly improve human-in-the-loop data annotation processes.
arXiv Detail & Related papers (2024-10-16T21:17:18Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines.
We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
- Through the Thicket: A Study of Number-Oriented LLMs derived from Random Forest Models [0.0]
Large Language Models (LLMs) have shown exceptional performance in text processing.
This paper proposes a novel approach to training LLMs using knowledge transfer from a random forest (RF) ensemble.
We generate outputs for fine-tuning, enhancing the model's ability to classify and explain its decisions.
arXiv Detail & Related papers (2024-06-07T13:31:51Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- LUQ: Long-text Uncertainty Quantification for LLMs [29.987010627250527]
Large Language Models (LLMs) are prone to generate nonfactual content.
Uncertainty Quantification (UQ) is pivotal in enhancing our understanding of a model's confidence in its generations.
We propose LUQ-Ensemble, a method that ensembles responses from multiple models and selects the response with the lowest uncertainty.
arXiv Detail & Related papers (2024-03-29T16:49:24Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning [66.85379279041128]
In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking to automatically select exemplars for in-context learning.
DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%.
arXiv Detail & Related papers (2023-10-04T16:44:37Z)