Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs
- URL: http://arxiv.org/abs/2507.14649v1
- Date: Sat, 19 Jul 2025 14:48:24 GMT
- Title: Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs
- Authors: Minsuh Joo, Hyunsoo Cho
- Abstract summary: This study proposes an effective uncertainty estimation approach, Clustering-based semantic consistency (Cleanse). The effectiveness of Cleanse for detecting hallucination is validated using four off-the-shelf models: LLaMA-7B, LLaMA-13B, LLaMA2-7B, and Mistral-7B.
- Score: 5.161416961439468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the outstanding performance of large language models (LLMs) across various NLP tasks, hallucinations in LLMs--where LLMs generate inaccurate responses--remain a critical problem, as they directly undermine efforts to build safe and reliable LLMs. Uncertainty estimation is primarily used to measure the hallucination level of LLM responses so that correct and incorrect answers can be distinguished clearly. This study proposes an effective uncertainty estimation approach, Clustering-based semantic consistency (Cleanse). By employing clustering over LLM hidden embeddings, which carry adequate semantic information about the generations, Cleanse quantifies uncertainty as the proportion of intra-cluster consistency within the total pairwise consistency. The effectiveness of Cleanse for detecting hallucination is validated using four off-the-shelf models (LLaMA-7B, LLaMA-13B, LLaMA2-7B, and Mistral-7B) and two question-answering benchmarks (SQuAD and CoQA).
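As a rough illustration of the scoring idea (not the paper's exact procedure: the embedding extraction, clustering algorithm, similarity measure, and threshold below are placeholder assumptions), a Cleanse-style score can be computed as the intra-cluster share of total pairwise consistency among several sampled generations:

```python
# Minimal sketch of a clustering-based consistency score in the spirit of Cleanse.
# Assumptions: cosine similarity as the consistency measure, agglomerative
# clustering over the generations' hidden-state embeddings, and `embeddings`
# as an (N, d) array with one embedding per sampled answer.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

def cleanse_style_score(embeddings: np.ndarray, distance_threshold: float = 0.3) -> float:
    """Fraction of total pairwise consistency that falls inside clusters."""
    sim = cosine_similarity(embeddings)                       # pairwise consistency
    labels = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=distance_threshold,
    ).fit_predict(embeddings)

    rows, cols = np.triu_indices(len(embeddings), k=1)        # unique pairs i < j
    pair_sims = sim[rows, cols]
    total = pair_sims.sum()
    intra = pair_sims[labels[rows] == labels[cols]].sum()
    return float(intra / total) if total > 0 else 1.0
```

A score near 1 means the sampled answers agree semantically; a low score flags a likely hallucinated response, which is how such a signal would typically be evaluated (e.g., via AUROC) on QA benchmarks.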
Related papers
- WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking [14.76224690767612]
Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks.
We introduce WakenLLM, a framework that quantifies the portion of Unknown output attributable to model incapacity.
arXiv Detail & Related papers (2025-07-22T03:21:48Z) - "I know myself better, but not really greatly": How Well Can LLMs Detect and Explain LLM-Generated Texts? [10.454446545249096]
This paper investigates the detection and explanation capabilities of current LLMs across two settings: binary (human vs. LLM-generated) and ternary classification (including an "undecided" class).
We evaluate 6 closed- and open-source LLMs of varying sizes and find that self-detection (LLMs identifying their own outputs) consistently outperforms cross-detection (identifying outputs from other LLMs).
Our findings underscore the limitations of current LLMs in self-detection and self-explanation, highlighting the need for further research to address overfitting and enhance generalizability.
arXiv Detail & Related papers (2025-02-18T11:00:28Z) - Evaluating Consistencies in LLM responses through a Semantic Clustering of Question Answering [1.9214041945441436]
We present a new approach for evaluating the semantic consistency of Large Language Model (LLM) responses.
Our approach evaluates whether LLM responses are semantically congruent for a given question, recognizing that syntactically different sentences may convey the same meaning.
Using the TruthfulQA dataset to assess LLM responses, the study induces N responses per question and clusters semantically equivalent sentences to measure semantic consistency across 37 categories.
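A minimal sketch of such a pipeline, assuming bidirectional NLI entailment as the equivalence test and a simple merge-based grouping (both illustrative choices, not necessarily the paper's exact procedure):

```python
# Hypothetical sketch: group N sampled responses into semantic clusters via
# bidirectional entailment, then report how concentrated the responses are.
from collections import Counter

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_MODEL = "microsoft/deberta-large-mnli"   # any NLI model works; this choice is an assumption
tok = AutoTokenizer.from_pretrained(NLI_MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)

def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model predicts that `premise` entails `hypothesis`."""
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(-1).squeeze(0)
    return nli.config.id2label[int(probs.argmax())].upper() == "ENTAILMENT"

def semantic_clusters(responses: list[str]) -> list[int]:
    """Mutual entailment between two responses puts them in the same cluster."""
    labels = list(range(len(responses)))
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            if labels[i] != labels[j] and entails(responses[i], responses[j]) \
                    and entails(responses[j], responses[i]):
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels

def consistency(responses: list[str]) -> float:
    """Fraction of responses that land in the largest semantic cluster."""
    counts = Counter(semantic_clusters(responses))
    return max(counts.values()) / len(responses)
```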
arXiv Detail & Related papers (2024-10-20T16:21:25Z) - Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement [51.601916604301685]
Large language models (LLMs) generate content that can undermine trust in online discourse.
Current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-LLM collaboration.
To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content.
arXiv Detail & Related papers (2024-10-18T08:14:10Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities [79.9629927171974]
Estimating uncertainty in Large Language Models (LLMs) is crucial for applications where safety and reliability are important.
We propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs.
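A toy version of the kind of quantity KLE builds on, assuming a cosine-similarity Gram matrix as the kernel (the paper derives its kernels from semantic graphs over the generations, so this is only a simplifying stand-in), is the von Neumann entropy of a unit-trace kernel:

```python
# Von Neumann entropy of a unit-trace positive semi-definite kernel over the
# sampled generations; higher entropy indicates more semantic dispersion and
# hence higher uncertainty. The kernel construction itself is an assumption.
import numpy as np

def von_neumann_entropy(kernel: np.ndarray, eps: float = 1e-12) -> float:
    """-sum_i lambda_i * log(lambda_i) over the eigenvalues of K / tr(K)."""
    k = kernel / np.trace(kernel)                        # normalize to unit trace
    eigvals = np.clip(np.linalg.eigvalsh(k), eps, None)  # guard log(0) / tiny negatives
    return float(-(eigvals * np.log(eigvals)).sum())
```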
arXiv Detail & Related papers (2024-05-30T12:42:05Z) - INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection [39.52923659121416]
We propose to explore the dense semantic information retained within INternal States for hallucInation DEtection (INSIDE).
A simple yet effective EigenScore metric is proposed to better evaluate responses' self-consistency.
A test-time feature clipping approach is explored to truncate extreme activations in the internal states.
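A simplified sketch of an EigenScore-like self-consistency measure together with the feature clipping step, where the regularization constant and percentile bound are illustrative assumptions rather than the paper's exact settings:

```python
# `hidden` is a (K, d) array: one internal-state embedding per sampled generation.
import numpy as np

def clip_features(hidden: np.ndarray, pct: float = 99.0) -> np.ndarray:
    """Truncate extreme activations to a symmetric percentile bound (test-time clipping)."""
    bound = np.percentile(np.abs(hidden), pct)
    return np.clip(hidden, -bound, bound)

def eigenscore_like(hidden: np.ndarray, alpha: float = 1e-3) -> float:
    """Log-determinant of the regularized Gram matrix of K centered embeddings."""
    z = hidden - hidden.mean(axis=0, keepdims=True)    # center the K embeddings
    k = z.shape[0]
    gram = (z @ z.T) / k + alpha * np.eye(k)           # K x K, PSD by construction
    _, logdet = np.linalg.slogdet(gram)
    return float(logdet / k)                           # higher -> more diverse, more uncertain
```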
arXiv Detail & Related papers (2024-02-06T06:23:12Z) - Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
arXiv Detail & Related papers (2024-01-23T14:29:17Z) - "Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure relevance assessment using: (i) hallucination rate, which measures the model's tendency to hallucinate an answer when no answer is present in the passages of the non-relevant subset, and (ii) error rate, which measures the model's failure to recognize relevant passages in the relevant subset.
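In code, the two rates reduce to simple proportions over the two subsets; the boolean encodings below are an illustrative assumption about how per-query model verdicts are recorded:

```python
def hallucination_rate(non_relevant_answered: list[bool]) -> float:
    """Share of non-relevant-subset queries where the model still produced an answer
    instead of recognizing that no relevant passage was retrieved."""
    return sum(non_relevant_answered) / len(non_relevant_answered)

def error_rate(relevant_recognized: list[bool]) -> float:
    """Share of relevant-subset queries where the model failed to recognize the
    relevant passages."""
    return sum(not ok for ok in relevant_recognized) / len(relevant_recognized)
```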
arXiv Detail & Related papers (2023-12-18T17:18:04Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)