Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
- URL: http://arxiv.org/abs/2505.23799v2
- Date: Mon, 02 Jun 2025 15:55:44 GMT
- Title: Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
- Authors: Xiaoyuan Wu, Weiran Lin, Omer Akgul, Lujo Bauer,
- Abstract summary: Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations.<n>We propose a logit-based ensemble method for estimating LLM consistency.
- Score: 7.902385931726113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility -- one of them being measuring the consistency (the model's confidence in the response, or likelihood of generating a similar response when resampled) of LLM responses. In previous work, measuring consistency often relied on the probability of a response appearing within a pool of resampled responses, or internal states or logits of responses. However, it is not yet clear how well these approaches approximate how humans perceive the consistency of LLM responses. We performed a user study (n=2,976) and found current methods typically do not approximate users' perceptions of LLM consistency very well. We propose a logit-based ensemble method for estimating LLM consistency, and we show that this method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods of estimating LLM consistency without human evaluation are sufficiently imperfect that we suggest evaluation with human input be more broadly used.
Related papers
- Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs [5.161416961439468]
This study proposes an effective uncertainty estimation approach, textbfClusttextbfering-based semtextbfantic contextbfsisttextbfency (textbfCleanse)<n>The effectiveness of Cleanse for detecting hallucination is validated using four off-the-shelf models, LLaMA-7B, LLaMA-13B, LLaMA2-7B and Mistral-7B.
arXiv Detail & Related papers (2025-07-19T14:48:24Z) - Uncertainty Quantification for LLM-Based Survey Simulations [9.303339416902995]
We investigate the use of large language models (LLMs) to simulate human responses to survey questions.<n>Our approach converts imperfect LLM-simulated responses into confidence sets for population parameters.<n>A key innovation lies in determining the optimal number of simulated responses.
arXiv Detail & Related papers (2025-02-25T02:07:29Z) - Evaluating Consistencies in LLM responses through a Semantic Clustering of Question Answering [1.9214041945441436]
We present a new approach for evaluating semanticencies of Large Language Model (LLM)
Our approach evaluates whether LLM re-sponses are semantically congruent for a given question, recognizing that as syntactically different sentences may convey the same meaning.
Using the TruthfulQA dataset to assess LLM responses, the study induces N re-sponses per question and clusters semantically equivalent sentences to measure semantic consistency across 37 categories.
arXiv Detail & Related papers (2024-10-20T16:21:25Z) - The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism [39.392450788666814]
Current evaluations of large language models (LLMs) often overlook non-determinism.
greedy decoding generally outperforms sampling methods for most evaluated tasks.
Smaller LLMs can match or surpass larger models such as GPT-4-Turbo.
arXiv Detail & Related papers (2024-07-15T06:12:17Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment [9.156064716689833]
This study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation.<n>We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B.<n>Results show that LLMs generally generate fewer erroneous simplification outputs compared to the previous state-of-the-art.
arXiv Detail & Related papers (2024-03-08T00:19:24Z) - Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named Rephrase and Respond' (RaR) which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive
Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.