Gauging Overprecision in LLMs: An Empirical Study
- URL: http://arxiv.org/abs/2504.12098v2
- Date: Sun, 27 Apr 2025 11:31:23 GMT
- Title: Gauging Overprecision in LLMs: An Empirical Study
- Authors: Adil Bahaj, Hamed Rahimi, Mohamed Chetouani, Mounir Ghogho
- Abstract summary: This study is inspired by a different aspect of overconfidence in cognitive science called \textit{overprecision}. In the generation phase, we prompt the LLM to generate answers to numerical questions in the form of intervals with a certain level of confidence. In the refinement phase, answers from the previous phase are refined to generate better answers.
- Score: 5.359801516815977
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, overconfidence in large language models (LLMs) has garnered considerable attention due to its fundamental importance in quantifying the trustworthiness of LLM generation. However, existing approaches prompt the \textit{black box LLMs} to produce their confidence (\textit{verbalized confidence}), which can be subject to many biases and hallucinations. Inspired by a different aspect of overconfidence in cognitive science called \textit{overprecision}, we designed a framework for its study in black box LLMs. This framework contains three main phases: 1) generation, 2) refinement and 3) evaluation. In the generation phase, we prompt the LLM to generate answers to numerical questions in the form of intervals with a certain level of confidence. This confidence level is imposed in the prompt and is not left for the LLM to generate, as in previous approaches. We use various prompting techniques and use the same prompt multiple times to gauge the effects of randomness in the generation process. In the refinement phase, answers from the previous phase are refined to generate better answers. The LLM answers are evaluated and studied in the evaluation phase to understand the model's internal workings. This study allowed us to gain various insights into LLM overprecision: 1) LLMs are highly uncalibrated for numerical tasks; 2) there is no correlation between the length of the interval and the imposed confidence level, which can be symptomatic of either a) a lack of understanding of the concept of confidence or b) an inability to adjust self-confidence by following instructions; 3) LLM numerical precision differs depending on the task, scale of answer and prompting technique; 4) refinement of answers does not improve precision in most cases. We believe this study offers new perspectives on LLM overconfidence and serves as a strong baseline for overprecision in LLMs.
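The evaluation phase described above amounts to checking whether intervals elicited at an imposed confidence level actually contain the true answer at that rate. The sketch below is a minimal, hypothetical illustration of that coverage check (the data class, field names, and toy data are assumptions, not the paper's actual code or datasets); a well-calibrated model's coverage should match the imposed level, and coverage far below it is the overprecision the paper studies.

```python
from dataclasses import dataclass

@dataclass
class IntervalAnswer:
    low: float         # lower bound of the LLM's interval
    high: float        # upper bound of the LLM's interval
    confidence: float  # confidence level imposed in the prompt, e.g. 0.9
    truth: float       # ground-truth numerical answer

def coverage_by_confidence(answers):
    """Fraction of intervals containing the truth, grouped by imposed confidence level."""
    groups = {}
    for a in answers:
        groups.setdefault(a.confidence, []).append(a.low <= a.truth <= a.high)
    return {c: sum(hits) / len(hits) for c, hits in groups.items()}

# Toy data: four intervals elicited at 90% imposed confidence,
# but only one of them actually covers the true value.
toy = [
    IntervalAnswer(10, 20, 0.9, 15),     # covers the truth
    IntervalAnswer(100, 110, 0.9, 250),  # misses
    IntervalAnswer(0, 1, 0.9, 7),        # misses
    IntervalAnswer(5, 6, 0.9, 42),       # misses
]
print(coverage_by_confidence(toy))  # {0.9: 0.25} -- far below 0.9: overprecise
```

A large gap between the imposed level (0.9) and the empirical coverage (0.25) is the kind of miscalibration the abstract reports for numerical tasks.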
Related papers
- Understanding the Dark Side of LLMs' Intrinsic Self-Correction [55.51468462722138]
Intrinsic self-correction was proposed to improve LLMs' responses via feedback prompts solely based on their inherent capability.
Recent works show that LLMs' intrinsic self-correction fails without oracle labels as feedback prompts.
We find that intrinsic self-correction can cause LLMs to waver in both intermediate and final answers and can introduce prompt bias on simple factual questions.
arXiv Detail & Related papers (2024-12-19T15:39:31Z) - Learning to Route LLMs with Confidence Tokens [43.63392143501436]
We study the extent to which large language models can reliably indicate confidence in their answers.
We propose Self-REF, a lightweight training strategy to teach LLMs to express confidence in a reliable manner.
Compared to conventional approaches such as verbalizing confidence and examining token probabilities, we demonstrate empirically that confidence tokens show significant improvements in downstream routing and rejection learning tasks.
arXiv Detail & Related papers (2024-10-17T07:28:18Z) - SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales [29.33581578047835]
SaySelf is a training framework that teaches large language models to express more accurate fine-grained confidence estimates.
In addition, SaySelf directs LLMs to produce self-reflective rationales that clearly identify gaps in their parametric knowledge.
We show that the generated self-reflective rationales are reasonable and can further contribute to the calibration.
arXiv Detail & Related papers (2024-05-31T16:21:16Z) - Enhancing Confidence Expression in Large Language Models Through Learning from Past Experience [41.06726400259579]
Large Language Models (LLMs) have exhibited remarkable performance across various downstream tasks.
We propose a method of Learning from Past experience (LePe) to enhance the capability for confidence expression.
arXiv Detail & Related papers (2024-04-16T06:47:49Z) - Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection [90.71323430635593]
We propose a novel self-detection paradigm that considers the comprehensive answer space beyond LLM-generated answers.
Building upon this paradigm, we introduce a two-step framework, which firstly instructs LLM to reflect and provide justifications for each candidate answer.
This framework can be seamlessly integrated with existing approaches for superior self-detection.
arXiv Detail & Related papers (2024-03-15T02:38:26Z) - Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models [47.439995799065755]
We pioneer the exploration of LLM trustworthiness during pre-training.
We focus on five key dimensions: reliability, privacy, toxicity, fairness, and robustness.
We are the first to observe a similar two-phase phenomenon: fitting and compression.
arXiv Detail & Related papers (2024-02-29T18:55:06Z) - Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models [84.94220787791389]
We propose Fact-and-Reflection (FaR) prompting, which improves the LLM calibration in two steps.
Experiments show that FaR achieves significantly better calibration; it lowers the Expected Calibration Error by 23.5%.
FaR even elicits the capability of verbally expressing concerns in less confident scenarios.
arXiv Detail & Related papers (2024-02-27T01:37:23Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal the comprehensiveness of a language model's grasp of language through its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness [16.35655151252159]
We introduce BSDetector, a method for detecting bad and speculative answers from a pretrained Large Language Model.
Our uncertainty quantification technique works for any LLM accessible only via a black-box API.
arXiv Detail & Related papers (2023-08-30T17:53:25Z) - Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning.
This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation.
We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency.
arXiv Detail & Related papers (2023-06-22T17:31:44Z)
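The last entry's framework combines verbalized confidence with sampling and consistency-based aggregation. The sketch below is an illustrative, assumed implementation of that idea (the function name, the `(answer, verbalized_confidence)` pair format, and the 0.5 blending weight are all choices made for this sketch, not the paper's method): sample the same prompt several times and combine answer agreement with the averaged verbalized confidence.

```python
from collections import Counter

def aggregate_confidence(samples):
    """Black-box confidence estimate from repeated sampling.

    `samples` is a list of (answer, verbalized_confidence) pairs obtained by
    sending the same prompt to the model several times. Agreement across
    samples (consistency) is blended with the model's own stated confidence.
    """
    counts = Counter(answer for answer, _ in samples)
    top_answer, top_count = counts.most_common(1)[0]
    consistency = top_count / len(samples)  # fraction agreeing with the mode
    verbalized = [c for a, c in samples if a == top_answer]
    avg_verbalized = sum(verbalized) / len(verbalized)
    # Equal-weight blend of the two signals; the weight is arbitrary here.
    return top_answer, 0.5 * consistency + 0.5 * avg_verbalized

# Toy run: three of four samples agree on "Paris".
samples = [("Paris", 0.9), ("Paris", 0.8), ("Lyon", 0.7), ("Paris", 0.95)]
answer, conf = aggregate_confidence(samples)
print(answer, round(conf, 3))  # Paris 0.817
```

Consistency here is 3/4 = 0.75 and the averaged verbalized confidence of the majority answer is about 0.883, so the blended estimate is about 0.817; the point of such aggregation is that agreement across samples tends to be a less biased signal than a single verbalized number.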
This list is automatically generated from the titles and abstracts of the papers in this site.