When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
- URL: http://arxiv.org/abs/2505.16170v2
- Date: Tue, 27 May 2025 21:14:53 GMT
- Title: When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
- Authors: Yuqing Yang, Robin Jia
- Abstract summary: We define the behavior of acknowledging errors in previously generated answers as "retraction". We demonstrate that retraction is closely tied to indicators of models' internal belief. Experiments show that internal belief causally influences model retraction.
- Score: 24.49830646625232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as "retraction" and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledge. While LLMs are capable of retraction, they do so only infrequently. We demonstrate that retraction is closely tied to previously identified indicators of models' internal belief: models fail to retract wrong answers that they "believe" to be factually correct. Steering experiments further demonstrate that internal belief causally influences model retraction. In particular, when the model does not believe its answer, this not only encourages the model to attempt to verify the answer, but also alters attention behavior during self-verification. Finally, we demonstrate that simple supervised fine-tuning significantly improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available on https://github.com/ayyyq/llm-retraction.
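To make the setup concrete, below is a minimal, hypothetical sketch (not the authors' released code; see the linked repository for that) of two ingredients the abstract mentions: a linear probe that reads an "internal belief" signal out of hidden states, and a crude surface-level check for whether a continuation retracts an earlier answer. The hidden states, labels, and retraction markers are synthetic placeholders for illustration only.

```python
# Hypothetical sketch: a linear "belief" probe over hidden states plus a simple
# retraction detector. In a real pipeline the hidden states would be extracted
# from the LLM at the answer position; here they are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

d = 64                                   # assumed hidden size
X = rng.normal(size=(200, d))            # placeholder hidden states
y = (X[:, 0] > 0).astype(int)            # synthetic "believes answer is true" label

probe = LogisticRegression(max_iter=1000).fit(X, y)
belief_scores = probe.predict_proba(X[:5])[:, 1]   # higher = stronger internal belief
print("belief scores:", belief_scores.round(2))

RETRACTION_MARKERS = ("i was wrong", "that answer is incorrect", "i apologize, the correct")

def is_retraction(continuation: str) -> bool:
    """Crude surface check for whether a continuation acknowledges the earlier error."""
    text = continuation.lower()
    return any(marker in text for marker in RETRACTION_MARKERS)

print(is_retraction("Wait, I was wrong: the capital of Australia is Canberra."))
```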
Related papers
- Rectifying Belief Space via Unlearning to Harness LLMs' Reasoning [36.74368293113009]
We propose a method to rectify the belief space by suppressing spurious beliefs while simultaneously enhancing true ones. Our approach first identifies the beliefs that lead to incorrect or correct answers by prompting the model to generate textual explanations. We then apply unlearning to suppress the identified spurious beliefs and enhance the true ones, effectively rectifying the model's belief space.
arXiv Detail & Related papers (2025-02-28T00:57:45Z) - How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? [55.33467849079774]
Low-rank adaptation (LoRA) is a popular and efficient technique for updating Large Language Models or adapting them to specific domains. We investigate how new facts can be incorporated into the LLM using LoRA without compromising previously learned knowledge.
arXiv Detail & Related papers (2025-02-20T12:31:03Z) - Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs' Memory [15.986679553468989]
Large language models (LLMs) have shown promise as potential knowledge bases, yet they often struggle with question-answering tasks and are prone to hallucinations. We develop SkipUnsure, a method that improves answer accuracy by leveraging detected but unexpressed knowledge.
arXiv Detail & Related papers (2024-12-30T10:29:18Z) - Understanding Knowledge Drift in LLMs through Misinformation [11.605377799885238]
Large Language Models (LLMs) have revolutionized numerous applications, making them an integral part of our digital ecosystem.
We analyze the susceptibility of state-of-the-art LLMs to factual inaccuracies when they encounter false information in a QnA scenario.
Our experiments reveal that an LLM's uncertainty can increase by up to 56.6% when the question is answered incorrectly.
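As an illustration of the kind of uncertainty signal such a study can track, the sketch below computes the entropy of a next-token distribution from raw logits; this is an assumed, generic proxy, not the paper's exact metric.

```python
# Assumed uncertainty proxy: entropy of the softmax over next-token logits.
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Entropy (nats) of the softmax distribution over a single logit vector."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

print(token_entropy(np.array([5.0, 1.0, 0.5])))   # peaked distribution -> low entropy
print(token_entropy(np.array([1.0, 1.0, 1.0])))   # uniform distribution -> high entropy
```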
arXiv Detail & Related papers (2024-09-11T08:11:16Z) - LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models [69.68379406317682]
We introduce a listener-aware finetuning method (LACIE) to calibrate implicit and explicit confidence markers.
We show that LACIE models the listener, considering not only whether an answer is right, but whether it will be accepted by a listener.
We find that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers.
arXiv Detail & Related papers (2024-05-31T17:16:38Z) - ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence [22.89240200094172]
We benchmark six top-performing large language models (LLMs) on a dataset of over 1200 questions. We find that LLMs are susceptible to adopting incorrect retrieved content over 60% of the time. We exploit this finding and demonstrate simple methods for improving model accuracy where there is conflicting retrieved content.
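A toy sketch of this style of measurement: compare the model's closed-book (prior) answer with its answer after seeing a conflicting retrieved passage, and count how often the retrieved claim wins. The field names below are illustrative, not the benchmark's actual schema.

```python
# Illustrative records: the model's prior answer, a conflicting retrieved claim,
# and the answer it gives once the retrieved passage is in context.
records = [
    {"prior_answer": "Canberra", "retrieved_claim": "Sydney", "answer_with_context": "Sydney"},
    {"prior_answer": "Paris",    "retrieved_claim": "Lyon",   "answer_with_context": "Paris"},
]

# Count how often the model adopts the (incorrect) retrieved claim over its prior.
adopted = sum(r["answer_with_context"] == r["retrieved_claim"] for r in records)
print(f"adopted retrieved claim in {adopted}/{len(records)} cases")
```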
arXiv Detail & Related papers (2024-04-16T00:43:03Z) - R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face challenges.
Previous instruction tuning methods force the model to complete a sentence regardless of whether it actually possesses the relevant knowledge.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning).
Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
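The sketch below shows one plausible way to build refusal-aware training data in the spirit of R-Tuning, assuming a hypothetical `model_answer` helper that returns the model's current answer; it is an illustration, not the paper's exact recipe.

```python
# Illustrative refusal-aware data construction: questions the model already answers
# correctly keep their gold answers as targets; questions it gets wrong are given a
# refusal target instead.
def build_refusal_aware_data(examples, model_answer):
    """examples: list of (question, gold_answer); model_answer: callable question -> str."""
    data = []
    for question, gold in examples:
        predicted = model_answer(question)
        target = gold if predicted.strip().lower() == gold.strip().lower() else "I don't know."
        data.append({"instruction": question, "output": target})
    return data

# Toy usage with a stand-in "model" that only knows one fact.
examples = [("Capital of France?", "Paris"), ("Atomic number of tungsten?", "74")]
fake_model = lambda q: "Paris" if "France" in q else "not sure"
print(build_refusal_aware_data(examples, fake_model))
```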
arXiv Detail & Related papers (2023-11-16T08:45:44Z) - Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment [82.60594940370919]
We propose the FlipFlop experiment to study the multi-turn behavior of Large Language Models (LLMs).
We show that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17% (the FlipFlop effect).
We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate sycophantic behavior, reducing performance deterioration by 60%, but cannot resolve it entirely.
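A small, assumed sketch of a FlipFlop-style measurement: ask once, challenge with "Are you sure?", and report the flip rate and accuracy drop. The `chat` callable is a stand-in, not a real API client.

```python
# Hypothetical flip-rate measurement over a list of QA items.
def flipflop_stats(items, chat):
    """items: list of dicts with 'question' and 'gold'; chat(messages) -> answer string."""
    flips, first_correct, final_correct = 0, 0, 0
    for item in items:
        history = [{"role": "user", "content": item["question"]}]
        first = chat(history)
        history += [{"role": "assistant", "content": first},
                    {"role": "user", "content": "Are you sure? Please reconsider."}]
        final = chat(history)
        flips += first != final
        first_correct += first == item["gold"]
        final_correct += final == item["gold"]
    n = len(items)
    return {"flip_rate": flips / n, "accuracy_drop": (first_correct - final_correct) / n}

# Toy usage with a stand-in model that caves to the challenge on one item.
items = [{"question": "2+2?", "gold": "4"}, {"question": "Capital of Italy?", "gold": "Rome"}]
answers = iter(["4", "5", "Rome", "Rome"])
print(flipflop_stats(items, lambda messages: next(answers)))
```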
arXiv Detail & Related papers (2023-11-14T23:40:22Z) - Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination."
We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
arXiv Detail & Related papers (2023-10-07T12:06:53Z) - An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information. This study empirically evaluates the forgetting phenomenon in large language models during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z) - Calibration Meets Explanation: A Simple and Effective Approach for Model Confidence Estimates [21.017890579840145]
We propose a method named CME that leverages model explanations to make the model less confident with non-inductive attributions.
We conduct extensive experiments on six datasets with two popular pre-trained language models.
Our findings highlight that model explanations can help calibrate posterior estimates.
arXiv Detail & Related papers (2022-11-06T06:17:21Z)