Are self-explanations from Large Language Models faithful?
- URL: http://arxiv.org/abs/2401.07927v4
- Date: Thu, 16 May 2024 20:26:43 GMT
- Title: Are self-explanations from Large Language Models faithful?
- Authors: Andreas Madsen, Sarath Chandar, Siva Reddy
- Abstract summary: Large Language Models (LLMs) excel at many tasks and will even explain their reasoning, so-called self-explanations.
It is important to measure whether self-explanations truly reflect the model's behavior.
We propose employing self-consistency checks to measure faithfulness.
- Score: 35.40666730867487
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Instruction-tuned Large Language Models (LLMs) excel at many tasks and will even explain their reasoning, so-called self-explanations. However, convincing but wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. It is therefore important to measure whether self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to perform, since the ground truth is inaccessible and many LLMs only have an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make its prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, feature attribution, and redaction explanations. Our results demonstrate that faithfulness is explanation-, model-, and task-dependent, showing that self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B.
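To make the self-consistency idea concrete, here is a minimal sketch of a redaction-style check against a bare inference API. It is an illustrative reading of the abstract, not the authors' exact protocol: `query_llm`, the prompts, and the sentiment-task framing are all placeholder assumptions.

```python
# Minimal sketch of a redaction-style self-consistency check. This is an
# illustrative reading of the abstract, not the authors' exact protocol.
# `query_llm` is a hypothetical stand-in for any text-in/text-out
# inference API (the only access the abstract assumes for many LLMs).

def query_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your inference API")

def redaction_check(text: str) -> bool:
    """Return True if the self-explanation passes the check, i.e.,
    removing the self-reported important words changes the prediction."""
    task = "Classify the sentiment of this review as positive or negative:\n"

    # 1. The model's prediction on the full input.
    original = query_llm(task + text).strip().lower()

    # 2. The model's self-explanation: which words mattered?
    reply = query_llm(
        task + text + "\nWhich words were most important for your answer? "
        "Reply with a comma-separated list of words only."
    )
    important = {w.strip().lower() for w in reply.split(",")}

    # 3. Redact those words and re-query.
    redacted = " ".join(
        "[REDACTED]" if w.strip(".,!?").lower() in important else w
        for w in text.split()
    )
    new_prediction = query_llm(task + redacted).strip().lower()

    # 4. If the "important" words truly drove the prediction, the model
    #    should no longer reproduce it once they are gone.
    return new_prediction != original
```

Aggregating the pass rate of such checks over a dataset yields a rough faithfulness score per explanation type, model, and task, which is the axis along which the paper reports its results.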
Related papers
- Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models [29.67884478799914]
Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers.
Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level.
We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness.
arXiv Detail & Related papers (2024-10-18T03:45:42Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales [29.33581578047835]
SaySelf is a training framework that teaches large language models to express more accurate fine-grained confidence estimates.
In addition, SaySelf directs LLMs to produce self-reflective rationales that clearly identify gaps in their parametric knowledge.
We show that the generated self-reflective rationales are reasonable and can further contribute to the calibration.
arXiv Detail & Related papers (2024-05-31T16:21:16Z)
- Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? [21.814007454504978]
We posit that large language models (LLMs) should be capable of expressing their intrinsic uncertainty in natural language.
We formalize faithful response uncertainty based on the gap between the model's intrinsic confidence in the assertions it makes and the decisiveness with which they are conveyed (a schematic rendering of this gap appears after this list).
arXiv Detail & Related papers (2024-05-27T07:56:23Z)
- "I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust [51.542856739181474]
We show how different natural language expressions of uncertainty impact participants' reliance, trust, and overall task performance.
We find that first-person expressions of uncertainty decrease participants' confidence in the system and their tendency to agree with the system's answers, while increasing participants' accuracy.
Our findings suggest that using natural language expressions of uncertainty may be an effective approach for reducing overreliance on LLMs, but that the precise language used matters.
arXiv Detail & Related papers (2024-05-01T16:43:55Z)
- Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models [26.11408084129897]
Large Language Models (LLMs) are deployed as powerful tools for several natural language processing (NLP) applications.
Recent works show that modern LLMs can generate self-explanations (SEs), which elicit their intermediate reasoning steps for explaining their behavior.
We discuss the dichotomy between faithfulness and plausibility in SEs generated by LLMs.
arXiv Detail & Related papers (2024-02-07T06:32:50Z)
- Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting [87.30837365008931]
Large language models (LLMs) equipped with Chain-of-Thought (CoT) prompting are able to make accurate incremental predictions even on unscalable tasks.
This study examines the impact of LLMs' step-by-step predictions on gender bias in unscalable tasks.
arXiv Detail & Related papers (2024-01-28T06:50:10Z)
- On Measuring Faithfulness or Self-consistency of Natural Language Explanations [22.37545779269458]
Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought explanations.
Recent work has designed tests that aim to judge the faithfulness of these explanations.
We argue that these tests do not measure faithfulness to the models' inner workings, but rather their self-consistency at the output level.
arXiv Detail & Related papers (2023-11-13T16:53:51Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate LLMs' language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- Language Models with Rationality [57.37201135072838]
Large language models (LLMs) are proficient at question-answering (QA).
However, it is not always clear how (or even if) an answer follows from their latent "beliefs".
arXiv Detail & Related papers (2023-05-23T17:04:25Z)
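For the "Cycles of Thought" entry above, one minimal way to operationalize uncertainty over the distribution of generated explanations is to sample several explain-then-answer generations and use agreement among the final answers as a confidence signal. This is an assumed reduction for illustration, not the paper's exact estimator; `sample_llm` is a hypothetical stochastic (temperature > 0) inference call.

```python
# Hedged sketch of explanation-distribution confidence: sample several
# explain-then-answer generations and score the majority answer by the
# fraction of samples that agree with it. An illustrative reduction of
# the "Cycles of Thought" idea, not the paper's exact estimator.

from collections import Counter

def sample_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your inference API")

def explanation_confidence(question: str, k: int = 10) -> tuple[str, float]:
    """Return (majority answer, agreement rate) over k sampled explanations."""
    answers = []
    for _ in range(k):
        response = sample_llm(
            question + "\nExplain your reasoning step by step, then give "
            "your final answer on its own line as 'Answer: <answer>'."
        )
        # Parse the final answer line; skip malformed samples.
        for line in reversed(response.splitlines()):
            line = line.strip()
            if line.startswith("Answer:"):
                answers.append(line.removeprefix("Answer:").strip())
                break
    if not answers:
        return "", 0.0
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)
```

Answers whose sampled explanations keep converging earn high confidence; unstable explanation sets flag the overconfidence failure mode the entry describes.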
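The "Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?" entry formalizes a gap that can be rendered schematically as below. The notation is illustrative, not the paper's exact definition: A(r) is the set of assertions in response r, dec(a | r) the decisiveness with which r conveys assertion a, and conf(a | x) the model's intrinsic confidence in a given input x.

$$\mathrm{Gap}(x, r) \;=\; \frac{1}{|A(r)|} \sum_{a \in A(r)} \bigl|\, \mathrm{dec}(a \mid r) \;-\; \mathrm{conf}(a \mid x) \,\bigr|$$

A faithful response drives this gap toward zero: hedged language where the model is intrinsically unsure, decisive language only where it is confident.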