On Measuring Faithfulness or Self-consistency of Natural Language Explanations
- URL: http://arxiv.org/abs/2311.07466v3
- Date: Sat, 1 Jun 2024 07:57:52 GMT
- Title: On Measuring Faithfulness or Self-consistency of Natural Language Explanations
- Authors: Letitia Parcalabescu, Anette Frank
- Abstract summary: Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought explanations.
Recent work has designed tests that aim to judge the faithfulness of these explanations.
We argue that these tests do not measure faithfulness to the models' inner workings -- but rather their self-consistency at the output level.
- Score: 22.37545779269458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought (CoT) explanations. But an LLM could make up reasonable-sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations. In this work we argue that these faithfulness tests do not measure faithfulness to the models' inner workings -- but rather their self-consistency at the output level. Our contributions are three-fold: i) we clarify the status of faithfulness tests in view of model explainability, characterising them as self-consistency tests instead; ii) we underline this assessment by constructing a Comparative Consistency Bank for self-consistency tests that, for the first time, compares existing tests on a common suite of 11 open LLMs and 5 tasks; and iii) we introduce a new self-consistency measure, CC-SHAP. CC-SHAP is a fine-grained measure (not a test) of LLM self-consistency: it compares how a model's input contributes to the predicted answer and to generating the explanation. This fine-grained metric lets us compare LLM behaviour when making predictions and analyse the effect of other consistency tests at a deeper level, taking us one step further towards measuring faithfulness by bringing us closer to the internals of the model than strictly surface, output-oriented tests. Our code is available at \url{https://github.com/Heidelberg-NLP/CC-SHAP}.
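The core idea of CC-SHAP can be sketched concisely. The Python snippet below is an illustrative approximation only: it assumes a leave-one-out masking proxy in place of the Shapley values used by the paper and compares the two contribution vectors with cosine similarity. The function names (`contributions`, `self_consistency`) and the `score_fn` callables are hypothetical and are not taken from the authors' released code at the URL above.

```python
# Illustrative sketch of the intuition behind CC-SHAP, NOT the paper's method:
# token contributions are approximated with leave-one-out masking instead of
# Shapley values, and the answer/explanation contribution vectors are compared
# by cosine similarity.
import math
from typing import Callable, List


def contributions(tokens: List[str],
                  score_fn: Callable[[List[str]], float],
                  mask: str = "[MASK]") -> List[float]:
    """Leave-one-out contribution of each input token to score_fn.

    score_fn is assumed to return, e.g., the log-probability the model assigns
    to its predicted answer (or to its generated explanation) given the tokens.
    """
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + [mask] + tokens[i + 1:])
            for i in range(len(tokens))]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two contribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def self_consistency(tokens: List[str],
                     answer_score: Callable[[List[str]], float],
                     explanation_score: Callable[[List[str]], float]) -> float:
    """High values mean the same input tokens drive both the predicted answer
    and the generated explanation -- the property CC-SHAP quantifies."""
    return cosine(contributions(tokens, answer_score),
                  contributions(tokens, explanation_score))
```

A score near 1 under this proxy suggests the explanation is generated for roughly the same input reasons as the answer; the paper's measure uses proper Shapley-value attribution, which is what makes CC-SHAP a continuous, fine-grained measure rather than a pass/fail test.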
Related papers
- Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models [29.67884478799914]
Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers.
Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level.
We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness.
arXiv Detail & Related papers (2024-10-18T03:45:42Z) - Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators [6.403926452181712]
Large Language Models (LLMs) tend to be unreliable in the factuality of their answers.
We present a survey and empirical comparison of estimators of factual confidence.
Our experiments indicate that trained hidden-state probes provide the most reliable confidence estimates.
arXiv Detail & Related papers (2024-06-19T10:11:37Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models [24.144513068228903]
We introduce Correlational Explanatory Faithfulness (CEF), a metric that can be used in faithfulness tests based on input interventions.
Our metric accounts for the total shift in the model's predicted label distribution.
We then introduce the Correlational Counterfactual Test (CCT) by instantiating CEF on the Counterfactual Test.
arXiv Detail & Related papers (2024-04-04T04:20:04Z) - Are self-explanations from Large Language Models faithful? [35.40666730867487]
Large Language Models (LLMs) excel at many tasks and can even explain their reasoning through so-called self-explanations.
It is important to measure whether these self-explanations truly reflect the model's behavior.
We propose employing self-consistency checks to measure faithfulness.
arXiv Detail & Related papers (2024-01-15T19:39:15Z) - Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency; see the sketch after this list) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting [43.458726163197824]
Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output.
We find that CoT explanations can systematically misrepresent the true reason for a model's prediction.
arXiv Detail & Related papers (2023-05-07T22:44:25Z) - The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions [93.62888099134028]
We find that the performance of state-of-the-art models on Natural Language Inference (NLI) and Reading Comprehension (RC) analysis/stress sets can be highly unstable.
This raises questions such as: (1) how will this instability affect the reliability of conclusions drawn from these analysis sets?
We give both theoretical explanations and empirical evidence regarding the source of the instability.
arXiv Detail & Related papers (2020-04-28T15:41:12Z)
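For context on the Goodhart's Law entry above: the two ERASER rationale metrics it examines have simple probability-based definitions, which helps explain why they can be gamed without touching in-distribution predictions. The sketch below restates those definitions; `predict_proba` is a hypothetical stand-in for any classifier that returns the probability of the originally predicted class for a (possibly token-deleted) input.

```python
# Sketch of the ERASER rationale metrics referenced in the Goodhart's Law entry.
# predict_proba is a hypothetical callable: tokens -> probability of the
# originally predicted class.
from typing import Callable, List, Set


def comprehensiveness(tokens: List[str], rationale: Set[int],
                      predict_proba: Callable[[List[str]], float]) -> float:
    """Probability drop when the rationale tokens are deleted from the input.
    Larger drops suggest the rationale was actually needed for the prediction."""
    reduced = [t for i, t in enumerate(tokens) if i not in rationale]
    return predict_proba(tokens) - predict_proba(reduced)


def sufficiency(tokens: List[str], rationale: Set[int],
                predict_proba: Callable[[List[str]], float]) -> float:
    """Probability drop when the model sees only the rationale tokens.
    Smaller drops suggest the rationale alone supports the prediction."""
    kept = [t for i, t in enumerate(tokens) if i in rationale]
    return predict_proba(tokens) - predict_proba(kept)
```

Because both quantities depend only on the model's outputs under token deletion -- inputs that are typically out of distribution -- behaviour on those deleted inputs can be changed to improve the scores without altering in-distribution predictions or explanations, which is the manipulation the entry above describes.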