Related papers: Evaluating the Reliability of Self-Explanations in Large Language Models

URL: http://arxiv.org/abs/2407.14487v1
Date: Fri, 19 Jul 2024 17:41:08 GMT
Title: Evaluating the Reliability of Self-Explanations in Large Language Models
Authors: Korbinian Randl, John Pavlopoulos, Aron Henriksson, Tony Lindgren
Abstract summary: We evaluate two kinds of such self-explanations - extractive and counterfactual. Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process. We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results.
Score: 2.8894038270224867
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations - extractive and counterfactual - using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g. SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.

Related papers

Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations [5.676319658620339]
How a language model predicts the next token and generates content is not generally understandable by humans. This paper investigates local explainability and mechanistic interpretability within Transformer-based large language models.
arXiv Detail & Related papers (2025-10-20T07:43:53Z)
CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection [60.98964268961243]
We propose that guiding models to perform a systematic and comprehensive reasoning process allows models to execute much finer-grained and accurate entailment decisions. We define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection.
arXiv Detail & Related papers (2025-06-05T17:02:52Z)
Disentangling Memory and Reasoning Ability in Large Language Models [97.26827060106581]
We propose a new inference paradigm that decomposes the complex inference process into two distinct and clear actions. Our experiment results show that this decomposition improves model performance and enhances the interpretability of the inference process.
arXiv Detail & Related papers (2024-11-20T17:55:38Z)
Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs [11.805264893752154]
We evaluate the reasoning capabilities of two large language models, LLaMA 2 and Flan-T5, by manipulating facts within two compositional datasets: QASC and Bamboogle. Our findings reveal that while both models leverage (a), Flan-T5 shows more resilience to experiments, having less variance than LLaMA 2. This suggests that models may develop an understanding of transitivity through fine-tuning on knowingly relevant datasets.
arXiv Detail & Related papers (2024-10-26T15:09:07Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
Evaluating Consistency and Reasoning Capabilities of Large Language Models [0.0]
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs.
arXiv Detail & Related papers (2024-04-25T10:03:14Z)
Explainability for Machine Learning Models: From Data Adaptability to User Perception [0.8702432681310401]
This thesis explores the generation of local explanations for already deployed machine learning models. It aims to identify optimal conditions for producing meaningful explanations considering both data and user requirements.
arXiv Detail & Related papers (2024-02-16T18:44:37Z)
Inference to the Best Explanation in Large Language Models [6.037970847418495]
This paper proposes IBE-Eval, a framework inspired by philosophical accounts on Inference to the Best Explanation (IBE). IBE-Eval estimates the plausibility of natural language explanations through a combination of explicit logical and linguistic features. Experiments reveal that IBE-Eval can successfully identify the best explanation with up to 77% accuracy.
arXiv Detail & Related papers (2024-02-16T15:41:23Z)
Counterfactuals of Counterfactuals: a back-translation-inspired approach to analyse counterfactual editors [3.4253416336476246]
We focus on the analysis of counterfactual, contrastive explanations. We propose a new back translation-inspired evaluation methodology. We show that by iteratively feeding the counterfactual to the explainer we can obtain valuable insights into the behaviour of both the predictor and the explainer models.
arXiv Detail & Related papers (2023-05-26T16:04:28Z)
Complementary Explanations for Effective In-Context Learning [77.83124315634386]
Large language models (LLMs) have exhibited remarkable capabilities in learning from explanations in prompts. This work aims to better understand the mechanisms by which explanations are used for in-context learning.
arXiv Detail & Related papers (2022-11-25T04:40:47Z)
Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals. It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation. It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z)
Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [86.60613602337246]
We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations. LAS measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output. We frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage.
arXiv Detail & Related papers (2020-10-08T16:59:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.