Related papers: Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

URL: http://arxiv.org/abs/2506.09277v2
Date: Thu, 12 Jun 2025 13:30:28 GMT
Title: Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models
Authors: Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot,
Abstract summary: Large Language Models (LLM) have demonstrated the capability of generating free text self Natural Language Explanation (self-NLE) to justify their answers.<n>This work introduces a novel flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE.<n>The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning.
Score: 9.499055857747322
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large Language Models (LLM) have demonstrated the capability of generating free text self Natural Language Explanation (self-NLE) to justify their answers. Despite their logical appearance, self-NLE do not necessarily reflect the LLM actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model's reasoning. This work introduces a novel flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE by directly comparing the latter with interpretations of the model's internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLE.

Related papers

Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability [2.566497773003048]
We argue that neuron identification can be viewed as the inverse process of machine learning.<n>We present the first theoretical analysis of two fundamental challenges: faithfulness and stability.<n> Experiments on both synthetic and real data validate our theoretical results and demonstrate the practicality of our method.
arXiv Detail & Related papers (2025-12-19T21:55:17Z)
Self-Critique and Refinement for Faithful Natural Language Explanations [15.04835537752639]
We introduce Self-critique and Refinement for Natural Language Explanations.<n>This framework enables models to improve the faithfulness of their own explanations.<n>We show that SR-NLE significantly reduces unfaithfulness rates.
arXiv Detail & Related papers (2025-05-28T20:08:42Z)
Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in ubiquitous deployment of large language models (LLMs)<n>We provide evidence supporting the presence of LLMs' internal compass that dictate the correctness of factual recall at the time of generation.<n>Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z)
Meta-Representational Predictive Coding: Biomimetic Self-Supervised Learning [51.22185316175418]
We present a new form of predictive coding that we call meta-representational predictive coding (MPC)<n>MPC sidesteps the need for learning a generative model of sensory input by learning to predict representations of sensory input across parallel streams.
arXiv Detail & Related papers (2025-03-22T22:13:14Z)
A Comprehensive Survey on Self-Interpretable Neural Networks [36.0575431131253]
Self-interpretable neural networks inherently reveal the prediction rationale through the model structures.<n>We first collect and review existing works on self-interpretable neural networks and provide a structured summary of their methodologies.<n>We also present concrete, visualized examples of model explanations and discuss their applicability across diverse scenarios.
arXiv Detail & Related papers (2025-01-26T18:50:16Z)
Towards Logically Consistent Language Models via Probabilistic Reasoning [14.317886666902822]
Large language models (LLMs) are a promising venue for natural language understanding and generation tasks. LLMs are prone to generate non-factual information and to contradict themselves when prompted to reason about beliefs of the world. We introduce a training objective that teaches a LLM to be consistent with external knowledge in the form of a set of facts and rules.
arXiv Detail & Related papers (2024-04-19T12:23:57Z)
The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models [24.144513068228903]
We introduce Correlational Explanatory Faithfulness (CEF), a metric that can be used in faithfulness tests based on input interventions. Our metric accounts for the total shift in the model's predicted label distribution. We then introduce the Correlational Counterfactual Test (CCT) by instantiating CEF on the Counterfactual Test.
arXiv Detail & Related papers (2024-04-04T04:20:04Z)
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrade on others. We formally define LLM's self-bias - the tendency to favor its own generation. We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z)
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation [71.91287418249688]
Large language models (LLMs) often struggle with factual inaccuracies, even when they hold relevant knowledge. We leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality. We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks.
arXiv Detail & Related papers (2024-02-14T15:52:42Z)
FaithLM: Towards Faithful Explanations for Large Language Models [67.29893340289779]
Large Language Models (LLMs) have become proficient in addressing complex tasks by leveraging their internal knowledge and reasoning capabilities. The black-box nature of these models complicates the task of explaining their decision-making processes. We introduce FaithLM to explain the decision of LLMs with natural language (NL) explanations.
arXiv Detail & Related papers (2024-02-07T09:09:14Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks. The lack of interpretability due to their black-box'' nature poses challenges for responsible implementation. We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z)
Large Language Models Cannot Self-Correct Reasoning Yet [78.16697476530994]
Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities. Concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues.
arXiv Detail & Related papers (2023-10-03T04:56:12Z)
Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text. Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
Faithfulness Tests for Natural Language Explanations [87.01093277918599]
Explanations of neural models aim to reveal a model's decision-making process for its predictions. Recent work shows that current methods giving explanations such as saliency maps or counterfactuals can be misleading. This work explores the challenging question of evaluating the faithfulness of natural language explanations.
arXiv Detail & Related papers (2023-05-29T11:40:37Z)
Benchmarking Faithfulness: Towards Accurate Natural Language Explanations in Vision-Language Tasks [0.0]
Natural language explanations (NLEs) promise to enable the communication of a model's decision-making in an easily intelligible way. While current models successfully generate convincing explanations, it is an open question how well the NLEs actually represent the reasoning process of the models. We propose three faithfulness metrics: Attribution-Similarity, NLE-Sufficiency, and NLE-Comprehensiveness.
arXiv Detail & Related papers (2023-04-03T08:24:10Z)
NELLIE: A Neuro-Symbolic Inference Engine for Grounded, Compositional, and Explainable Reasoning [59.16962123636579]
This paper proposes a new take on Prolog-based inference engines. We replace handcrafted rules with a combination of neural language modeling, guided generation, and semi dense retrieval. Our implementation, NELLIE, is the first system to demonstrate fully interpretable, end-to-end grounded QA.
arXiv Detail & Related papers (2022-09-16T00:54:44Z)
Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals. It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation. It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z)
How Much Can I Trust You? -- Quantifying Uncertainties in Explaining Neural Networks [19.648814035399013]
Explainable AI (XAI) aims to provide interpretations for predictions made by learning machines, such as deep neural networks. We propose a new framework that allows to convert any arbitrary explanation method for neural networks into an explanation method for Bayesian neural networks. We demonstrate the effectiveness and usefulness of our approach extensively in various experiments.
arXiv Detail & Related papers (2020-06-16T08:54:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.