A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior
- URL: http://arxiv.org/abs/2602.02639v1
- Date: Mon, 02 Feb 2026 18:54:51 GMT
- Title: A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior
- Authors: Harry Mayne, Justin Singh Kang, Dewi Gould, Kannan Ramchandran, Adam Mahdi, Noah Y. Siegel
- Abstract summary: LLM self-explanations are often presented as a promising tool for AI oversight, yet their faithfulness to the model's true reasoning process is poorly understood. We introduce Normalized Simulatability Gain (NSG), a metric based on the idea that a faithful explanation should allow an observer to learn a model's decision-making criteria. We find self-explanations substantially improve prediction of model behavior (11-37% NSG).
- Score: 11.616524876789624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM self-explanations are often presented as a promising tool for AI oversight, yet their faithfulness to the model's true reasoning process is poorly understood. Existing faithfulness metrics have critical limitations, typically relying on identifying unfaithfulness via adversarial prompting or detecting reasoning errors. These methods overlook the predictive value of explanations. We introduce Normalized Simulatability Gain (NSG), a general and scalable metric based on the idea that a faithful explanation should allow an observer to learn a model's decision-making criteria, and thus better predict its behavior on related inputs. We evaluate 18 frontier proprietary and open-weight models, e.g., Gemini 3, GPT-5.2, and Claude 4.5, on 7,000 counterfactuals from popular datasets covering health, business, and ethics. We find self-explanations substantially improve prediction of model behavior (11-37% NSG). Self-explanations also provide more predictive information than explanations generated by external models, even when those models are stronger. This implies an advantage from self-knowledge that external explanation methods cannot replicate. Our approach also reveals that, across models, 5-15% of self-explanations are egregiously misleading. Despite their imperfections, we show a positive case for self-explanations: they encode information that helps predict model behavior.
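The abstract does not give the exact formula for NSG, but the hedged sketch below illustrates one natural reading: an observer (simulator) predicts the model's answers on counterfactual inputs once without and once with the self-explanation, and the accuracy gain is normalized by the headroom above the no-explanation baseline. The function names and the normalization choice here are assumptions for illustration, not the paper's definition.

```python
# Hypothetical sketch of a Normalized Simulatability Gain (NSG) style metric.
# Assumption (not from the paper's text): NSG = (acc_with_expl - acc_baseline) / (1 - acc_baseline),
# i.e. the share of the remaining prediction headroom recovered by the self-explanation.
from typing import Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of counterfactuals where the simulator's guess matches the model's actual answer."""
    assert len(predicted) == len(actual) and len(actual) > 0
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def normalized_simulatability_gain(
    preds_without_expl: Sequence[str],
    preds_with_expl: Sequence[str],
    model_answers: Sequence[str],
) -> float:
    """How much of the achievable headroom the explanation recovers (0 = none, 1 = all)."""
    baseline = accuracy(preds_without_expl, model_answers)
    with_expl = accuracy(preds_with_expl, model_answers)
    headroom = 1.0 - baseline
    if headroom == 0.0:
        return 0.0  # the simulator already predicts the model perfectly without explanations
    return (with_expl - baseline) / headroom


# Toy example: the baseline simulator gets 3/5 counterfactuals right; with the
# self-explanation it gets 4/5, so NSG = (0.8 - 0.6) / (1 - 0.6) = 0.5.
print(normalized_simulatability_gain(
    ["A", "B", "A", "B", "A"],  # simulator predictions without the explanation
    ["A", "B", "A", "A", "A"],  # simulator predictions with the explanation
    ["A", "B", "B", "A", "A"],  # the model's actual answers on the counterfactuals
))
```

Under this reading, the 11-37% figures reported in the abstract would mean self-explanations close roughly a tenth to a third of the gap between the no-explanation baseline and perfect simulation.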
Related papers
- Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment [0.3823356975862005]
We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts. Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
arXiv Detail & Related papers (2026-02-16T14:29:46Z) - Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations [1.8772057593980798]
Large Language Models (LLMs) can produce verbalized self-explanations. We evaluate how well humans and LLM judges can predict a model's answers to counterfactual follow-up questions.
arXiv Detail & Related papers (2026-01-07T10:13:26Z) - I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z) - XForecast: Evaluating Natural Language Explanations for Time Series Forecasting [72.57427992446698]
Time series forecasting aids decision-making, especially for stakeholders who rely on accurate predictions.
Traditional explainable AI (XAI) methods, which underline feature or temporal importance, often require expert knowledge.
Evaluating forecast natural language explanations (NLEs) is difficult due to the complex causal relationships in time series data.
arXiv Detail & Related papers (2024-10-18T05:16:39Z) - Take It Easy: Label-Adaptive Self-Rationalization for Fact Verification and Explanation Generation [15.94564349084642]
Self-rationalization is typically used in natural language inference tasks.
We fine-tune a model to learn veracity prediction with annotated labels.
We generate synthetic explanations from three large language models.
arXiv Detail & Related papers (2024-10-05T02:19:49Z) - Are self-explanations from Large Language Models faithful? [35.40666730867487]
Large Language Models (LLMs) excel at many tasks and can even explain their reasoning in so-called self-explanations.
It is important to measure whether these self-explanations truly reflect the model's behavior.
We propose employing self-consistency checks to measure faithfulness.
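As one hypothetical example of what such a check can look like, the sketch below implements a simple redaction test: if the self-explanation names the words it claims drove the prediction, removing them should change the model's answer. The `query_model` callable is an assumed placeholder, and the cited paper's actual protocol may differ.

```python
# Hypothetical redaction-style self-consistency check for feature-importance
# self-explanations; `query_model` is an assumed placeholder for a model call.
from typing import Callable, List


def redaction_consistency(
    query_model: Callable[[str], str],
    text: str,
    claimed_important_words: List[str],
    mask: str = "[REDACTED]",
) -> bool:
    """Return True if removing the words the explanation calls important flips the prediction."""
    original_prediction = query_model(text)
    redacted = text
    for word in claimed_important_words:
        redacted = redacted.replace(word, mask)
    redacted_prediction = query_model(redacted)
    # A faithful importance claim should make the prediction move once the
    # supposedly decisive evidence is removed.
    return redacted_prediction != original_prediction
```

A more careful check would also redact the same number of randomly chosen words as a control, so that any prediction change can be attributed to the claimed evidence rather than to redaction itself.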
arXiv Detail & Related papers (2024-01-15T19:39:15Z) - Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution regarding the usefulness of saliency-based explanations and their potential for misunderstanding.
arXiv Detail & Related papers (2023-12-10T23:13:23Z) - PhilaeX: Explaining the Failure and Success of AI Models in Malware Detection [6.264663726458324]
An explanation of an AI model's prediction used to support decision making in cyber security is of critical importance.
Most existing AI models lack the ability to explain their prediction results, despite their strong performance in most scenarios.
We propose PhilaeX, a novel explainable AI method that identifies an optimized subset of features to form complete explanations of AI models' predictions.
arXiv Detail & Related papers (2022-07-02T05:06:24Z) - VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives [84.48039784446166]
We show that model feature importance (FI) supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason metrics.
Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets.
Predictions are more accurate when explanations are both plausible and faithful, but not when they are plausible yet unfaithful.
arXiv Detail & Related papers (2022-06-22T17:02:01Z) - Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
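The summary above mentions a diversity-enforcing loss over perturbations in a disentangled latent space; the snippet below is only a generic illustration of such a term (a negative mean pairwise distance between K latent perturbations), not the loss used in the cited paper.

```python
# Generic, hypothetical diversity-enforcing term: minimizing it pushes the K
# latent perturbations apart, discouraging them from collapsing onto one
# trivial counterfactual. Not the cited paper's exact loss.
import numpy as np


def diversity_loss(perturbations: np.ndarray) -> float:
    """perturbations: array of shape (K, latent_dim); returns negative mean pairwise L2 distance."""
    k = perturbations.shape[0]
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += float(np.linalg.norm(perturbations[i] - perturbations[j]))
            pairs += 1
    return -total / pairs if pairs else 0.0
```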
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.