Towards Large Language Models with Self-Consistent Natural Language Explanations
- URL: http://arxiv.org/abs/2506.07523v2
- Date: Thu, 12 Jun 2025 08:54:59 GMT
- Title: Towards Large Language Models with Self-Consistent Natural Language Explanations
- Authors: Sahar Admoni, Ofra Amir, Assaf Hallak, Yftah Ziser
- Abstract summary: Large language models (LLMs) seem to offer an easy path to interpretability. Yet, studies show that these post-hoc explanations often misrepresent the true decision process.
- Score: 11.085839471231552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their decisions. Yet, studies show that these post-hoc explanations often misrepresent the true decision process, as revealed by mismatches in feature importance. Despite growing evidence of this inconsistency, no systematic solutions have emerged, partly due to the high cost of estimating feature importance, which limits evaluations to small datasets. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB) - a large-scale benchmark of decisions spanning diverse tasks and models, each paired with LLM-generated explanations and corresponding feature importance scores. Analysis of PSCB reveals that self-consistency scores barely differ between correct and incorrect predictions. We also show that the standard metric fails to meaningfully distinguish between explanations. To overcome this limitation, we propose an alternative metric that more effectively captures variation in explanation quality. We use it to fine-tune LLMs via Direct Preference Optimization (DPO), leading to significantly better alignment between explanations and decision-relevant features, even under domain shift. Our findings point to a scalable path toward more trustworthy, self-consistent LLMs.
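The self-consistency check described in the abstract compares the features an LLM cites in its explanation against feature-importance scores for the same decision. The sketch below illustrates one plausible form such a comparison could take; the record fields, the top-k overlap score, and the rank-weighted alternative are illustrative assumptions, not the benchmark's published metric.

```python
# Illustrative sketch only; it shows one plausible way to score agreement between
# the features an LLM cites in its explanation and feature-importance scores for
# the same decision. Field names and both metrics are hypothetical.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExplainedDecision:
    """One record in the spirit of a PSCB-style entry (field names are assumptions)."""
    prediction: str
    explanation_features: List[str]   # features the explanation claims mattered
    importance: Dict[str, float]      # feature -> attribution score for the decision


def topk_consistency(record: ExplainedDecision, k: int = 5) -> float:
    """Fraction of the top-k most important features that the explanation mentions.

    Mirrors a common "top-k overlap" style of self-consistency check; the actual
    PSCB metric may differ.
    """
    ranked = sorted(record.importance, key=record.importance.get, reverse=True)
    top_k = set(ranked[:k])
    cited = set(record.explanation_features)
    return len(top_k & cited) / max(1, min(k, len(ranked)))


def rank_weighted_consistency(record: ExplainedDecision) -> float:
    """A softer alternative: weight each cited feature by its normalized importance,
    so explanations citing highly relevant features score higher than those citing
    marginal ones. This is an assumption about what a more discriminative metric
    could look like, not the paper's proposed metric.
    """
    total = sum(abs(v) for v in record.importance.values()) or 1.0
    cited = set(record.explanation_features)
    return sum(abs(record.importance.get(f, 0.0)) for f in cited) / total


if __name__ == "__main__":
    rec = ExplainedDecision(
        prediction="positive",
        explanation_features=["price", "delivery time"],
        importance={"price": 0.42, "delivery time": 0.31, "packaging": 0.05, "font": 0.01},
    )
    print(f"top-k overlap:       {topk_consistency(rec, k=2):.2f}")
    print(f"rank-weighted score: {rank_weighted_consistency(rec):.2f}")
```

Under this kind of scoring, DPO preference pairs could be formed by ranking two candidate explanations of the same decision by their consistency score; the abstract does not specify the paper's exact pair-construction procedure.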
Related papers
- Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous work, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
- DecisionFlow: Advancing Large Language Model as Principled Decision Maker [48.654276010223384]
DecisionFlow is a novel decision modeling framework that guides models to reason over structured representations of actions, attributes, and constraints. Rather than predicting answers directly from prompts, DecisionFlow builds a semantically grounded decision space and infers a latent utility function. Empirical results show that DecisionFlow achieves up to 30% accuracy gains over strong prompting baselines.
arXiv Detail & Related papers (2025-05-27T16:23:53Z)
- Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations [30.68740512996253]
Chain-of-thought explanations are widely used to inspect the decision process of large language models. We show that preference optimization can inadvertently reduce the faithfulness of these explanations.
arXiv Detail & Related papers (2025-04-07T17:49:23Z)
- The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction [34.86855316803838]
We identify a set of linear features in the model's residual stream that govern the balance between genuine reasoning and memory recall. We show that intervening on these reasoning features helps the model more accurately activate the most relevant problem-solving capabilities during answer generation.
arXiv Detail & Related papers (2025-03-29T14:00:44Z)
- Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models [55.332004960574004]
Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established. This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt. We propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty.
arXiv Detail & Related papers (2024-07-20T11:19:58Z)
- Evaluating the Reliability of Self-Explanations in Large Language Models [2.8894038270224867]
We evaluate two kinds of such self-explanations: extractive and counterfactual. Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process. We show that this gap can be bridged, as prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results.
arXiv Detail & Related papers (2024-07-19T17:41:08Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Language Model Cascades: Token-level uncertainty and beyond [65.38515344964647]
Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks.
Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs.
We show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform simple aggregation strategies.
arXiv Detail & Related papers (2024-04-15T21:02:48Z)
- FaithLM: Towards Faithful Explanations for Large Language Models [67.29893340289779]
Large Language Models (LLMs) have become proficient in addressing complex tasks by leveraging their internal knowledge and reasoning capabilities.
The black-box nature of these models complicates the task of explaining their decision-making processes.
We introduce FaithLM to explain the decisions of LLMs with natural language (NL) explanations.
arXiv Detail & Related papers (2024-02-07T09:09:14Z)