Related papers: LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

URL: http://arxiv.org/abs/2509.09396v1
Date: Thu, 11 Sep 2025 12:25:41 GMT
Title: LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
Authors: Harry Mayne, Ryan Othniel Kearns, Yushi Yang, Andrew M. Bean, Eoin Delaney, Chris Russell, Adam Mahdi
Abstract summary: To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs). We evaluate whether models can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary.
Score: 8.734404327315291
License: http://creativecommons.org/licenses/by/4.0/
Abstract: To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid, but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at https://github.com/HarryMayne/SCEs.
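
The two criteria named in the abstract, validity and minimality, can be illustrated with a small sketch. This is not the authors' evaluation code (see their repository for that); it assumes a hypothetical one-feature decision rule, `toy_predict`, standing in for a model, and a hypothetical `excess_distance` measure that compares an edit against the smallest edit known to flip the prediction.

```python
from typing import Callable

def is_valid(predict: Callable[[float], int], counterfactual: float, target: int) -> bool:
    """A counterfactual is valid if the model really predicts the target on it."""
    return predict(counterfactual) == target

def excess_distance(original: float, counterfactual: float, minimal: float) -> float:
    """Edit size minus the smallest label-flipping edit size.

    Positive: the edit went further than necessary (not minimal).
    Negative: the edit stopped short of the decision boundary.
    """
    return abs(counterfactual - original) - abs(minimal - original)

# Hypothetical stand-in for a model's decision rule (an assumption for
# illustration only): predict class 1 when the input is at least 50.
def toy_predict(age: float) -> int:
    return 1 if age >= 50 else 0

original = 40.0   # toy_predict(original) == 0
target = 1        # the outcome the counterfactual should produce
minimal = 50.0    # smallest input that flips the toy prediction

# Three hypothetical self-generated counterfactuals.
candidates = {"large edit": 80.0, "tiny edit": 42.0, "ideal edit": 50.0}

results = {
    name: (is_valid(toy_predict, value, target),
           excess_distance(original, value, minimal))
    for name, value in candidates.items()
}

for name, (valid, excess) in results.items():
    print(f"{name}: valid={valid}, excess_distance={excess}")
```

In this toy setting, the "large edit" mirrors the valid-but-not-minimal behaviour the abstract describes, while the "tiny edit" mirrors the failure seen when minimality is requested: the change is too small to cross the decision boundary.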

Related papers

Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z)
Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation [66.84286617519258]
Large language models (LLMs) are rapidly transforming social science research by enabling the automation of labor-intensive tasks. LLM outputs vary significantly depending on the implementation choices made by researchers. Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I, Type II, Type S, or Type M errors.
arXiv Detail & Related papers (2025-09-10T17:58:53Z)
CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking [16.10780837612994]
We present CANDY, a benchmark designed to evaluate the capabilities and limitations of large language models (LLMs) in fact-checking Chinese misinformation. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools in scenarios.
arXiv Detail & Related papers (2025-09-04T07:33:44Z)
Towards Large Language Models with Self-Consistent Natural Language Explanations [11.085839471231552]
Large language models (LLMs) seem to offer an easy path to interpretability. Yet, studies show that these post-hoc explanations often misrepresent the true decision process.
arXiv Detail & Related papers (2025-06-09T08:06:33Z)
A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs [14.334903198382287]
It remains unclear whether large language models can produce outputs aligned with a broad variety of user goals. Interventions to improve steerability, such as prompt engineering, have varying effectiveness. Even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient.
arXiv Detail & Related papers (2025-05-27T21:29:52Z)
Can LLMs Explain Themselves Counterfactually? [16.569180690291773]
Explanations are an important tool for gaining insights into the behavior of ML models. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs).
arXiv Detail & Related papers (2025-02-25T12:40:41Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning with refinement objective called ART: Ask, Refine, and Trust. It asks necessary questions to decide when an LLM should refine its output. It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.