Counterfactual Simulatability of LLM Explanations for Generation Tasks
- URL: http://arxiv.org/abs/2505.21740v1
- Date: Tue, 27 May 2025 20:29:50 GMT
- Title: Counterfactual Simulatability of LLM Explanations for Generation Tasks
- Authors: Marvin Limpijankit, Yanda Chen, Melanie Subbiah, Nicholas Deas, Kathleen McKeown
- Abstract summary: The ability of models to accurately explain their behavior is critical, especially in high-stakes settings. Counterfactual simulatability measures how well an explanation allows users to infer the model's output on related counterfactuals. Our results suggest that counterfactual simulatability evaluation may be more appropriate for skill-based tasks than for knowledge-based tasks.
- Score: 15.969128610152586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability: how well an explanation allows users to infer the model's output on related counterfactuals. Counterfactual simulatability has previously been studied for yes/no question-answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that counterfactual simulatability evaluation may be better suited to skill-based tasks than to knowledge-based tasks.
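For intuition, here is a minimal sketch of the kind of evaluation loop the abstract describes, under our own assumptions: `model` and `simulator` stand in for LLM calls, counterfactual inputs are assumed given, and string overlap is a crude placeholder for the entailment- or similarity-based matching a generation task would actually need. None of these names come from the paper.

```python
# Minimal sketch of a counterfactual-simulatability loop for generation
# tasks. `model` and `simulator` are stand-ins for LLM calls; names and
# prompts are ours, not the paper's.
from difflib import SequenceMatcher
from typing import Callable

def simulatability(
    model: Callable[[str], str],      # the LLM under evaluation
    simulator: Callable[[str], str],  # predicts outputs from explanations
    x: str,
    counterfactuals: list[str],
) -> float:
    # 1. Get the model's output and self-explanation on the original input.
    output = model(x)
    explanation = model(f"{x}\nExplain why you produced: {output}")

    scores = []
    for x_cf in counterfactuals:
        # 2. The simulator predicts the model's output on the counterfactual,
        #    seeing only the original input/output pair and the explanation.
        predicted = simulator(
            f"Input: {x}\nModel output: {output}\n"
            f"Explanation: {explanation}\n"
            f"Predict the model's output for: {x_cf}"
        )
        # 3. Compare prediction against the model's actual output with a
        #    graded score, since exact match is meaningless for generation.
        actual = model(x_cf)
        scores.append(SequenceMatcher(None, predicted, actual).ratio())

    # Higher mean = explanations better support counterfactual inference.
    return sum(scores) / len(scores)
```

The key departure from the yes/no setting is step 3: generation outputs must be compared with a graded similarity score rather than exact-match agreement.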
Related papers
- Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following [12.668201122427101]
Large Language Models (LLMs) are increasingly used to simulate personas in virtual environments. We show that even state-of-the-art LLMs cannot simulate personas with reversed performance.
arXiv Detail & Related papers (2025-04-08T22:00:32Z)
- CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models [5.409370027524351]
We evaluate the performance of large language models (LLMs) in counterfactual reasoning. We introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions.
arXiv Detail & Related papers (2025-02-16T06:19:37Z)
- Eliciting Causal Abilities in Large Language Models for Reasoning Tasks [14.512834333917414]
We introduce the Self-Causal Instruction Enhancement (SCIE) method, which enables LLMs to generate high-quality, low-quantity observational data. In SCIE, the instructions are treated as the treatment, and textual features are used to process natural language. Our method effectively generates instructions that enhance reasoning performance while reducing the cost of prompt training.
arXiv Detail & Related papers (2024-12-19T17:03:02Z)
- Using LLMs for Explaining Sets of Counterfactual Examples to Final Users [0.0]
In automated decision-making scenarios, causal inference methods can analyze the underlying data-generation process.
Counterfactual examples explore hypothetical scenarios where a minimal number of factors are altered.
We propose a novel multi-step pipeline that uses counterfactuals to generate natural language explanations of actions that will lead to a change in outcome.
arXiv Detail & Related papers (2024-08-27T15:13:06Z)
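A rough sketch of what the pipeline in the entry above might look like: diff a factual instance against its counterfactuals, then ask an LLM to verbalize the changes as actions. The counterfactual examples are assumed to come from an upstream generator (e.g., a DiCE-style method), and `llm` is a hypothetical callable, not the paper's interface.

```python
# Hedged sketch: turn a set of counterfactual examples into a natural
# language explanation of actions. All names are illustrative.
from typing import Callable

def explain_counterfactuals(
    llm: Callable[[str], str],
    factual: dict,
    counterfactuals: list[dict],
) -> str:
    # Step 1: keep only the features each counterfactual actually changes.
    deltas = [
        {k: (factual[k], cf[k]) for k in factual if cf.get(k) != factual[k]}
        for cf in counterfactuals
    ]
    # Step 2: verbalize the minimal changes as user-facing actions.
    changes = "\n".join(
        ", ".join(f"{k}: {old} -> {new}" for k, (old, new) in d.items())
        for d in deltas
    )
    return llm(
        "Each change below flips the model's decision. Summarize them as "
        f"plain-language actions the user could take:\n{changes}"
    )
```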
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
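One way to realize "uncertainty over the distribution of generated explanations" might look like the following. Here `sample` and `entails` are hypothetical placeholders for repeated chain-of-thought sampling and an entailment scorer, and the normalization is our simplification, not the paper's exact metric.

```python
# Loose sketch of confidence derived from a distribution of sampled
# explanations. Placeholders throughout; not the paper's API.
from typing import Callable

def explanation_confidence(
    sample: Callable[[], tuple[str, str]],  # one (explanation, answer) draw
    entails: Callable[[str, str], float],   # how well e supports answer a
    n_samples: int = 10,
) -> dict[str, float]:
    draws = [sample() for _ in range(n_samples)]
    answers = {a for _, a in draws}
    # An answer earns credit from every sampled explanation that supports
    # it, not just its own draw -- "stable" explanations dominate.
    raw = {
        a: sum(entails(e, a) for e, _ in draws) / n_samples
        for a in answers
    }
    total = sum(raw.values()) or 1.0
    return {a: s / total for a, s in raw.items()}  # normalized confidences
```

Answers whose explanations are also supported by the other samples' explanations accumulate more mass, which is the intuition behind "stable explanations."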
- Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions.
A reflective agent learns how to explain past stock movements through self-reasoning, while a Proximal Policy Optimization (PPO) trainer optimizes the model to generate the most likely explanations.
Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
The generated data are human-readable and effective at triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
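In the spirit of the prompt-chaining idea above, a two-step evidence perturbation might be sketched as follows (our simplification; `llm` is a hypothetical callable and the prompts are illustrative, not ReEval's actual chain):

```python
# Illustrative two-step prompt chain for perturbing retrieved evidence.
from typing import Callable

def perturb_case(
    llm: Callable[[str], str], question: str, evidence: str, answer: str
) -> tuple[str, str]:
    # Chain step 1: minimally edit the evidence so it no longer supports
    # the original answer.
    new_evidence = llm(
        f"Edit this passage minimally so that the answer to "
        f"'{question}' is no longer '{answer}':\n{evidence}"
    )
    # Chain step 2: read off the answer the edited passage now supports.
    new_answer = llm(
        f"Passage:\n{new_evidence}\n\nAnswer the question: {question}"
    )
    # A RAG model that still returns the old answer on the new evidence
    # is hallucinating from parametric memory rather than reading.
    return new_evidence, new_answer
```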
- Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z)
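To make the token-to-AST alignment concrete, here is an illustrative sketch, not ASTxplainer's implementation: it assumes per-token character spans and probabilities are already available from the model, and uses Python's ast module to credit each token to the node types whose spans contain it.

```python
# Illustrative sketch: aggregate per-token model probabilities onto the
# AST node types whose source spans contain each token.
import ast
from collections import defaultdict

def node_confidence(source: str, token_spans, token_probs):
    """token_spans: [(start, end)] char offsets; token_probs: the model's
    probability for the token at each span (assumed given)."""
    tree = ast.parse(source)
    # Precompute the character offset at which each source line starts.
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))

    per_type = defaultdict(list)
    for node in ast.walk(tree):
        if getattr(node, "end_lineno", None) is None:
            continue  # skip nodes without source locations (e.g., Module)
        begin = line_starts[node.lineno - 1] + node.col_offset
        end = line_starts[node.end_lineno - 1] + node.end_col_offset
        # Credit the node with every token falling inside its span.
        for (s, e), p in zip(token_spans, token_probs):
            if begin <= s and e <= end:
                per_type[type(node).__name__].append(p)

    return {t: sum(ps) / len(ps) for t, ps in per_type.items()}
```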
- Explaining Emergent In-Context Learning as Kernel Regression [61.57151500616111]
Large language models (LLMs) have initiated a paradigm shift in transfer learning.
In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training.
We find that during ICL, the attention and hidden features in LLMs match the behavior of kernel regression.
arXiv Detail & Related papers (2023-05-22T06:45:02Z)
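For reference, the standard kernel-regression estimator the finding refers to (our notation, not the paper's): the prediction for a test input $x$ is a kernel-weighted average of the in-context labels.

$$
\hat{y}(x) \;=\; \frac{\sum_{i=1}^{n} K(x, x_i)\, y_i}{\sum_{i=1}^{n} K(x, x_i)},
\qquad
K(x, x_i) \;=\; \exp\!\big(\langle q(x),\, k(x_i)\rangle\big)
$$

With this exponential dot-product kernel, the normalized weights are exactly softmax attention scores from the query representation of $x$ to the keys of the in-context demonstrations $(x_i, y_i)$.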