ALMANACS: A Simulatability Benchmark for Language Model Explainability
- URL: http://arxiv.org/abs/2312.12747v2
- Date: Sun, 02 Feb 2025 09:16:25 GMT
- Title: ALMANACS: A Simulatability Benchmark for Language Model Explainability
- Authors: Edmund Mills, Shiye Su, Stuart Russell, Scott Emmons
- Abstract summary: We present ALMANACS, a language model explainability benchmark. ALMANACS scores explainability methods on simulatability, i.e., how well the explanations improve behavior prediction on new inputs. By using another language model to predict behavior based on the explanations, ALMANACS is a fully automated benchmark.
- Score: 9.037709044327066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How do we measure the efficacy of language model explainability methods? While many explainability methods have been developed, they are typically evaluated on bespoke tasks, preventing an apples-to-apples comparison. To help fill this gap, we present ALMANACS, a language model explainability benchmark. ALMANACS scores explainability methods on simulatability, i.e., how well the explanations improve behavior prediction on new inputs. The ALMANACS scenarios span twelve safety-relevant topics such as ethical reasoning and advanced AI behaviors; they have idiosyncratic premises to invoke model-specific behavior; and they have a train-test distributional shift to encourage faithful explanations. By using another language model to predict behavior based on the explanations, ALMANACS is a fully automated benchmark. While not a replacement for human evaluations, we aim for ALMANACS to be a complementary, automated tool that allows for fast, scalable evaluation. Using ALMANACS, we evaluate counterfactual, rationalization, attention, and Integrated Gradients explanations. Our results are sobering: when averaged across all topics, no explanation method outperforms the explanation-free control. We conclude that despite modest successes in prior work, developing an explanation method that aids simulatability in ALMANACS remains an open challenge.
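The evaluation protocol described in the abstract can be sketched as a simple loop: a predictor language model is shown explanations of the target model's behavior on training inputs and must predict the target model's answers on held-out inputs, and the explanation method is scored by how much this prediction improves over an explanation-free control. Below is a minimal sketch of that general idea; the `target_model`, `predictor_lm`, and `explain` interfaces are hypothetical, and ALMANACS's actual prompt format and distance metric may differ.

```python
from typing import Callable, List

def simulatability_error(
    target_model: Callable[[str], float],   # returns the model's P(yes) for a scenario question
    predictor_lm: Callable[[str], float],   # another LM that guesses P(yes) from a text prompt
    explain: Callable[[str, float], str],   # explanation method under evaluation
    train_inputs: List[str],
    test_inputs: List[str],
) -> float:
    """Average absolute error of the predictor on held-out inputs, when prompted
    with explanations of the target model's training behavior. Lower is better;
    compare against the same loop run with the explanation text omitted."""
    # Build a context of (input, target answer, explanation) triples.
    context = ""
    for x in train_inputs:
        y = target_model(x)
        context += f"Input: {x}\nModel answer: {y:.2f}\nExplanation: {explain(x, y)}\n\n"

    errors = []
    for x in test_inputs:
        true_y = target_model(x)  # the behavior we want the predictor to simulate
        pred_y = predictor_lm(context + f"Input: {x}\nPredicted model answer:")
        errors.append(abs(pred_y - true_y))
    return sum(errors) / len(errors)
```

An explanation method aids simulatability in this sense only if its error is lower than the explanation-free control's.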
Related papers
- Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction [2.5450067638785945]
This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus. We evaluate open-weight 8B models (Llama3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios.
arXiv Detail & Related papers (2026-02-26T08:40:21Z) - From Features to Actions: Explainability in Traditional and Agentic AI Systems [8.859406164948718]
We bridge the gap between static and agentic explainability by comparing attribution-based explanations with trace-based diagnostics. Our results show that trace-based diagnostics for agentic settings consistently localize behaviour breakdowns.
arXiv Detail & Related papers (2026-02-06T16:34:29Z) - Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations [1.8772057593980798]
Large Language Models (LLMs) can produce verbalized self-explanations. We evaluate how well humans and LLM judges can predict a model's answers to counterfactual follow-up questions.
arXiv Detail & Related papers (2026-01-07T10:13:26Z) - QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [89.97082652805904]
We propose QLASS (Q-guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q-values.
With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value.
We empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis.
arXiv Detail & Related papers (2025-02-04T18:58:31Z) - An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems [0.3480973072524161]
Recent research in explainability has focused on explaining the workings of AI models or model explainability.
This thesis seeks to bridge some gaps between model and user-centered explainability.
arXiv Detail & Related papers (2024-10-23T02:03:49Z) - Comparing zero-shot self-explanations with human rationales in text classification [5.32539007352208]
We evaluate self-explanations with respect to their plausibility to humans and their faithfulness to models.
We show that self-explanations align more closely with human annotations compared to LRP, while maintaining a comparable level of faithfulness.
arXiv Detail & Related papers (2024-10-04T10:14:12Z) - Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA [7.141288053123662]
Natural language explanation in visual question answer (VQA-NLE) aims to explain the decision-making process of models by generating natural language sentences to increase users' trust in the black-box systems.
Existing post-hoc explanations are not always aligned with human logical inference, suffering from three issues: 1) deductive unsatisfiability, where the generated explanation does not logically lead to the answer; 2) factual inconsistency, where the model falsifies its counterfactual explanation for an answer without considering the facts in the image; and 3) semantic perturbation insensitivity, where the model cannot recognize the semantic changes caused by small perturbations.
arXiv Detail & Related papers (2023-12-21T05:51:55Z) - Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.
arXiv Detail & Related papers (2023-12-10T23:13:23Z) - FIND: A Function Description Benchmark for Evaluating Interpretability Methods [86.80718559904854]
This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating automated interpretability methods.
FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate.
We evaluate methods that use pretrained language models to produce descriptions of function behavior in natural language and code.
arXiv Detail & Related papers (2023-09-07T17:47:26Z) - Counterfactuals of Counterfactuals: a back-translation-inspired approach to analyse counterfactual editors [3.4253416336476246]
We focus on the analysis of counterfactual, contrastive explanations.
We propose a new back-translation-inspired evaluation methodology.
We show that by iteratively feeding the counterfactual to the explainer we can obtain valuable insights into the behaviour of both the predictor and the explainer models.
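The back-translation analogy can be illustrated by repeatedly applying the counterfactual editor and checking whether successive edits flip the predictor's label and how far they drift from the original input. The sketch below is only an illustration of that loop; the `predictor`, `editor`, and `similarity` interfaces are hypothetical placeholders, not the paper's actual API.

```python
def iterate_counterfactuals(text, predictor, editor, similarity, n_rounds=3):
    """Feed each counterfactual back to the editor ('counterfactuals of
    counterfactuals') and record label flips and drift from the original."""
    history = []
    current = text
    for _ in range(n_rounds):
        target = 1 - predictor(current)               # request the opposite label
        current = editor(current, target_label=target)
        history.append({
            "text": current,
            "label": predictor(current),              # did the edit actually flip the label?
            "sim_to_original": similarity(text, current),  # how far have we drifted?
        })
    return history
```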
arXiv Detail & Related papers (2023-05-26T16:04:28Z) - MaNtLE: Model-agnostic Natural Language Explainer [9.43206883360088]
We introduce MaNtLE, a model-agnostic natural language explainer that analyzes multiple classifier predictions.
MaNtLE uses multi-task training on thousands of synthetic classification tasks to generate faithful explanations.
Simulated user studies indicate that, on average, MaNtLE-generated explanations are at least 11% more faithful compared to LIME and Anchors explanations.
arXiv Detail & Related papers (2023-05-22T12:58:06Z) - Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a blackbox fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - Prompting Contrastive Explanations for Commonsense Reasoning Tasks [74.7346558082693]
Large pretrained language models (PLMs) can achieve near-human performance on commonsense reasoning tasks.
We show how to use these same models to generate human-interpretable evidence.
arXiv Detail & Related papers (2021-06-12T17:06:13Z) - Search Methods for Sufficient, Socially-Aligned Feature Importance Explanations with In-Distribution Counterfactuals [72.00815192668193]
Feature importance (FI) estimates are a popular form of explanation, and they are commonly created and evaluated by computing the change in model confidence caused by removing certain input features at test time.
We study several under-explored dimensions of FI-based explanations, providing conceptual and empirical improvements for this form of explanation.
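The removal-based evaluation described above is commonly implemented as an occlusion test: mask each input feature and measure the drop in the model's confidence. A minimal sketch of that standard procedure follows, assuming a hypothetical `model` callable that maps a list of tokens to a confidence for the predicted class; it is not this paper's specific method.

```python
def occlusion_importance(tokens, model, mask_token="[MASK]"):
    """Score each token by the confidence drop caused by masking it."""
    base_conf = model(tokens)                            # confidence on the unmodified input
    scores = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        scores.append(base_conf - model(occluded))       # larger drop = more important feature
    return scores
```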
arXiv Detail & Related papers (2021-06-01T20:36:48Z) - Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [86.60613602337246]
We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations.
LAS measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output.
We frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage.
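A rough sketch of the leakage-adjusted idea: compare the observer's simulation accuracy with and without the explanation, grouping examples by whether the explanation alone already reveals (leaks) the label, so that trivially leaking explanations do not inflate the score. The function below is an illustrative simplification of that idea, not the paper's exact formula, and the `simulator(input=None, explanation=None)` interface is assumed.

```python
def leakage_adjusted_simulatability(examples, simulator):
    """Illustrative sketch of LAS. For each (input, explanation, model_label)
    triple, compare the simulator's correctness with and without the explanation;
    group examples by whether the explanation alone leaks the label; average the
    gain within each group, then macro-average the two groups."""
    groups = {True: [], False: []}
    for x, expl, y in examples:
        leaks = simulator(explanation=expl) == y              # label recoverable from explanation alone?
        with_expl = simulator(input=x, explanation=expl) == y
        without_expl = simulator(input=x) == y
        groups[leaks].append(int(with_expl) - int(without_expl))

    def mean(v):
        return sum(v) / len(v) if v else 0.0

    return 0.5 * (mean(groups[True]) + mean(groups[False]))
```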
arXiv Detail & Related papers (2020-10-08T16:59:07Z) - Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? [97.77183117452235]
We carry out human subject tests to isolate the effect of algorithmic explanations on model interpretability.
Clear evidence of method effectiveness is found in very few cases.
Our results provide the first reliable and comprehensive estimates of how explanations influence simulatability.
arXiv Detail & Related papers (2020-05-04T20:35:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.