Related papers: How (Not) To Evaluate Explanation Quality

URL: http://arxiv.org/abs/2210.07126v1
Date: Thu, 13 Oct 2022 16:06:59 GMT
Title: How (Not) To Evaluate Explanation Quality
Authors: Hendrik Schuff, Heike Adel, Peng Qi, Ngoc Thang Vu
Abstract summary: We formulate desired characteristics of explanation quality that apply across tasks and domains. We propose actionable guidelines to overcome obstacles that limit today's evaluation of explanation quality.
Score: 29.40729766120284
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The importance of explainability is increasingly acknowledged in natural language processing. However, it is still unclear how the quality of explanations can be assessed effectively. The predominant approach is to compare proxy scores (such as BLEU or explanation F1) evaluated against gold explanations in the dataset. The assumption is that an increase of the proxy score implies a higher utility of explanations to users. In this paper, we question this assumption. In particular, we (i) formulate desired characteristics of explanation quality that apply across tasks and domains, (ii) point out how current evaluation practices violate those characteristics, and (iii) propose actionable guidelines to overcome obstacles that limit today's evaluation of explanation quality and to enable the development of explainable systems that provide tangible benefits for human users. We substantiate our theoretical claims (i.e., the lack of validity and temporal decline of currently-used proxy scores) with empirical evidence from a crowdsourcing case study in which we investigate the explanation quality of state-of-the-art explainable question answering systems.

Related papers

FinGrAct: A Framework for FINe-GRrained Evaluation of ACTionability in Explainable Automatic Fact-Checking [2.0140898354987353]
This paper introduces FinGrAct, a fine-grained evaluation framework that can access the web. It is designed to assess actionability in Automatic Fact-Checking explanations through well-defined criteria and an evaluation dataset. FinGrAct surpasses state-of-the-art evaluators, achieving the highest Pearson and Kendall correlation with human judgments.
arXiv Detail & Related papers (2025-04-07T16:14:27Z)
Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution [3.0658381192498907]
XAI practitioners rely on measures to gauge the quality of explanations. Traditionally, the quality of an explanation has been assessed by comparing it to a randomly generated counterpart. This paper introduces an alternative: the Quality Gap Estimate (QGE).
arXiv Detail & Related papers (2025-02-21T12:04:01Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
Explainability for Transparent Conversational Information-Seeking [13.790574266700006]
This study explores different methods of explaining system responses in conversational information-seeking. By exploring transparency across explanation type, quality, and presentation mode, this research aims to bridge the gap between system-generated responses and responses verifiable by the user.
arXiv Detail & Related papers (2024-05-06T09:25:14Z)
Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development. To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps. These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.
arXiv Detail & Related papers (2023-12-10T23:13:23Z)
On the stability, correctness and plausibility of visual explanation methods based on feature importance [0.0]
We study the articulation between the stability, correctness and plausibility of explanations based on feature importance for image classifiers. We show that the existing metrics for evaluating these properties do not always agree, raising the issue of what constitutes a good evaluation metric for explanations.
arXiv Detail & Related papers (2023-10-25T08:59:21Z)
Explaining Explainability: Towards Deeper Actionable Insights into Deep Learning through Second-order Explainability [70.60433013657693]
Second-order explainable AI (SOXAI) was recently proposed to extend explainable AI (XAI) from the instance level to the dataset level. We demonstrate for the first time, via example classification and segmentation cases, that eliminating irrelevant concepts from the training set based on actionable insights from SOXAI can enhance a model's performance.
arXiv Detail & Related papers (2023-06-14T23:24:01Z)
Explainability in Process Outcome Prediction: Guidelines to Obtain Interpretable and Faithful Models [77.34726150561087]
We define explainability through the interpretability of the explanations and the faithfulness of the explainability model in the field of process outcome prediction. This paper contributes a set of guidelines named X-MOP which allows selecting the appropriate model based on the event log specifications.
arXiv Detail & Related papers (2022-03-30T05:59:50Z)
Diagnostics-Guided Explanation Generation [32.97930902104502]
Explanations shed light on a machine learning model's rationales and can aid in identifying deficiencies in its reasoning process. We show how to optimise for several diagnostic properties when training a model to generate sentence-level explanations.
arXiv Detail & Related papers (2021-09-08T16:27:52Z)
Prompting Contrastive Explanations for Commonsense Reasoning Tasks [74.7346558082693]
Large pretrained language models (PLMs) can achieve near-human performance on commonsense reasoning tasks. We show how to use these same models to generate human-interpretable evidence.
arXiv Detail & Related papers (2021-06-12T17:06:13Z)
Do Natural Language Explanations Represent Valid Logical Arguments? Verifying Entailment in Explainable NLI Gold Standards [0.0]
An emerging line of research in Explainable NLP is the creation of datasets enriched with human-annotated explanations and rationales. While human-annotated explanations are used as ground-truth for the inference, there is a lack of systematic assessment of their consistency and rigour. We propose a systematic annotation methodology, named Explanation Entailment Verification (EEV), to quantify the logical validity of human-annotated explanations.
arXiv Detail & Related papers (2021-05-05T10:59:26Z)
Human Evaluation of Spoken vs. Visual Explanations for Open-Domain QA [22.76153284711981]
We study whether explanations help users correctly decide when to accept or reject an ODQA system's answer. Our results show that explanations derived from retrieved evidence passages can outperform strong baselines (calibrated confidence) across modalities. We show common failure cases of current explanations, emphasize end-to-end evaluation of explanations, and caution against evaluating them in proxy modalities that are different from deployment.
arXiv Detail & Related papers (2020-12-30T08:19:02Z)
Evaluations and Methods for Explanation through Robustness Analysis [117.7235152610957]
We establish a novel set of evaluation criteria for feature-based explanations by robustness analysis. We obtain new explanations that are loosely necessary and sufficient for a prediction. We extend the explanation to extract the set of features that would move the current prediction to a target class.
arXiv Detail & Related papers (2020-05-31T05:52:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.