How can I choose an explainer? An Application-grounded Evaluation of
Post-hoc Explanations
- URL: http://arxiv.org/abs/2101.08758v2
- Date: Fri, 22 Jan 2021 12:05:16 GMT
- Title: How can I choose an explainer? An Application-grounded Evaluation of
Post-hoc Explanations
- Authors: Sérgio Jesus, Catarina Belém, Vladimir Balayan, João Bento,
Pedro Saleiro, Pedro Bizarro, João Gama
- Abstract summary: Explanations are seldom evaluated based on their true practical impact on decision-making tasks.
This study proposes XAI Test, an application-grounded evaluation methodology tailored to isolate the impact of providing the end-user with different levels of information.
Using strong statistical analysis, we show that, in general, popular explainers have a worse impact than desired.
- Score: 2.7708222692419735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There have been several research works proposing new Explainable AI (XAI)
methods designed to generate model explanations having specific properties, or
desiderata, such as fidelity, robustness, or human-interpretability. However,
explanations are seldom evaluated based on their true practical impact on
decision-making tasks. Without that assessment, explanations might be chosen
that, in fact, hurt the overall performance of the combined system of ML model
+ end-users. This study aims to bridge this gap by proposing XAI Test, an
application-grounded evaluation methodology tailored to isolate the impact of
providing the end-user with different levels of information. We conducted an
experiment following XAI Test to evaluate three popular post-hoc explanation
methods -- LIME, SHAP, and TreeInterpreter -- on a real-world fraud detection
task, with real data, a deployed ML model, and fraud analysts. During the
experiment, we gradually increased the information provided to the fraud
analysts in three stages: Data Only, i.e., just transaction data without access
to the model score or explanations; Data + ML Model Score; and Data + ML Model
Score + Explanations. Using strong statistical analysis, we show that, in
general, these popular explainers have a worse impact than desired. Some of the
conclusion highlights include: i) Data Only yields the highest
decision accuracy and the slowest decision time among all variants tested; ii)
all the explainers improve accuracy over the Data + ML Model Score variant but
still result in lower accuracy when compared with Data Only; iii) LIME was the
least preferred by users, probably due to its substantially lower variability
of explanations from case to case.
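Code sketch: the snippet below is a minimal, hedged illustration (not the paper's actual pipeline) of how the three post-hoc explainers compared in the study can be applied to a tree-ensemble classifier. A synthetic dataset and a RandomForestClassifier stand in for the real transaction data and the deployed fraud model; the "legit"/"fraud" class names are illustrative, and the calls are the publicly documented APIs of the lime, shap, and treeinterpreter packages.

```python
# Minimal sketch. Assumptions: synthetic data and a RandomForestClassifier stand in
# for the real transactions and the deployed fraud model used in the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

from lime.lime_tabular import LimeTabularExplainer
import shap
from treeinterpreter import treeinterpreter as ti

# Stand-in "transactions" and stand-in fraud model.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

x = X[0]  # one transaction to explain

# LIME: fits a local surrogate model around the instance being explained.
lime_explainer = LimeTabularExplainer(
    X, feature_names=feature_names, class_names=["legit", "fraud"], mode="classification"
)
lime_exp = lime_explainer.explain_instance(x, model.predict_proba, num_features=5)
print("LIME:", lime_exp.as_list())

# SHAP: Shapley-value feature attributions via TreeExplainer for tree ensembles.
shap_explainer = shap.TreeExplainer(model)
shap_values = shap_explainer.shap_values(x.reshape(1, -1))
print("SHAP:", shap_values)

# TreeInterpreter: decomposes the prediction into a bias term plus per-feature contributions.
pred, bias, contributions = ti.predict(model, x.reshape(1, -1))
print("TreeInterpreter:", contributions)
```

In the experiment described above, per-feature attributions of this kind are what would be shown to the fraud analysts in the Data + ML Model Score + Explanations stage; the sketch only shows how such raw attributions are obtained, not how they were rendered to end-users.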
Related papers
- Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks [59.47851630504264]
Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data.
We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods.
The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization.
arXiv Detail & Related papers (2025-02-07T10:01:32Z)
- F-Fidelity: A Robust Framework for Faithfulness Evaluation of Explainable AI [15.314388210699443]
Fine-tuned Fidelity (F-Fidelity) is a robust evaluation framework for XAI.
We show that F-Fidelity significantly improves upon prior evaluation metrics in recovering the ground-truth ranking of explainers.
We also show that, given a faithful explainer, the F-Fidelity metric can be used to compute the sparsity of influential input components.
arXiv Detail & Related papers (2024-10-03T20:23:06Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- Using LLMs for Explaining Sets of Counterfactual Examples to Final Users [0.0]
In automated decision-making scenarios, causal inference methods can analyze the underlying data-generation process.
Counterfactual examples explore hypothetical scenarios where a minimal number of factors are altered.
We propose a novel multi-step pipeline that uses counterfactuals to generate natural language explanations of actions that will lead to a change in outcome.
arXiv Detail & Related papers (2024-08-27T15:13:06Z)
- Analyzing the Influence of Training Samples on Explanations [5.695152528716705]
We propose a novel problem of identifying training data samples that have a high influence on a given explanation.
For this, we propose an algorithm that identifies such influential training samples.
arXiv Detail & Related papers (2024-06-05T07:20:06Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- PhilaeX: Explaining the Failure and Success of AI Models in Malware Detection [6.264663726458324]
An explanation of an AI model's prediction, used to support decision making in cyber security, is of critical importance.
Most existing AI models lack the ability to provide explanations for their prediction results, despite their strong performance in most scenarios.
We propose a novel explainable AI method, called PhilaeX, that provides the means to identify the optimized subset of features to form the complete explanations of AI models' predictions.
arXiv Detail & Related papers (2022-07-02T05:06:24Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [86.60613602337246]
We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations.
LAS measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output.
We frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage.
arXiv Detail & Related papers (2020-10-08T16:59:07Z)
- Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? [97.77183117452235]
We carry out human subject tests to isolate the effect of algorithmic explanations on model interpretability.
Clear evidence of method effectiveness is found in very few cases.
Our results provide the first reliable and comprehensive estimates of how explanations influence simulatability.
arXiv Detail & Related papers (2020-05-04T20:35:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.