Quantitative Evaluations on Saliency Methods: An Experimental Study
- URL: http://arxiv.org/abs/2012.15616v1
- Date: Thu, 31 Dec 2020 14:13:30 GMT
- Title: Quantitative Evaluations on Saliency Methods: An Experimental Study
- Authors: Xiao-Hui Li, Yuhan Shi, Haoyang Li, Wei Bai, Yuanwei Song, Caleb Chen Cao, Lei Chen
- Abstract summary: We briefly summarize the status quo of the metrics, including faithfulness, localization, false-positives, sensitivity check, and stability.
We conclude that among all the methods we compare, no single explanation method dominates others in all metrics.
- Score: 6.290238942982972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It has long been debated that eXplainable AI (XAI) is an important topic, yet it still lacks rigorous definitions and fair metrics. In this paper, we briefly summarize the status quo of these metrics (faithfulness, localization, false-positives, sensitivity check, and stability) and report an exhaustive experimental study based on them. From the experimental results, we conclude that among all the methods we compare, no single explanation method dominates the others on every metric. Nonetheless, Gradient-weighted Class Activation Mapping (Grad-CAM) and Randomized Input Sampling for Explanation (RISE) perform fairly well on most metrics. Using a set of filtered metrics, we further present a case study that diagnoses the classification bases of models. Beyond providing a comprehensive experimental study of the metrics, we also examine measuring factors that current metrics miss, and we hope this work can serve as a guide for future research.
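Among the metric families listed above, faithfulness is often operationalized as a deletion test: pixels are removed in order of decreasing saliency and the drop in the model's confidence for the predicted class is tracked. The following is a minimal NumPy sketch of such a deletion score, not the paper's exact protocol; the `model` callable, the zero-valued baseline, and the step count are illustrative assumptions.

```python
import numpy as np

def deletion_score(model, image, saliency, target_class, steps=20, baseline=0.0):
    """Deletion-style faithfulness score (illustrative sketch).

    Pixels are erased in order of decreasing saliency and the model's
    probability for `target_class` is recorded after each step. A faithful
    saliency map makes the probability collapse quickly, giving a small
    average score over the deletion curve (a simple AUC proxy).
    """
    h, w = saliency.shape
    order = np.argsort(saliency.ravel())[::-1]          # most salient pixels first
    per_step = int(np.ceil(h * w / steps))
    x = image.copy()
    curve = [model(x)[target_class]]
    for s in range(steps):
        idx = order[s * per_step:(s + 1) * per_step]
        rows, cols = np.unravel_index(idx, (h, w))
        x[rows, cols] = baseline                         # erase this batch of pixels
        curve.append(model(x)[target_class])
    return float(np.mean(curve))

# Toy usage with a dummy "model" that scores images by mean intensity.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 32, 3))
    sal = img.mean(axis=-1)                              # pretend saliency map
    dummy_model = lambda x: np.array([x.mean(), 1.0 - x.mean()])
    print("deletion score:", deletion_score(dummy_model, img, sal, target_class=0))
```

A smaller score indicates a more faithful saliency map; an insertion-style counterpart (progressively revealing salient pixels over a blurred baseline) is usually reported alongside it.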
Related papers
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice [6.091702876917282]
Classification systems are evaluated in a countless number of papers.
However, we find that evaluation practice is often nebulous.
Many works use so-called 'macro' metrics to rank systems but do not clearly specify what they would expect from such a metric.
arXiv Detail & Related papers (2024-04-25T18:12:43Z) - Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z) - Faithful Model Evaluation for Model-Based Metrics [22.753929098534403]
We establish the mathematical foundation of significance testing for model-based metrics.
We show that considering metric model errors to calculate sample variances for model-based metrics changes the conclusions in certain experiments.
arXiv Detail & Related papers (2023-12-19T19:41:33Z) - Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency; a short sketch of these two follows the list below) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z) - An Experimental Investigation into the Evaluation of Explainability
Methods [60.54170260771932]
This work compares 14 different metrics when applied to nine state-of-the-art XAI methods and three dummy methods (e.g., random saliency maps) used as references.
Experimental results show which of these metrics produces highly correlated results, indicating potential redundancy.
arXiv Detail & Related papers (2023-05-25T08:07:07Z) - On the Intrinsic and Extrinsic Fairness Evaluation Metrics for
Contextualized Language Representations [74.70957445600936]
Multiple metrics have been introduced to measure fairness in various natural language processing tasks.
These metrics can be roughly categorized into two categories: 1) extrinsic metrics for evaluating fairness in downstream applications and 2) intrinsic metrics for estimating fairness in upstream language representation models.
arXiv Detail & Related papers (2022-03-25T22:17:43Z) - Evaluating Metrics for Bias in Word Embeddings [44.14639209617701]
We formalize a bias definition based on the ideas from previous works and derive conditions for bias metrics.
We propose a new metric, SAME, to address the shortcomings of existing metrics and mathematically prove that SAME behaves appropriately.
arXiv Detail & Related papers (2021-11-15T16:07:15Z) - Measuring Disentanglement: A Review of Metrics [2.959278299317192]
Learning to disentangle and represent factors of variation in data is an important problem in AI.
We propose a new taxonomy in which all metrics fall into one of three families: intervention-based, predictor-based and information-based.
We conduct extensive experiments, where we isolate representation properties to compare all metrics on many aspects.
arXiv Detail & Related papers (2020-12-16T21:28:25Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.