Goodhart's Law Applies to NLP's Explanation Benchmarks
- URL: http://arxiv.org/abs/2308.14272v1
- Date: Mon, 28 Aug 2023 03:03:03 GMT
- Title: Goodhart's Law Applies to NLP's Explanation Benchmarks
- Authors: Jennifer Hsia, Danish Pruthi, Aarti Singh, Zachary C. Lipton
- Abstract summary: We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
- Score: 57.26445915212884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the rising popularity of saliency-based explanations, the research
community remains at an impasse, facing doubts concerning their purpose,
efficacy, and tendency to contradict each other. Seeking to unite the
community's efforts around common goals, several recent works have proposed
evaluation metrics. In this paper, we critically examine two sets of metrics:
the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics,
focusing our inquiry on natural language processing. First, we show that we can
inflate a model's comprehensiveness and sufficiency scores dramatically without
altering its predictions or explanations on in-distribution test inputs. Our
strategy exploits the tendency for extracted explanations and their complements
to be "out-of-support" relative to each other and in-distribution inputs. Next,
we demonstrate that the EVAL-X metrics can be inflated arbitrarily by a simple
method that encodes the label, even though EVAL-X is precisely motivated to
address such exploits. Our results raise doubts about the ability of current
metrics to guide explainability research, underscoring the need for a broader
reassessment of what precisely these metrics are intended to capture.
Related papers
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models [24.144513068228903]
We introduce Correlational Explanatory Faithfulness (CEF), a metric that can be used in faithfulness tests based on input interventions.
Our metric accounts for the total shift in the model's predicted label distribution.
We then introduce the Correlational Counterfactual Test (CCT) by instantiating CEF on the Counterfactual Test.
arXiv Detail & Related papers (2024-04-04T04:20:04Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z)
- Uncertain Facial Expression Recognition via Multi-task Assisted Correction [43.02119884581332]
We propose a novel method of multi-task assisted correction in addressing uncertain facial expression recognition called MTAC.
Specifically, a confidence estimation block and a weighted regularization module are applied to highlight solid samples and suppress uncertain samples in every batch.
Experiments on RAF-DB, AffectNet, and AffWild2 datasets demonstrate that the MTAC obtains substantial improvements over baselines when facing synthetic and real uncertainties.
arXiv Detail & Related papers (2022-12-14T10:28:08Z)
- Evaluate Confidence Instead of Perplexity for Zero-shot Commonsense Reasoning [85.1541170468617]
This paper reconsiders the nature of commonsense reasoning and proposes a novel commonsense reasoning metric, Non-Replacement Confidence (NRC).
Our proposed novel method boosts zero-shot performance on two commonsense reasoning benchmark datasets and on seven further commonsense question-answering datasets.
arXiv Detail & Related papers (2022-08-23T14:42:14Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- Counterfactual Evaluation for Explainable AI [21.055319253405603]
We propose a new methodology to evaluate the faithfulness of explanations from the counterfactual reasoning perspective.
We introduce two algorithms to find the proper counterfactuals in both discrete and continuous scenarios and then use the acquired counterfactuals to measure faithfulness.
arXiv Detail & Related papers (2021-09-05T01:38:49Z)
- Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics [17.677637487977208]
Modern summarization models generate highly fluent but often factually unreliable outputs.
Due to the lack of common benchmarks, metrics attempting to measure the factuality of automatically generated summaries cannot be compared.
We devise a typology of factual errors and use it to collect human annotations of generated summaries from state-of-the-art summarization systems.
arXiv Detail & Related papers (2021-04-27T17:28:07Z)