On Quantitative Evaluations of Counterfactuals
- URL: http://arxiv.org/abs/2111.00177v1
- Date: Sat, 30 Oct 2021 05:00:36 GMT
- Title: On Quantitative Evaluations of Counterfactuals
- Authors: Frederik Hvilshøj and Alexandros Iosifidis and Ira Assent
- Abstract summary: This paper consolidates work on evaluating visual counterfactual examples through an analysis and experiments.
We find that while most metrics behave as intended for sufficiently simple datasets, some fail to tell the difference between good and bad counterfactuals when the complexity increases.
We propose two new metrics, the Label Variation Score and the Oracle score, which are both less vulnerable to tiny, adversarial-like changes.
- Score: 88.42660013773647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As counterfactual examples become increasingly popular for explaining
decisions of deep learning models, it is essential to understand what
properties quantitative evaluation metrics do capture and, equally important,
what they do not capture. Currently, such understanding is lacking, potentially
slowing down scientific progress. In this paper, we consolidate the work on
evaluating visual counterfactual examples through an analysis and experiments.
We find that while most metrics behave as intended for sufficiently simple
datasets, some fail to tell the difference between good and bad counterfactuals
when the complexity increases. We observe experimentally that metrics give good
scores to tiny adversarial-like changes, wrongly identifying such changes as
superior counterfactual examples. To mitigate this issue, we propose two new
metrics, the Label Variation Score and the Oracle score, which are both less
vulnerable to such tiny changes. We conclude that a proper quantitative
evaluation of visual counterfactual examples should combine metrics to ensure
that all aspects of good counterfactuals are quantified.
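The abstract names the two proposed metrics but does not spell out how they are computed. As a rough illustration of the oracle idea only, the sketch below scores a batch of counterfactuals by how often an independently trained "oracle" classifier assigns them to their target class; the function name, tensor shapes, and use of PyTorch are assumptions made for illustration, not the paper's exact formulation.

```python
import torch


def oracle_score(oracle: torch.nn.Module,
                 counterfactuals: torch.Tensor,
                 target_labels: torch.Tensor) -> float:
    """Fraction of counterfactuals that the oracle assigns to the intended class.

    Hypothetical helper: `oracle` is a classifier trained separately from the
    model being explained, `counterfactuals` is a batch of images of shape
    (N, C, H, W), and `target_labels` holds the classes the counterfactuals
    are meant to depict.
    """
    oracle.eval()  # deterministic scoring: disable dropout / batch-norm updates
    with torch.no_grad():
        predictions = oracle(counterfactuals).argmax(dim=1)
    # Count how many counterfactuals the independent oracle actually recognises
    # as their target class.
    return (predictions == target_labels).float().mean().item()
```

The intuition, following the abstract, is that a classifier trained independently of the explained model is unlikely to be fooled by tiny adversarial-like perturbations, so such changes no longer receive high scores.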
Related papers
- A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice [6.091702876917282]
Classification systems are evaluated in countless papers.
However, we find that evaluation practice is often nebulous.
Many works use so-called 'macro' metrics to rank systems but do not clearly specify what they would expect from such a metric.
arXiv Detail & Related papers (2024-04-25T18:12:43Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences; the results show that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Men Also Do Laundry: Multi-Attribute Bias Amplification [2.492300648514129]
Computer vision systems are not only reproducing but amplifying harmful social biases.
We propose a new metric: Multi-Attribute Bias Amplification.
We validate our proposed metric through an analysis of gender bias amplification on the COCO and imSitu datasets.
arXiv Detail & Related papers (2022-10-21T12:50:15Z)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare the performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)
- Investigating the Role of Negatives in Contrastive Representation Learning [59.30700308648194]
Noise contrastive learning is a popular technique for unsupervised representation learning.
We focus on disambiguating the role of one of these parameters: the number of negative examples.
We find that the results broadly agree with our theory, while our vision experiments are murkier, with performance sometimes even being insensitive to the number of negatives.
arXiv Detail & Related papers (2021-06-18T06:44:16Z)
- Rethinking Automatic Evaluation in Sentence Simplification [10.398614920404727]
We propose a simple modification of QuestEval allowing it to tackle Sentence Simplification.
We show that the adapted metric obtains state-of-the-art correlations, outperforming standard metrics like BLEU and SARI.
We release a new corpus of evaluated simplifications, this time not generated by systems but written by humans.
arXiv Detail & Related papers (2021-04-15T16:13:50Z)
- Measuring Disentanglement: A Review of Metrics [2.959278299317192]
Learning to disentangle and represent factors of variation in data is an important problem in AI.
We propose a new taxonomy in which all metrics fall into one of three families: intervention-based, predictor-based and information-based.
We conduct extensive experiments, where we isolate representation properties to compare all metrics on many aspects.
arXiv Detail & Related papers (2020-12-16T21:28:25Z)
- Tweet Sentiment Quantification: An Experimental Re-Evaluation [88.60021378715636]
Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called "prevalence") of sentiment-related classes.
We re-evaluate those quantification methods following a now consolidated and much more robust experimental protocol.
Results are dramatically different from those obtained by Gao and Sebastiani, and they provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
arXiv Detail & Related papers (2020-11-04T21:41:34Z)
- Weakly-Supervised Disentanglement Without Compromises [53.55580957483103]
Intelligent agents should be able to learn useful representations by observing changes in their environment.
We model such observations as pairs of non-i.i.d. images sharing at least one of the underlying factors of variation.
We show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations.
arXiv Detail & Related papers (2020-02-07T16:39:31Z)