On the Intrinsic and Extrinsic Fairness Evaluation Metrics for
Contextualized Language Representations
- URL: http://arxiv.org/abs/2203.13928v1
- Date: Fri, 25 Mar 2022 22:17:43 GMT
- Title: On the Intrinsic and Extrinsic Fairness Evaluation Metrics for
Contextualized Language Representations
- Authors: Yang Trista Cao and Yada Pruksachatkun and Kai-Wei Chang and Rahul
Gupta and Varun Kumar and Jwala Dhamala and Aram Galstyan
- Abstract summary: Multiple metrics have been introduced to measure fairness in various natural language processing tasks.
These metrics can be roughly categorized into two categories: 1) extrinsic metrics for evaluating fairness in downstream applications and 2) intrinsic metrics for estimating fairness in upstream language representation models.
- Score: 74.70957445600936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiple metrics have been introduced to measure fairness in various natural
language processing tasks. These metrics can be roughly categorized into two
categories: 1) \emph{extrinsic metrics} for evaluating fairness in downstream
applications and 2) \emph{intrinsic metrics} for estimating fairness in
upstream contextualized language representation models. In this paper, we
conduct an extensive correlation study between intrinsic and extrinsic metrics
across bias notions using 19 contextualized language models. We find that
intrinsic and extrinsic metrics do not necessarily correlate in their original
setting, even when correcting for metric misalignments, noise in evaluation
datasets, and confounding factors such as experiment configuration for
extrinsic metrics.
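As a rough illustration of the kind of intrinsic-extrinsic correlation analysis described above, the sketch below computes rank correlation between per-model bias scores. The model names and score values are hypothetical placeholders, not results from the paper; the actual study spans 19 contextualized language models, multiple bias notions, and specific intrinsic and extrinsic metrics.

```python
# Minimal sketch of a correlation analysis between intrinsic and extrinsic
# fairness scores across language models. All names and values below are
# hypothetical placeholders.
from scipy.stats import pearsonr, spearmanr

# One intrinsic bias score (e.g., an embedding-association effect size) and
# one extrinsic bias score (e.g., a downstream performance gap between
# demographic groups) per model.
intrinsic = {"model_a": 0.42, "model_b": 0.10, "model_c": 0.75, "model_d": 0.33}
extrinsic = {"model_a": 0.05, "model_b": 0.12, "model_c": 0.04, "model_d": 0.09}

models = sorted(intrinsic)
x = [intrinsic[m] for m in models]
y = [extrinsic[m] for m in models]

# Rank correlation is robust to the different scales of the two metrics;
# Pearson correlation is computed alongside it for reference.
rho, rho_p = spearmanr(x, y)
r, r_p = pearsonr(x, y)
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.2f})")
print(f"Pearson  r   = {r:.2f} (p = {r_p:.2f})")
```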
Related papers
- Machine Translation Meta Evaluation through Translation Accuracy
Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z) - Rethinking Evaluation Metrics of Open-Vocabulary Segmentation [78.76867266561537]
The evaluation process still heavily relies on closed-set metrics without considering the similarity between predicted and ground truth categories.
To tackle this issue, we first survey eleven similarity measurements between two categorical words.
We then design novel evaluation metrics, namely Open mIoU, Open AP, and Open PQ, tailored to three open-vocabulary segmentation tasks.
arXiv Detail & Related papers (2023-11-06T18:59:01Z) - BLEURT Has Universal Translations: An Analysis of Automatic Metrics by
Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z) - Visual Referential Games Further the Emergence of Disentangled
Representations [0.12891210250935145]
This paper investigates how compositionality at the level of the emerging languages, disentanglement at the level of the learned representations, and systematicity relate to each other in the context of visual referential games.
arXiv Detail & Related papers (2023-04-27T20:00:51Z) - Measuring the Measuring Tools: An Automatic Evaluation of Semantic
Metrics for Text Corpora [5.254054636427663]
The ability to compare the semantic similarity between text corpora is important in a variety of natural language processing applications.
We propose a set of automatic and interpretable measures for assessing the characteristics of corpus-level semantic similarity metrics.
arXiv Detail & Related papers (2022-11-29T14:47:07Z) - Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z) - Measuring Fairness with Biased Rulers: A Survey on Quantifying Biases in
Pretrained Language Models [2.567384209291337]
An increasing awareness of biased patterns in natural language processing resources has motivated many metrics to quantify 'bias' and 'fairness'.
We survey the existing literature on fairness metrics for pretrained language models and experimentally evaluate compatibility.
We find that many metrics are not compatible and highly depend on (i) templates, (ii) attribute and target seeds and (iii) the choice of embeddings.
arXiv Detail & Related papers (2021-12-14T15:04:56Z) - LCEval: Learned Composite Metric for Caption Evaluation [37.2313913156926]
We propose a neural network-based learned metric to improve caption-level caption evaluation.
This paper investigates the relationship between different linguistic features and the caption-level correlation of the learned metrics.
Our proposed metric not only outperforms the existing metrics in terms of caption-level correlation but also shows strong system-level correlation with human assessments.
arXiv Detail & Related papers (2020-12-24T06:38:24Z) - Curious Case of Language Generation Evaluation Metrics: A Cautionary
Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community to consider more carefully how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)