Gender Biases in Automatic Evaluation Metrics for Image Captioning
- URL: http://arxiv.org/abs/2305.14711v3
- Date: Fri, 3 Nov 2023 00:50:25 GMT
- Title: Gender Biases in Automatic Evaluation Metrics for Image Captioning
- Authors: Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, Nanyun Peng
- Abstract summary: We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
- Score: 87.15170977240643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model-based evaluation metrics (e.g., CLIPScore and GPTScore) have
demonstrated decent correlations with human judgments in various language
generation tasks. However, their impact on fairness remains largely unexplored.
It is widely recognized that pretrained models can inadvertently encode
societal biases; employing these models for evaluation may therefore
perpetuate and amplify those biases. For example, an evaluation metric
may favor the caption "a woman is calculating an account book" over "a man is
calculating an account book," even if the image only shows male accountants. In
this paper, we conduct a systematic study of gender biases in model-based
automatic evaluation metrics for image captioning tasks. We start by curating a
dataset comprising profession, activity, and object concepts that carry
stereotypical gender associations. Then, we demonstrate the negative
consequences of using these biased metrics, including the inability to
differentiate between biased and unbiased generations, as well as the
propagation of biases to generation models through reinforcement learning.
Finally, we present a simple and effective way to mitigate the metric bias
without hurting the correlations with human judgments. Our dataset and
framework lay the foundation for understanding the potential harm of
model-based evaluation metrics, and facilitate future works to develop more
inclusive evaluation metrics.
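The bias probe the abstract describes — checking whether a metric prefers one gender's caption for the same image — can be sketched generically. This is a minimal illustration, not the paper's actual protocol: `score_fn`, `gender_gap`, and `toy_metric` are hypothetical names, and a real study would plug in an actual model-based metric such as CLIPScore.

```python
from statistics import mean

def gender_gap(score_fn, image, caption_pairs):
    """Average score difference between gender-swapped captions.

    caption_pairs: list of (female_caption, male_caption) tuples describing
    the same image. A mean gap far from 0 suggests the metric prefers one
    gender regardless of image content.
    """
    gaps = [score_fn(image, f) - score_fn(image, m) for f, m in caption_pairs]
    return mean(gaps)

# Toy stand-in metric that always scores "woman" captions higher,
# mimicking the biased behavior described in the abstract.
def toy_metric(image, caption):
    return 1.0 if "woman" in caption else 0.5

pairs = [("a woman is calculating an account book",
          "a man is calculating an account book")]
print(gender_gap(toy_metric, image=None, caption_pairs=pairs))  # 0.5
```

On an unbiased metric the gap should be near zero for images whose gender content is ambiguous or balanced; the toy metric above shows the pathological case.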
Related papers
- Identifying and examining machine learning biases on Adult dataset [0.7856362837294112]
This research delves into the reduction of machine learning model bias through Ensemble Learning.
Our rigorous methodology comprehensively assesses bias across various categorical variables, ultimately revealing a pronounced gender attribute bias.
This study underscores ethical considerations and advocates the implementation of hybrid models for a data-driven society marked by inclusivity and impartiality.
arXiv Detail & Related papers (2023-10-13T19:41:47Z)
- Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets [52.77024349608834]
Vision-language models can perpetuate and amplify societal biases learned during pre-training on uncurated image-text pairs from the internet.
COCO Captions is the most commonly used dataset for evaluating bias between background context and the gender of people in situ.
We propose a novel dataset debiasing pipeline to augment the COCO dataset with synthetic, gender-balanced contrast sets.
arXiv Detail & Related papers (2023-05-24T17:59:18Z)
- Is Your Model "MADD"? A Novel Metric to Evaluate Algorithmic Fairness for Predictive Student Models [0.0]
We propose a novel metric, the Model Absolute Density Distance (MADD), to analyze models' discriminatory behaviors.
We evaluate our approach on the common task of predicting student success in online courses, using several common predictive classification models.
arXiv Detail & Related papers (2023-05-24T16:55:49Z)
- Metrics for Dataset Demographic Bias: A Case Study on Facial Expression Recognition [4.336779198334903]
One of the most prominent types of demographic bias is statistical imbalance in the representation of demographic groups in datasets.
We develop a taxonomy for the classification of these metrics, providing a practical guide for the selection of appropriate metrics.
The paper provides valuable insights for researchers in AI and related fields to mitigate dataset bias and improve the fairness and accuracy of AI models.
arXiv Detail & Related papers (2023-03-28T11:04:18Z)
- Choose Your Lenses: Flaws in Gender Bias Evaluation [29.16221451643288]
We assess the current paradigm of gender bias evaluation and identify several flaws in it.
First, we highlight the importance of extrinsic bias metrics that measure how a model's performance on some task is affected by gender.
Second, we find that datasets and metrics are often coupled, and discuss how their coupling hinders the ability to obtain reliable conclusions.
arXiv Detail & Related papers (2022-10-20T17:59:55Z)
- Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
- Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z)
- Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race.
Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables.
This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
arXiv Detail & Related papers (2021-09-16T23:40:28Z)
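One of the related papers above quantifies social bias in evaluation metrics with the Word Embeddings Association Test (WEAT). A minimal sketch of the standard WEAT effect size on toy 2-d vectors follows; the embeddings here are illustrative stand-ins, not real word vectors.

```python
import numpy as np

def cos(u, v):
    # Cosine similarity between two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size.

    X, Y: lists of target embeddings (e.g. two concept groups);
    A, B: lists of attribute embeddings (e.g. male vs. female terms).
    s(w) = mean cos(w, a) - mean cos(w, b); the effect size is the
    standardized difference of s over X vs. Y, bounded in [-2, 2].
    """
    def s(w):
        return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])
    sx = [s(x) for x in X]
    sy = [s(y) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

# Toy embeddings where X aligns with attribute A and Y with attribute B,
# so a large positive effect size is expected.
A = [np.array([1.0, 0.0])]
B = [np.array([0.0, 1.0])]
X = [np.array([0.9, 0.1]), np.array([0.8, 0.2])]
Y = [np.array([0.1, 0.9]), np.array([0.2, 0.8])]
print(weat_effect_size(X, Y, A, B))  # positive, ≈ 1.72
```

An effect size near 0 indicates no differential association between the target groups and the attributes; values approaching ±2 indicate strong association.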
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.