Related papers: Lazy Data Practices Harm Fairness Research

URL: http://arxiv.org/abs/2404.17293v2
Date: Wed, 19 Jun 2024 00:52:16 GMT
Title: Lazy Data Practices Harm Fairness Research
Authors: Jan Simson, Alessandro Fabris, Christoph Kern
Abstract summary: We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings. Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
Score: 49.02318458244464
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field by highlighting shortcomings and proposing recommendations for improvement. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications. Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research. By conducting exemplary analyses on the utilization of prominent datasets, we demonstrate how unreflective data decisions disproportionately affect minority groups, fairness metrics, and resultant model comparisons. Additionally, we identify supplementary factors such as limitations in publicly available data, privacy considerations, and a general lack of awareness, which exacerbate these challenges. To address these issues, we propose a set of recommendations for data usage in fairness research centered on transparency and responsible inclusion. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.

Related papers

Beyond Internal Data: Bounding and Estimating Fairness from Incomplete Data [26.037607208689977]
In high-stakes domains such as lending, hiring, and healthcare, ensuring fairness in AI systems is critical. In industry settings, legal and privacy concerns restrict the collection of demographic data required to assess group disparities. Our work seeks to leverage such available separate data to estimate model fairness when complete data is inaccessible.
arXiv Detail & Related papers (2025-08-18T15:57:30Z)
Beyond Internal Data: Constructing Complete Datasets for Fairness Testing [26.037607208689977]
This work focuses on evaluating classifier fairness when complete datasets including demographics are inaccessible. We propose leveraging separate overlapping datasets to construct complete synthetic data that includes demographic information. We validate the fidelity of the synthetic data by comparing it to real data, and empirically demonstrate that fairness metrics derived from testing on such synthetic data are consistent with those obtained from real data.
arXiv Detail & Related papers (2025-07-24T16:35:42Z)
Data Fusion for Partial Identification of Causal Effects [62.56890808004615]
We propose a novel partial identification framework that enables researchers to answer key questions: Is the causal effect positive or negative? How severe must assumption violations be to overturn this conclusion? We apply our framework to the Project STAR study, which investigates the effect of classroom size on students' third-grade standardized test performance.
arXiv Detail & Related papers (2025-05-30T07:13:01Z)
Targeted Learning for Data Fairness [52.59573714151884]
We expand fairness inference by evaluating fairness in the data generating process itself. We derive estimators for demographic parity, equal opportunity, and conditional mutual information. To validate our approach, we perform several simulations and apply our estimators to real data.
arXiv Detail & Related papers (2025-02-06T18:51:28Z)
Data-Driven Fairness Generalization for Deepfake Detection [1.2221087476416053]
Biases in the training data for deepfake detection can result in varying levels of performance across different demographic groups. We propose a data-driven framework for tackling the fairness generalization problem in deepfake detection by leveraging synthetic datasets and model optimization.
arXiv Detail & Related papers (2024-12-21T01:28:35Z)
Fairness Issues and Mitigations in (Differentially Private) Socio-demographic Data Processes [43.07159967207698]
This paper shows that surveys of important societal relevance introduce sampling errors that unevenly impact group-level estimates. To address these issues, this paper introduces an optimization approach modeled on real-world survey design processes. Privacy-preserving methods used to determine sampling rates can further impact these fairness issues.
arXiv Detail & Related papers (2024-08-16T01:13:36Z)
Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations [63.52709761339949]
We first contribute a dedicated dataset called the Fair Forgery Detection (FairFD) dataset, where we prove the racial bias of public state-of-the-art (SOTA) methods. We design novel metrics including Approach Averaged Metric and Utility Regularized Metric, which can avoid deceptive results. We also present an effective and robust post-processing technique, Bias Pruning with Fair Activations (BPFA), which improves fairness without requiring retraining or weight updates.
arXiv Detail & Related papers (2024-07-19T14:53:18Z)
Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
AIM: Attributing, Interpreting, Mitigating Data Unfairness [40.351282126410545]
Existing fair machine learning (FairML) research has predominantly focused on mitigating discriminative bias in the model prediction. We investigate a novel research problem: discovering samples that reflect biases/prejudices from the training data. We propose practical algorithms for measuring and countering sample bias.
arXiv Detail & Related papers (2024-06-13T05:21:10Z)
Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection [8.12993269922936]
We argue that data collection for AI should be performed in a responsible manner. We propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics.
arXiv Detail & Related papers (2023-08-22T18:01:27Z)
Deep Learning on a Healthy Data Diet: Finding Important Examples for Fairness [15.210232622716129]
Data-driven predictive solutions predominant in commercial applications tend to suffer from biases and stereotypes. Data augmentation reduces gender bias by adding counterfactual examples to the training dataset. We show that some of the examples in the augmented dataset can be unimportant or even harmful for fairness.
arXiv Detail & Related papers (2022-11-20T22:42:30Z)
Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being. A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations. Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented. Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
A survey on datasets for fairness-aware machine learning [6.962333053044713]
A large variety of fairness-aware machine learning solutions have been proposed. In this paper, we overview real-world datasets used for fairness-aware machine learning. For a deeper understanding of bias and fairness in the datasets, we investigate the interesting relationships using exploratory analysis.
arXiv Detail & Related papers (2021-10-01T16:54:04Z)
Through the Data Management Lens: Experimental Analysis and Evaluation of Fair Classification [75.49600684537117]
Data management research is showing an increasing presence and interest in topics related to data and algorithmic fairness. We contribute a broad analysis of 13 fair classification approaches and additional variants, over their correctness, fairness, efficiency, scalability, and stability. Our analysis highlights novel insights on the impact of different metrics and high-level approach characteristics on different aspects of performance.
arXiv Detail & Related papers (2021-01-18T22:55:40Z)
Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management. We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.