Fairness-Aware Data Valuation for Supervised Learning
- URL: http://arxiv.org/abs/2303.16963v1
- Date: Wed, 29 Mar 2023 18:51:13 GMT
- Title: Fairness-Aware Data Valuation for Supervised Learning
- Authors: José Pombal, Pedro Saleiro, Mário A. T. Figueiredo, Pedro Bizarro
- Abstract summary: We propose Fairness-Aware Data valuatiOn (FADO) to incorporate fairness concerns into a series of ML-related tasks.
We show how FADO can be applied as the basis for unfairness mitigation pre-processing techniques.
Our methods achieve promising results -- up to a 40 p.p. improvement in fairness at a less than 1 p.p. loss in performance compared to a baseline.
- Score: 4.874780144224057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data valuation is an ML field that studies the value of training instances
towards a given predictive task. Although data bias is one of the main sources
of downstream model unfairness, previous work in data valuation does not
consider how training instances may influence both performance and fairness of
ML models. Thus, we propose Fairness-Aware Data valuatiOn (FADO), a data
valuation framework that can be used to incorporate fairness concerns into a
series of ML-related tasks (e.g., data pre-processing, exploratory data
analysis, active learning). We propose an entropy-based data valuation metric
suited to address our two-pronged goal of maximizing both performance and
fairness, which is more computationally efficient than existing metrics. We
then show how FADO can be applied as the basis for unfairness mitigation
pre-processing techniques. Our methods achieve promising results -- up to a 40
p.p. improvement in fairness at a less than 1 p.p. loss in performance compared
to a baseline -- and promote fairness in a data-centric way, where a deeper
understanding of data quality takes center stage.
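The abstract does not spell out the exact form of FADO's entropy-based metric. As a minimal sketch of what an entropy-based per-instance score might look like (all names and values below are hypothetical, not taken from the paper), one could rank training instances by the Shannon entropy of a model's predicted class probabilities:

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each row of predicted class probabilities."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

# toy predicted probabilities for four instances of a binary task
probs = np.array([[0.90, 0.10],
                  [0.50, 0.50],
                  [0.80, 0.20],
                  [0.05, 0.95]])
values = predictive_entropy(probs)  # the 0.5/0.5 row gets the highest entropy
```

Entropy here is only a proxy for how uncertain or ambiguous an instance is; the paper's actual metric additionally targets group fairness, which this sketch does not capture.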
Related papers
- Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs [11.24476329991465]
Training large language models (LLMs) for external tool usage is a rapidly expanding field.
The absence of systematic data quality checks poses complications for properly training and testing models.
We propose two approaches for assessing the reliability of data for training LLMs to use external tools.
arXiv Detail & Related papers (2024-09-24T17:20:02Z)
- Debiasing Machine Unlearning with Counterfactual Examples [31.931056076782202]
We analyze the causal factors behind the unlearning process and mitigate biases at both data and algorithmic levels.
We introduce an intervention-based approach, where knowledge to forget is erased with a debiased dataset.
Our method outperforms existing machine unlearning baselines on evaluation metrics.
arXiv Detail & Related papers (2024-04-24T09:33:10Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Data vs. Model Machine Learning Fairness Testing: An Empirical Study [23.535630175567146]
We take the first steps towards evaluating a more holistic approach by testing for fairness both before and after model training.
We evaluate the effectiveness of the proposed approach using an empirical analysis of the relationship between model dependent and independent fairness metrics.
Our results indicate that testing for fairness prior to training can be a "cheap" and effective means of catching a biased data collection process early.
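As a minimal sketch of such a pre-training check (the function and data below are hypothetical illustrations, not the paper's metrics), one model-independent fairness signal is the gap in positive-label base rates across protected groups in the raw data:

```python
import numpy as np

def label_rate_disparity(y, groups):
    """Largest gap in positive-label base rates between protected groups.

    Model-independent: runnable on the raw dataset before any training."""
    y, groups = np.asarray(y), np.asarray(groups)
    rates = [y[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

# toy labels and group memberships
y = np.array([1, 1, 0, 1, 0, 0])
groups = np.array([0, 0, 0, 1, 1, 1])
gap = label_rate_disparity(y, groups)  # 2/3 vs 1/3 positive rate -> gap of 1/3
```

A large gap does not prove the collection process was biased, but it is exactly the kind of cheap early warning the paper's before-training testing argues for.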
arXiv Detail & Related papers (2024-01-15T14:14:16Z)
- LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
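LAVA's actual construction is an optimal-transport distance over feature-label pairs. As a much simpler illustration of the class-wise idea (helper names are hypothetical, and this assumes 1-D features with equal-size train/validation samples per class), one can average per-class Wasserstein-1 distances:

```python
import numpy as np

def w1_1d(x, y):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples:
    the mean absolute difference of their sorted values."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def classwise_w1(tr_x, tr_y, va_x, va_y):
    """Average the per-class W1 distances between train and validation features."""
    classes = np.unique(tr_y)
    return np.mean([w1_1d(tr_x[tr_y == c], va_x[va_y == c]) for c in classes])

tr_x = np.array([0.0, 1.0, 5.0, 6.0]); tr_y = np.array([0, 0, 1, 1])
va_x = np.array([0.0, 1.0, 7.0, 8.0]); va_y = np.array([0, 0, 1, 1])
dist = classwise_w1(tr_x, tr_y, va_x, va_y)  # class 0 matches (0), class 1 is shifted by 2
```

The sorted-difference formula is a valid closed form for W1 only in one dimension with equal sample sizes; the paper works with a more general transport problem.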
arXiv Detail & Related papers (2023-04-28T19:05:16Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- FORML: Learning to Reweight Data for Fairness [2.105564340986074]
We introduce Fairness Optimized Reweighting via Meta-Learning (FORML).
FORML balances fairness constraints and accuracy by jointly optimizing training sample weights and a neural network's parameters.
We show that FORML improves equality of opportunity fairness criteria over existing state-of-the-art reweighting methods by approximately 1% on image classification tasks and by approximately 5% on a face prediction task.
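FORML learns the sample weights jointly with the model via meta-learning, which is beyond a short sketch. As a far simpler, static illustration of reweighting for fairness (not the paper's method; the function below is hypothetical), one can assign weights that give every protected group equal total weight:

```python
import numpy as np

def group_balance_weights(groups):
    """Per-sample weights giving each protected group equal total weight,
    while keeping the overall weight sum equal to the number of samples."""
    groups = np.asarray(groups)
    uniq, counts = np.unique(groups, return_counts=True)
    count_of = dict(zip(uniq.tolist(), counts.tolist()))
    n, k = len(groups), len(uniq)
    return np.array([n / (k * count_of[g]) for g in groups])

# three majority-group samples and one minority-group sample
w = group_balance_weights([0, 0, 0, 1])  # the minority sample gets weight 2.0
```

Such weights would typically be plugged into a weighted training loss; FORML instead updates the weights during training to trade off a fairness constraint against accuracy.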
arXiv Detail & Related papers (2022-02-03T17:36:07Z)
- Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based active learning (AL) are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
arXiv Detail & Related papers (2021-04-14T14:20:22Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
We experimentally verify that the distilled dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all information) and is not responsible for any consequences.