Fairness-Aware Data Valuation for Supervised Learning
- URL: http://arxiv.org/abs/2303.16963v1
- Date: Wed, 29 Mar 2023 18:51:13 GMT
- Title: Fairness-Aware Data Valuation for Supervised Learning
- Authors: José Pombal, Pedro Saleiro, Mário A. T. Figueiredo, Pedro Bizarro
- Abstract summary: We propose Fairness-Aware Data valuatiOn (FADO) to incorporate fairness concerns into a series of ML-related tasks.
We show how FADO can be applied as the basis for unfairness mitigation pre-processing techniques.
Our methods achieve promising results -- up to a 40 p.p. improvement in fairness at a less than 1 p.p. loss in performance compared to a baseline.
- Score: 4.874780144224057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data valuation is an ML field that studies the value of training instances
towards a given predictive task. Although data bias is one of the main sources
of downstream model unfairness, previous work in data valuation does not
consider how training instances may influence both performance and fairness of
ML models. Thus, we propose Fairness-Aware Data valuatiOn (FADO), a data
valuation framework that can be used to incorporate fairness concerns into a
series of ML-related tasks (e.g., data pre-processing, exploratory data
analysis, active learning). We propose an entropy-based data valuation metric
suited to address our two-pronged goal of maximizing both performance and
fairness, which is more computationally efficient than existing metrics. We
then show how FADO can be applied as the basis for unfairness mitigation
pre-processing techniques. Our methods achieve promising results -- up to a 40
p.p. improvement in fairness at a less than 1 p.p. loss in performance compared
to a baseline -- and promote fairness in a data-centric way, where a deeper
understanding of data quality takes center stage.
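The abstract does not spell out the metric itself, so the following is only an assumption-laden sketch of what an entropy-based, fairness-aware per-instance score could look like: each instance is valued by its label's surprise under a reference model, plus a term rewarding labels that are hard to predict from the sensitive attribute alone. The function name, the logistic reference model, and the weight `lam` are illustrative, not FADO's actual formulation.

```python
# Hypothetical entropy-style valuation, NOT FADO's actual metric: value an
# instance by how surprising its label is under a reference model, plus a
# bonus for labels that are hard to predict from the sensitive attribute
# alone (counter-stereotypical examples).
import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy_value_scores(X, y, s, lam=0.5):
    """X: features; y: binary labels (0/1); s: binary sensitive attribute
    (0/1), both numpy int arrays; lam: assumed weight on the fairness term.
    Higher score = more valuable instance."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    p_model = model.predict_proba(X)[np.arange(len(y)), y]   # P(y_i | x_i)
    perf_term = -np.log(np.clip(p_model, 1e-12, 1.0))        # label surprise
    # P(y_i | s_i) from group base rates; labels that are surprising given s
    # counter the dataset's group-label correlation.
    base = np.array([y[s == 0].mean(), y[s == 1].mean()])
    p_group = np.where(y == 1, base[s], 1.0 - base[s])
    fair_term = -np.log(np.clip(p_group, 1e-12, 1.0))
    return perf_term + lam * fair_term
```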
Related papers
- Targeted Learning for Data Fairness [52.59573714151884]
We expand fairness inference by evaluating fairness in the data generating process itself.
We derive estimators for demographic parity, equal opportunity, and conditional mutual information.
To validate our approach, we perform several simulations and apply our estimators to real data.
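For intuition, the data-level demographic parity being estimated can be written as a naive plug-in quantity; the targeted (debiased) estimators the paper derives add correction terms that this sketch omits.

```python
# Plug-in estimate of demographic parity in the data itself: the difference
# in positive-label rates across groups. Illustrative only; not the paper's
# targeted estimator.
import numpy as np

def demographic_parity_gap(y, s):
    """y: binary outcomes, s: binary group membership (numpy arrays)."""
    return abs(y[s == 1].mean() - y[s == 0].mean())
```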
arXiv Detail & Related papers (2025-02-06T18:51:28Z)
- DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks [40.91931801667421]
This paper presents a novel global-to-local algorithm called DUET that can exploit the feedback loop by interleaving a data selection method with Bayesian optimization.
As a result, DUET can efficiently refine the training data mixture from a pool of data domains to maximize the model's performance on the unseen evaluation task.
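A rough sketch of that feedback loop (not DUET's algorithm) is below: an off-the-shelf Bayesian optimizer proposes mixture weights over data domains, and a synthetic `train_and_evaluate` stands in for training on the mixture and scoring it on the unseen task. The `scikit-optimize` dependency and the domain names are assumptions of this sketch.

```python
# Sketch of a BO-over-data-mixtures loop, assuming scikit-optimize.
import numpy as np
from skopt import gp_minimize

DOMAINS = ["web", "code", "books"]              # illustrative data domains

def train_and_evaluate(mixture):
    # Synthetic stand-in for "train on this mixture, score on the unseen
    # evaluation task"; peaks at a hidden target mixture.
    target = {"web": 0.5, "code": 0.3, "books": 0.2}
    return -sum((mixture[d] - target[d]) ** 2 for d in DOMAINS)

def objective(w):
    w = np.asarray(w) / (np.sum(w) + 1e-12)     # normalize to a mixture
    return -train_and_evaluate(dict(zip(DOMAINS, w)))  # BO minimizes

result = gp_minimize(objective, dimensions=[(0.0, 1.0)] * len(DOMAINS),
                     n_calls=25, random_state=0)
best_mixture = np.asarray(result.x) / np.sum(result.x)
```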
arXiv Detail & Related papers (2025-02-01T01:52:32Z)
- Data Preparation for Fairness-Performance Trade-Offs: A Practitioner-Friendly Alternative? [11.172805305320592]
Pre-processing techniques, which mitigate bias before training, are effective but may impact model performance and pose integration difficulties.
This report proposes an empirical evaluation of how optimally selected fairness-aware practices, applied in early ML lifecycle stages, can enhance both fairness and performance.
Using FATE, we will analyze the fairness-performance trade-off, comparing pipelines selected by FATE with results by pre-processing bias mitigation techniques.
arXiv Detail & Related papers (2024-12-20T14:12:39Z)
- Data Acquisition for Improving Model Fairness using Reinforcement Learning [3.3916160303055563]
We focus on the task of acquiring additional labeled data points for training the downstream machine learning model to rapidly improve its fairness.
We present DataSift, a data acquisition framework based on the idea of data valuation that relies on partitioning and multi-armed bandits to determine the most valuable data points to acquire.
We empirically evaluate DataSift on several real-world and synthetic datasets and show that the fairness of a machine learning model can be significantly improved even while acquiring a few data points.
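A minimal UCB1 loop in the spirit of that description (not DataSift itself): arms are partitions of the candidate pool, and `fairness_gain` stands in for whatever measured fairness improvement follows from acquiring a point.

```python
# Illustrative UCB1 bandit over data partitions; reward is the observed
# fairness improvement after acquiring a point from that partition.
import math, random

def ucb1_acquire(partitions, fairness_gain, budget):
    """partitions: list of lists of candidate points.
    fairness_gain(point) -> measured improvement in the fairness metric."""
    counts = [0] * len(partitions)
    totals = [0.0] * len(partitions)
    acquired = []
    for t in range(1, budget + 1):
        if t <= len(partitions):
            arm = t - 1                     # play each arm once first
        else:
            arm = max(range(len(partitions)), key=lambda a:
                      totals[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        point = random.choice(partitions[arm])
        reward = fairness_gain(point)
        counts[arm] += 1
        totals[arm] += reward
        acquired.append(point)
    return acquired
```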
arXiv Detail & Related papers (2024-12-04T03:56:54Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
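As a hedged illustration of the density-based family (the paper's exact embeddings, kernel, and sampling rule are not given here), one can estimate each document's local embedding density and sample inversely to it, favoring coverage of sparse regions:

```python
# Sketch of density-based data sampling: keep examples from sparse regions
# of embedding space to improve coverage. Kernel and bandwidth are assumed.
import numpy as np
from sklearn.neighbors import KernelDensity

def density_sample(embeddings, n_keep, bandwidth=1.0, seed=0):
    kde = KernelDensity(bandwidth=bandwidth).fit(embeddings)
    log_density = kde.score_samples(embeddings)     # log p(x) per document
    weights = np.exp(-log_density)                  # up-weight sparse regions
    weights /= weights.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(embeddings), size=n_keep, replace=False, p=weights)
```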
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Data vs. Model Machine Learning Fairness Testing: An Empirical Study [23.535630175567146]
We take the first steps towards evaluating a more holistic approach by testing for fairness both before and after model training.
We evaluate the effectiveness of the proposed approach using an empirical analysis of the relationship between model dependent and independent fairness metrics.
Our results indicate that testing for fairness prior to training can be a "cheap" and effective means of catching a biased data collection process early.
arXiv Detail & Related papers (2024-01-15T14:14:16Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based active learning (AL) are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
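BALD's acquisition score, the uncertainty signal studied here, is the mutual information between predictions and model parameters; a standard Monte-Carlo-dropout estimate looks like this generic sketch (not the paper's code):

```python
# BALD acquisition: entropy of the mean prediction minus the mean entropy
# of individual stochastic predictions. mc_probs has shape (T, N, C):
# T dropout forward passes, N pool points, C classes.
import numpy as np

def bald_scores(mc_probs, eps=1e-12):
    mean_p = mc_probs.mean(axis=0)                                   # (N, C)
    h_mean = -(mean_p * np.log(mean_p + eps)).sum(axis=1)            # H[E p]
    mean_h = -(mc_probs * np.log(mc_probs + eps)).sum(axis=2).mean(axis=0)
    return h_mean - mean_h      # high score = high epistemic uncertainty
```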
arXiv Detail & Related papers (2021-04-14T14:20:22Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
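The abstract does not define the $\rho$-gap, so the snippet below is only a loosely related illustration: a nearest-neighbor proxy for how densely the training data covers given query states, the raw quantity the paper relates to control performance. All names here are assumptions.

```python
# Nearest-neighbor proxy for local training-data density at query states;
# NOT the paper's rho-gap, just the underlying coverage intuition.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def coverage_proxy(train_states, query_states, k=5):
    """Smaller mean k-NN distance = denser training coverage near a query."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_states)
    dists, _ = nn.kneighbors(query_states)
    return dists.mean(axis=1)      # one value per query state
```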
arXiv Detail & Related papers (2020-05-25T12:13:49Z)