Fairness-Aware Data Valuation for Supervised Learning
- URL: http://arxiv.org/abs/2303.16963v1
- Date: Wed, 29 Mar 2023 18:51:13 GMT
- Title: Fairness-Aware Data Valuation for Supervised Learning
- Authors: José Pombal, Pedro Saleiro, Mário A. T. Figueiredo, Pedro Bizarro
- Abstract summary: We propose Fairness-Aware Data valuatiOn (FADO) to incorporate fairness concerns into a series of ML-related tasks.
We show how FADO can be applied as the basis for unfairness mitigation pre-processing techniques.
Our methods achieve promising results -- up to a 40 p.p. improvement in fairness at a less than 1 p.p. loss in performance compared to a baseline.
- Score: 4.874780144224057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data valuation is an ML field that studies the value of training instances
towards a given predictive task. Although data bias is one of the main sources
of downstream model unfairness, previous work in data valuation does not
consider how training instances may influence both performance and fairness of
ML models. Thus, we propose Fairness-Aware Data valuatiOn (FADO), a data
valuation framework that can be used to incorporate fairness concerns into a
series of ML-related tasks (e.g., data pre-processing, exploratory data
analysis, active learning). We propose an entropy-based data valuation metric
suited to address our two-pronged goal of maximizing both performance and
fairness, which is more computationally efficient than existing metrics. We
then show how FADO can be applied as the basis for unfairness mitigation
pre-processing techniques. Our methods achieve promising results -- up to a 40
p.p. improvement in fairness at a less than 1 p.p. loss in performance compared
to a baseline -- and promote fairness in a data-centric way, where a deeper
understanding of data quality takes center stage.
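The abstract does not spell out the exact form of FADO's entropy-based metric. As a minimal sketch of what an entropy-based per-instance score might look like (all names and values below are hypothetical, not taken from the paper), one could rank training instances by the Shannon entropy of a model's predicted class probabilities:

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each row of predicted class probabilities."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

# toy predicted probabilities for four instances of a binary task
probs = np.array([[0.90, 0.10],
                  [0.50, 0.50],
                  [0.80, 0.20],
                  [0.05, 0.95]])
values = predictive_entropy(probs)  # the 0.5/0.5 row gets the highest entropy
```

Entropy here is only a proxy for how uncertain or ambiguous an instance is; the paper's actual metric additionally targets group fairness, which this sketch does not capture.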
Related papers
- Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs [11.24476329991465]
Training large language models (LLMs) for external tool usage is a rapidly expanding field.
The absence of systematic data quality checks poses complications for properly training and testing models.
We propose two approaches for assessing the reliability of data for training LLMs to use external tools.
arXiv Detail & Related papers (2024-09-24T17:20:02Z)
- Debiasing Machine Unlearning with Counterfactual Examples [31.931056076782202]
We analyze the causal factors behind the unlearning process and mitigate biases at both data and algorithmic levels.
We introduce an intervention-based approach, where knowledge to forget is erased with a debiased dataset.
Our method outperforms existing machine unlearning baselines on evaluation metrics.
arXiv Detail & Related papers (2024-04-24T09:33:10Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Data vs. Model Machine Learning Fairness Testing: An Empirical Study [23.535630175567146]
We take the first steps towards evaluating a more holistic approach by testing for fairness both before and after model training.
We evaluate the effectiveness of the proposed approach using an empirical analysis of the relationship between model dependent and independent fairness metrics.
Our results indicate that testing for fairness prior to training can be a "cheap" and effective means of catching a biased data collection process early.
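As a minimal sketch of such a pre-training check (the function and data below are hypothetical illustrations, not the paper's metrics), one model-independent fairness signal is the gap in positive-label base rates across protected groups in the raw data:

```python
import numpy as np

def label_rate_disparity(y, groups):
    """Largest gap in positive-label base rates between protected groups.

    Model-independent: runnable on the raw dataset before any training."""
    y, groups = np.asarray(y), np.asarray(groups)
    rates = [y[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

# toy labels and group memberships
y = np.array([1, 1, 0, 1, 0, 0])
groups = np.array([0, 0, 0, 1, 1, 1])
gap = label_rate_disparity(y, groups)  # 2/3 vs 1/3 positive rate -> gap of 1/3
```

A large gap does not prove the collection process was biased, but it is exactly the kind of cheap early warning the paper's before-training testing argues for.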
arXiv Detail & Related papers (2024-01-15T14:14:16Z)
- LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
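LAVA's actual construction is an optimal-transport distance over feature-label pairs. As a much simpler illustration of the class-wise idea (helper names are hypothetical, and this assumes 1-D features with equal-size train/validation samples per class), one can average per-class Wasserstein-1 distances:

```python
import numpy as np

def w1_1d(x, y):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples:
    the mean absolute difference of their sorted values."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def classwise_w1(tr_x, tr_y, va_x, va_y):
    """Average the per-class W1 distances between train and validation features."""
    classes = np.unique(tr_y)
    return np.mean([w1_1d(tr_x[tr_y == c], va_x[va_y == c]) for c in classes])

tr_x = np.array([0.0, 1.0, 5.0, 6.0]); tr_y = np.array([0, 0, 1, 1])
va_x = np.array([0.0, 1.0, 7.0, 8.0]); va_y = np.array([0, 0, 1, 1])
dist = classwise_w1(tr_x, tr_y, va_x, va_y)  # class 0 matches (0), class 1 is shifted by 2
```

The sorted-difference formula is a valid closed form for W1 only in one dimension with equal sample sizes; the paper works with a more general transport problem.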
arXiv Detail & Related papers (2023-04-28T19:05:16Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- FORML: Learning to Reweight Data for Fairness [2.105564340986074]
We introduce Fairness Optimized Reweighting via Meta-Learning (FORML).
FORML balances fairness constraints and accuracy by jointly optimizing training sample weights and a neural network's parameters.
We show that FORML improves equality of opportunity fairness criteria over existing state-of-the-art reweighting methods by approximately 1% on image classification tasks and by approximately 5% on a face prediction task.
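FORML learns the sample weights jointly with the model via meta-learning, which is beyond a short sketch. As a far simpler, static illustration of reweighting for fairness (not the paper's method; the function below is hypothetical), one can assign weights that give every protected group equal total weight:

```python
import numpy as np

def group_balance_weights(groups):
    """Per-sample weights giving each protected group equal total weight,
    while keeping the overall weight sum equal to the number of samples."""
    groups = np.asarray(groups)
    uniq, counts = np.unique(groups, return_counts=True)
    count_of = dict(zip(uniq.tolist(), counts.tolist()))
    n, k = len(groups), len(uniq)
    return np.array([n / (k * count_of[g]) for g in groups])

# three majority-group samples and one minority-group sample
w = group_balance_weights([0, 0, 0, 1])  # the minority sample gets weight 2.0
```

Such weights would typically be plugged into a weighted training loss; FORML instead updates the weights during training to trade off a fairness constraint against accuracy.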
arXiv Detail & Related papers (2022-02-03T17:36:07Z)
- Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based active learning (AL) are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
arXiv Detail & Related papers (2021-04-14T14:20:22Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
We experimentally verify that the distilled dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all information) and is not responsible for any consequences.