Algorithmic Fairness Datasets: the Story so Far
- URL: http://arxiv.org/abs/2202.01711v4
- Date: Mon, 26 Sep 2022 16:18:15 GMT
- Title: Algorithmic Fairness Datasets: the Story so Far
- Authors: Alessandro Fabris, Stefano Messina, Gianmaria Silvello, Gian Antonio Susto
- Abstract summary: Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
- Score: 68.45921483094705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-driven algorithms are studied in diverse domains to support critical
decisions, directly impacting people's well-being. As a result, a growing
community of researchers has been investigating the equity of existing
algorithms and proposing novel ones, advancing the understanding of risks and
opportunities of automated decision-making for historically disadvantaged
populations. Progress in fair Machine Learning hinges on data, which can be
appropriately used only if adequately documented. Unfortunately, the
algorithmic fairness community suffers from a collective data documentation
debt caused by a lack of information on specific resources (opacity) and
scatteredness of available information (sparsity). In this work, we target data
documentation debt by surveying over two hundred datasets employed in
algorithmic fairness research, and producing standardized and searchable
documentation for each of them. Moreover, we rigorously identify the three most
popular fairness datasets, namely Adult, COMPAS and German Credit, for which we
compile in-depth documentation.
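For readers who want to inspect these three datasets first-hand, the sketch below loads each one from a canonical public source with pandas. The URLs point to the original UCI and ProPublica distributions, and the Adult column names follow the UCI adult.names file; treat this as a convenience sketch, not the documentation pipeline used in the paper.

```python
# Sketch: loading the three most popular fairness datasets from their
# canonical public sources. URLs point to the original distributions.
import pandas as pd

ADULT_COLS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week",
    "native-country", "income",
]

# UCI Adult (a.k.a. Census Income): ~48k records, target = income >50K
adult = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    names=ADULT_COLS, skipinitialspace=True,
)

# ProPublica COMPAS recidivism data (two-year outcome variant)
compas = pd.read_csv(
    "https://raw.githubusercontent.com/propublica/compas-analysis/"
    "master/compas-scores-two-years.csv"
)

# Statlog German Credit: 1000 records, whitespace-separated, no header
german = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "statlog/german/german.data",
    sep=" ", header=None,
)

print(adult.shape, compas.shape, german.shape)
```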
This unifying documentation effort supports multiple contributions. Firstly,
we summarize the merits and limitations of Adult, COMPAS and German Credit,
adding to and unifying recent scholarship, calling into question their
suitability as general-purpose fairness benchmarks. Secondly, we document and
summarize hundreds of available alternatives, annotating their domain and
supported fairness tasks, along with additional properties of interest for
fairness researchers. Finally, we analyze these datasets from the perspective
of five important data curation topics: anonymization, consent, inclusivity,
sensitive attributes, and transparency. We discuss different approaches and
levels of attention to these topics, making them tangible, and distill them
into a set of best practices for the curation of novel resources.
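To make the documentation effort concrete, a hypothetical card schema for such standardized, searchable entries might look like the following; the field names are illustrative, not the authors' actual template.

```python
# Hypothetical schema for standardized, searchable dataset documentation.
# Field names are illustrative; the survey's actual template may differ.
from dataclasses import dataclass

@dataclass
class FairnessDatasetCard:
    name: str
    domain: str                      # e.g. "finance", "criminal justice"
    tasks: list[str]                 # e.g. ["fair classification"]
    sensitive_attributes: list[str]  # e.g. ["race", "sex", "age"]
    # Curation topics analyzed in the survey:
    anonymized: bool | None = None   # None = undocumented (opacity)
    consent_obtained: bool | None = None
    notes: str = ""

adult_card = FairnessDatasetCard(
    name="Adult",
    domain="finance",  # illustrative label
    tasks=["fair classification"],
    sensitive_attributes=["race", "sex", "age"],
)
```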
Related papers
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
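As a concrete illustration of concern (1), a protected-group representation audit takes only a few lines; the 5% threshold below is an arbitrary placeholder, not a value from the paper.

```python
# Minimal sketch: flag under-represented protected groups in a dataset.
# The 5% threshold is an arbitrary placeholder, not from the paper.
import pandas as pd

def representation_report(df: pd.DataFrame, attribute: str,
                          min_share: float = 0.05) -> pd.Series:
    """Return each group's share; warn if any group falls below min_share."""
    shares = df[attribute].value_counts(normalize=True)
    for group, share in shares.items():
        if share < min_share:
            print(f"warning: group {group!r} holds only {share:.1%} "
                  f"of rows for {attribute!r}")
    return shares

# e.g. representation_report(adult, "race") on the Adult dataset above
```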
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
- A Dataset for the Validation of Truth Inference Algorithms Suitable for Online Deployment [76.04306818209753]
We introduce a substantial crowdsourcing annotation dataset collected from a real-world crowdsourcing platform.
This dataset comprises approximately two thousand workers, one million tasks, and six million annotations.
We evaluate the effectiveness of several representative truth inference algorithms on this dataset.
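Truth inference aggregates noisy worker annotations into one estimated label per task. As a baseline illustration, the simplest such algorithm, majority voting, can be sketched as follows; the paper evaluates more sophisticated methods, which are not reproduced here.

```python
# Baseline truth inference: per-task majority vote over worker annotations.
# Illustrative only; the paper evaluates more sophisticated algorithms.
from collections import Counter

def majority_vote(annotations: list[tuple[str, str, str]]) -> dict[str, str]:
    """annotations: (worker_id, task_id, label) triples -> task -> label."""
    votes: dict[str, Counter] = {}
    for _worker, task, label in annotations:
        votes.setdefault(task, Counter())[label] += 1
    return {task: counter.most_common(1)[0][0]
            for task, counter in votes.items()}

print(majority_vote([("w1", "t1", "cat"), ("w2", "t1", "dog"),
                     ("w3", "t1", "cat")]))  # {'t1': 'cat'}
```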
arXiv Detail & Related papers (2024-03-10T16:00:41Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
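The rubric itself is not detailed in the abstract; a hypothetical scoring sketch over fairness, privacy, and regulatory criteria illustrates the general shape of such an evaluation.

```python
# Hypothetical responsible-dataset rubric: criteria and scoring are
# illustrative placeholders, not the framework proposed in the paper.
RUBRIC = {
    "fairness":   ["sensitive attributes documented", "balanced sampling"],
    "privacy":    ["anonymization applied", "consent obtained"],
    "regulatory": ["license stated", "collection complies with GDPR"],
}

def score_dataset(answers: dict[str, bool]) -> float:
    """Fraction of rubric items satisfied (unweighted, for illustration)."""
    items = [item for group in RUBRIC.values() for item in group]
    return sum(answers.get(item, False) for item in items) / len(items)

print(score_dataset({"license stated": True, "anonymization applied": True}))
```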
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection [8.12993269922936]
We argue that data collection for AI should be performed in a responsible manner.
We propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics.
arXiv Detail & Related papers (2023-08-22T18:01:27Z)
- A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
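The error rate reported above measures, in essence, how often an automated similarity score orders a pair of papers differently from a reviewer's self-reported expertise. A minimal sketch of that evaluation follows; the data layout is assumed, not taken from the paper.

```python
# Sketch: pairwise ordering error rate for similarity scores.
# Data layout ((reviewer, paper) -> score) is assumed for illustration.
def pairwise_error_rate(similarity: dict[tuple[str, str], float],
                        pairs: list[tuple[str, str, str]]) -> float:
    """pairs: (reviewer, paper_a, paper_b) where the reviewer's
    self-reported expertise ranks paper_a above paper_b."""
    errors = sum(similarity[(r, a)] <= similarity[(r, b)]
                 for r, a, b in pairs)
    return errors / len(pairs)

sim = {("r1", "p1"): 0.9, ("r1", "p2"): 0.4}
print(pairwise_error_rate(sim, [("r1", "p1", "p2")]))  # 0.0
```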
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
- Human-Centric Multimodal Machine Learning: Recent Advances and Testbed on AI-based Recruitment [66.91538273487379]
There is a certain consensus about the need to develop AI applications with a Human-Centric approach.
Human-Centric Machine Learning needs to be developed based on four main requirements: (i) utility and social good; (ii) privacy and data ownership; (iii) transparency and accountability; and (iv) fairness in AI-driven decision-making processes.
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
arXiv Detail & Related papers (2023-02-13T16:44:44Z)
- Demographic-Reliant Algorithmic Fairness: Characterizing the Risks of Demographic Data Collection in the Pursuit of Fairness [0.0]
We consider calls to collect more data on demographics to enable algorithmic fairness.
We show how these techniques largely ignore broader questions of data governance and systemic oppression.
arXiv Detail & Related papers (2022-04-18T04:50:09Z)