Retiring Adult: New Datasets for Fair Machine Learning
- URL: http://arxiv.org/abs/2108.04884v1
- Date: Tue, 10 Aug 2021 19:19:41 GMT
- Title: Retiring Adult: New Datasets for Fair Machine Learning
- Authors: Frances Ding, Moritz Hardt, John Miller, Ludwig Schmidt
- Abstract summary: UCI Adult has served as the basis for the development and comparison of many algorithmic fairness interventions.
We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity.
Our primary contribution is a suite of new datasets that extend the existing data ecosystem for research on fair machine learning.
- Score: 47.27417042497261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although the fairness community has recognized the importance of data,
researchers in the area primarily rely on UCI Adult when it comes to tabular
data. Derived from a 1994 US Census survey, this dataset has appeared in
hundreds of research papers where it served as the basis for the development
and comparison of many algorithmic fairness interventions. We reconstruct a
superset of the UCI Adult data from available US Census sources and reveal
idiosyncrasies of the UCI Adult dataset that limit its external validity. Our
primary contribution is a suite of new datasets derived from US Census surveys
that extend the existing data ecosystem for research on fair machine learning.
We create prediction tasks relating to income, employment, health,
transportation, and housing. The data span multiple years and all states of the
United States, allowing researchers to study temporal shift and geographic
variation. We highlight a broad initial sweep of new empirical insights
relating to trade-offs between fairness criteria, performance of algorithmic
interventions, and the role of distribution shift based on our new datasets.
Our findings inform ongoing debates, challenge some existing narratives, and
point to future research directions. Our datasets are available at
https://github.com/zykls/folktables.
Related papers
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Predicting Seriousness of Injury in a Traffic Accident: A New Imbalanced
Dataset and Benchmark [62.997667081978825]
The paper introduces a new dataset to assess the performance of machine learning algorithms in the prediction of the seriousness of injury in a traffic accident.
The dataset is created by aggregating publicly available datasets from the UK Department for Transport.
arXiv Detail & Related papers (2022-05-20T21:15:26Z) - Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset
Evaluation for Text Classification [39.01740345482624]
In this paper, we ask the research question of whether all the datasets in the benchmark are necessary.
Experiments on 9 datasets and 36 systems show that several existing benchmark datasets contribute little to discriminating top-scoring systems, while those less used datasets exhibit impressive discriminative power.
arXiv Detail & Related papers (2022-05-04T15:33:00Z) - Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning
Research [3.536605202672355]
We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015-2020.
We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions.
arXiv Detail & Related papers (2021-12-03T05:01:47Z) - On The State of Data In Computer Vision: Human Annotations Remain
Indispensable for Developing Deep Learning Models [0.0]
High-quality labeled datasets play a crucial role in fueling the development of machine learning (ML)
Since the emergence of the ImageNet dataset and the AlexNet model in 2012, the size of new open-source labeled vision datasets has remained roughly constant.
Only a minority of publications in the computer vision community tackle supervised learning on datasets that are orders of magnitude larger than Imagenet.
arXiv Detail & Related papers (2021-07-31T00:08:21Z) - Bringing the People Back In: Contesting Benchmark Machine Learning
Datasets [11.00769651520502]
We outline a research program - a genealogy of machine learning data - for investigating how and why these datasets have been created.
We describe the ways in which benchmark datasets in machine learning operate as infrastructure and pose four research questions for these datasets.
arXiv Detail & Related papers (2020-07-14T23:22:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.