Related papers: Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research

URL: http://arxiv.org/abs/2112.01716v1
Date: Fri, 3 Dec 2021 05:01:47 GMT
Title: Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
Authors: Bernard Koch, Emily Denton, Alex Hanna, Jacob G. Foster
Abstract summary: We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015-2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions.
Score: 3.536605202672355
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role of benchmarking practices in this field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse, within or across machine learning subcommunities. In this paper, we dig into these dynamics. We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015-2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity/access within the field.

Related papers

Object Recognition Datasets and Challenges: A Review [5.638005500131518]
We provide a detailed analysis of datasets in the highly investigated object recognition areas. We present an overview of the prominent object recognition benchmarks and competitions. All introduced datasets and challenges can be found online at.com/AbtinDjavadifar/ORDC.
arXiv Detail & Related papers (2025-07-30T03:56:37Z)
What Matters in Learning from Large-Scale Datasets for Robot Manipulation [12.703188997313223]
We conduct a large-scale dataset composition study to answer this question. We develop a data generation framework to procedurally emulate common sources of diversity in existing datasets. We find that camera poses and spatial arrangements are crucial dimensions for both diversity in collection and alignment in retrieval.
arXiv Detail & Related papers (2025-06-16T14:25:29Z)
Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions. To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter. Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
arXiv Detail & Related papers (2023-03-18T19:17:47Z)
On The Relevance Of The Differences Between HRTF Measurement Setups For Machine Learning [0.24366811507669117]
Spatial audio is enjoying a surge in popularity. Machine learning techniques that have been proven successful in other domains are increasingly used to process head-related transfer function measurements. It becomes attractive to combine multiple datasets, although they are measured under different conditions.
arXiv Detail & Related papers (2022-12-08T14:19:46Z)
DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We provide an open, online platform with multiple rounds of challenges to support this iterative development. The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature. We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets [122.85598648289789]
We study how multi-domain and multi-task datasets can improve the learning of new tasks in new environments. We also find that data for only a few tasks in a new domain can bridge the domain gap and make it possible for a robot to perform a variety of prior tasks that were only seen in other domains.
arXiv Detail & Related papers (2021-09-27T23:42:12Z)
Retiring Adult: New Datasets for Fair Machine Learning [47.27417042497261]
UCI Adult has served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity. Our primary contribution is a suite of new datasets that extend the existing data ecosystem for research on fair machine learning.
arXiv Detail & Related papers (2021-08-10T19:19:41Z)
Data and its (dis)contents: A survey of dataset development and use in machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning. We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z)
Bringing the People Back In: Contesting Benchmark Machine Learning Datasets [11.00769651520502]
We outline a research program - a genealogy of machine learning data - for investigating how and why these datasets have been created. We describe the ways in which benchmark datasets in machine learning operate as infrastructure and pose four research questions for these datasets.
arXiv Detail & Related papers (2020-07-14T23:22:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.