Related papers: A Critical Field Guide for Working with Machine Learning Datasets

URL: http://arxiv.org/abs/2501.15491v1
Date: Sun, 26 Jan 2025 11:43:33 GMT
Title: A Critical Field Guide for Working with Machine Learning Datasets
Authors: Sarah Ciston, Mike Ananny, Kate Crawford
Abstract summary: A Critical Field Guide for Working with Machine Learning Datasets suggests practical guidance for conscientious dataset stewardship. It offers questions, suggestions, strategies, and resources for working with existing machine learning datasets. Students, journalists, artists, researchers, and developers can be more capable of avoiding the problems unique to datasets.
Score: 0.716879432974126
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Machine learning datasets are powerful but unwieldy. Despite the fact that large datasets commonly contain problematic material--whether from a technical, legal, or ethical perspective--datasets are valuable resources when handled carefully and critically. A Critical Field Guide for Working with Machine Learning Datasets suggests practical guidance for conscientious dataset stewardship. It offers questions, suggestions, strategies, and resources for working with existing machine learning datasets at every phase of their lifecycle. It combines critical AI theories and applied data science concepts, explained in accessible language. Equipped with this understanding, students, journalists, artists, researchers, and developers can be more capable of avoiding the problems unique to datasets. They can also construct more reliable, robust solutions, or even explore new ways of thinking with machine learning datasets that are more critical and conscientious.

Related papers

Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation [117.54237701533805]
Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) demonstrate strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. We identify shortcut learning as a key impediment to generalization.
arXiv Detail & Related papers (2025-08-08T16:14:01Z)
Machine Learning Data Practices through a Data Curation Lens: An Evaluation Framework [1.5993707490601146]
We evaluate data practices in machine learning as data curation practices. We find that researchers in machine learning, which often emphasizes model development, struggle to apply standard data curation principles.
arXiv Detail & Related papers (2024-05-04T16:21:05Z)
AI Competitions and Benchmarks: Dataset Development [42.164845505628506]
This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience. We develop the tasks involved in dataset development and offer insights into their effective management. Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation.
arXiv Detail & Related papers (2024-04-15T12:01:42Z)
The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements. LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information. Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
A Vision for Semantically Enriched Data Science [19.604667287258724]
Key areas such as utilizing domain knowledge and data semantics are areas where we have seen little automation. We envision how leveraging "semantic" understanding and reasoning on data in combination with novel tools for data science automation can help with consistent and explainable data augmentation and transformation.
arXiv Detail & Related papers (2023-03-02T16:03:12Z)
Machine Learning for Synthetic Data Generation: A Review [23.073056971997715]
This paper reviews existing studies that employ machine learning models for the purpose of generating synthetic data. The review encompasses various perspectives, starting with the applications of synthetic data generation, spanning computer vision, speech, natural language processing, healthcare, and business domains. The paper also addresses the crucial aspects of privacy and fairness concerns related to synthetic data generation.
arXiv Detail & Related papers (2023-02-08T13:59:31Z)
DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We provide an open, online platform with multiple rounds of challenges to support this iterative development. The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
Understanding the World Through Action [91.3755431537592]
I will argue that a general, principled, and powerful framework for utilizing unlabeled data can be derived from reinforcement learning. I will discuss how such a procedure is more closely aligned with potential downstream tasks.
arXiv Detail & Related papers (2021-10-24T22:33:52Z)
REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps. Our dataset is collected in the form of both 2D images and 3D point clouds. Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
Data and its (dis)contents: A survey of dataset development and use in machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning. We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z)
An Ethical Highlighter for People-Centric Dataset Creation [62.886916477131486]
We propose an analytical framework to guide ethical evaluation of existing datasets and to serve future dataset creators in avoiding missteps. Our work is informed by a review and analysis of prior works and highlights where such ethical challenges arise.
arXiv Detail & Related papers (2020-11-27T07:18:44Z)
DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network. We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples. We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.