Designing Data: Proactive Data Collection and Iteration for Machine
Learning
- URL: http://arxiv.org/abs/2301.10319v2
- Date: Sat, 29 Jul 2023 02:40:16 GMT
- Title: Designing Data: Proactive Data Collection and Iteration for Machine
Learning
- Authors: Aspen Hopkins, Fred Hohman, Luca Zappella, Xavier Suau Cuadros and
Dominik Moritz
- Abstract summary: Lack of diversity in data collection has caused significant failures in machine learning (ML) applications.
New methods to track & manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real world variability.
- Score: 12.295169687537395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lack of diversity in data collection has caused significant failures in
machine learning (ML) applications. While ML developers perform post-collection
interventions, these are time intensive and rarely comprehensive. Thus, new
methods to track & manage data collection, iteration, and model training are
necessary for evaluating whether datasets reflect real world variability. We
present designing data, an iterative approach to data collection connecting HCI
concepts with ML techniques. Our process includes (1) Pre-Collection Planning,
to reflexively prompt and document expected data distributions; (2) Collection
Monitoring, to systematically encourage sampling diversity; and (3) Data
Familiarity, to identify samples that are unfamiliar to a model using density
estimation. We apply designing data to a data collection and modeling task. We
find models trained on "designed" datasets generalize better across
intersectional groups than those trained on similarly sized but less targeted
datasets, and that data familiarity is effective for debugging datasets.
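The abstract's third step, Data Familiarity, flags samples a model has rarely seen by scoring them under a density estimate fit to the training data. The paper does not specify its estimator here, so the following is only a minimal sketch using a 1-D Gaussian kernel density estimate over scalar features; the function names, the bandwidth, and the quantile threshold are all illustrative choices, not the authors' method.

```python
import math

def gaussian_kde_score(train, x, bandwidth=1.0):
    """Average Gaussian kernel density of x under the training samples."""
    n = len(train)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train
    )

def unfamiliar(train, candidates, bandwidth=1.0, quantile=0.1):
    """Flag candidates whose density falls below the given quantile
    of the training samples' own densities (a hypothetical threshold rule)."""
    train_scores = sorted(gaussian_kde_score(train, t, bandwidth) for t in train)
    threshold = train_scores[int(quantile * len(train_scores))]
    return [x for x in candidates
            if gaussian_kde_score(train, x, bandwidth) < threshold]
```

In practice one would score embeddings rather than raw scalars; the idea is the same: low density under the training distribution marks a sample as unfamiliar and worth inspecting.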
Related papers
- Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach [36.47860223750303]
We consider the problem of automatic curation of high-quality datasets for self-supervised pre-training.
We propose a clustering-based approach for building such datasets.
Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository.
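The successive application of k-means described above can be illustrated with a toy sketch: one round of Lloyd's algorithm on scalar features, followed by balanced resampling across the resulting clusters. Real curation pipelines run k-means hierarchically on high-dimensional embeddings; the 1-D features, iteration count, and sampling rule below are simplifying assumptions.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Lloyd's algorithm on scalar features (a toy stand-in for
    k-means on high-dimensional embeddings)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # recompute centers; keep the old center if a cluster emptied
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

def balanced_sample(clusters, per_cluster, seed=0):
    """Draw the same number of samples from every cluster to flatten
    an imbalanced data distribution."""
    rng = random.Random(seed)
    out = []
    for c in clusters:
        out.extend(rng.sample(c, min(per_cluster, len(c))))
    return out
```

Applying `kmeans_1d` again inside each returned cluster gives the hierarchical refinement the summary alludes to.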
arXiv Detail & Related papers (2024-05-24T14:58:51Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- On Inter-dataset Code Duplication and Data Leakage in Large Language Models [5.704848262917858]
This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating large language models (LLMs).
We identify the intersection between the pre-training and fine-tuning datasets using a deduplication process.
We fine-tune four models pre-trained on CSN to evaluate their performance on samples encountered during pre-training and those unseen during that phase.
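The deduplication step above amounts to intersecting the pre-training and fine-tuning corpora. The paper's exact matching procedure is not specified in this summary; the sketch below uses a hypothetical scheme that hashes lightly normalized source text with SHA-256, which catches exact and whitespace-only duplicates but not semantic clones.

```python
import hashlib

def normalize(code):
    """Crude normalization: strip per-line whitespace (an assumption;
    real pipelines may also strip comments or rename identifiers)."""
    return "\n".join(line.strip() for line in code.strip().splitlines())

def cross_dataset_duplicates(pretrain, finetune):
    """Return fine-tuning samples whose normalized form also appears
    in the pre-training corpus."""
    seen = {hashlib.sha256(normalize(c).encode()).hexdigest()
            for c in pretrain}
    return [c for c in finetune
            if hashlib.sha256(normalize(c).encode()).hexdigest() in seen]
```

Evaluating a model separately on the duplicated and non-duplicated partitions is what exposes leakage-inflated scores.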
arXiv Detail & Related papers (2024-01-15T19:46:40Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Optimizing Data Collection for Machine Learning [87.37252958806856]
Modern deep learning systems require huge data sets to achieve impressive performance.
Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay.
We propose a new paradigm for modeling the data collection as a formal optimal data collection problem.
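The over- versus under-collection trade-off described above can be framed as minimizing a cost function over candidate dataset sizes. The paper's actual formulation is not given in this summary; the sketch below assumes a power-law learning curve and a fixed penalty for missing an error target, with every parameter (`unit_cost`, `penalty`, `a`, `b`) being a hypothetical stand-in.

```python
def expected_cost(n, unit_cost, penalty, target_err, a=1.0, b=0.5):
    """Collection cost plus a penalty if the (assumed) power-law
    learning curve err(n) = a * n**-b misses the error target."""
    err = a * n ** -b
    return unit_cost * n + (penalty if err > target_err else 0.0)

def optimal_collection(sizes, **kw):
    """Pick the candidate dataset size with the lowest expected cost."""
    return min(sizes, key=lambda n: expected_cost(n, **kw))
```

Under these assumptions the optimum is the smallest size whose predicted error clears the target, which captures the "don't over-collect, don't under-collect" intuition.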
arXiv Detail & Related papers (2022-10-03T21:19:05Z)
- Adaptive Sampling Strategies to Construct Equitable Training Datasets [0.7036032466145111]
In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities.
One factor contributing to these performance gaps is a lack of representation in the data the models are trained on.
We formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem.
arXiv Detail & Related papers (2022-01-31T19:19:30Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- Learning to Count in the Crowd from Limited Labeled Data [109.2954525909007]
We focus on reducing the annotation efforts by learning to count in the crowd from limited number of labeled samples.
Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data.
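The iterative pseudo-ground-truth loop described above can be sketched in miniature. The paper uses a Gaussian Process to estimate labels for unlabeled data; the code below substitutes a 1-nearest-neighbor predictor on scalar features purely to show the loop's shape (label the most confident unlabeled point, fold it into the labeled set, repeat), so both the predictor and the confidence rule are stand-ins, not the authors' method.

```python
def nn_predict(labeled, x):
    """1-NN label lookup; labeled is a list of (value, label) pairs."""
    return min(labeled, key=lambda vl: abs(vl[0] - x))[1]

def pseudo_label_rounds(labeled, unlabeled, rounds=3):
    """Iteratively assign pseudo-labels, most-confident point first
    (here, 'confident' = closest to an already-labeled point)."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        x = min(pool, key=lambda u: min(abs(u - v) for v, _ in labeled))
        labeled.append((x, nn_predict(labeled, x)))
        pool.remove(x)
    return labeled
```

A Gaussian Process would additionally supply a predictive variance, giving a principled confidence measure instead of this distance heuristic.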
arXiv Detail & Related papers (2020-07-07T04:17:01Z)
- Overcoming Noisy and Irrelevant Data in Federated Learning [13.963024590508038]
Federated learning is an effective way of training a machine learning model in a distributed manner from local data collected by client devices.
We propose a method for distributedly selecting relevant data, where we use a benchmark model trained on a small benchmark dataset.
The effectiveness of our proposed approach is evaluated on multiple real-world image datasets in a simulated system with a large number of clients.
arXiv Detail & Related papers (2020-01-22T22:28:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.