Optimizing Data Collection for Machine Learning
- URL: http://arxiv.org/abs/2210.01234v1
- Date: Mon, 3 Oct 2022 21:19:05 GMT
- Title: Optimizing Data Collection for Machine Learning
- Authors: Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc T. Law
- Abstract summary: Modern deep learning systems require huge data sets to achieve impressive performance.
Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows.
We propose a new paradigm that models the data collection workflow as a formal optimal data collection problem.
- Score: 87.37252958806856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern deep learning systems require huge data sets to achieve impressive
performance, but there is little guidance on how much or what kind of data to
collect. Over-collecting data incurs unnecessary present costs, while
under-collecting may incur future costs and delay workflows. We propose a new
paradigm for modeling the data collection workflow as a formal optimal data
collection problem that allows designers to specify performance targets,
collection costs, a time horizon, and penalties for failing to meet the
targets. Additionally, this formulation generalizes to tasks requiring multiple
data sources, such as labeled and unlabeled data used in semi-supervised
learning. To solve our problem, we develop Learn-Optimize-Collect (LOC), which
minimizes expected future collection costs. Finally, we numerically compare our
framework to the conventional baseline of estimating data requirements by
extrapolating from neural scaling laws. We significantly reduce the risks of
failing to meet desired performance targets on several classification,
segmentation, and detection tasks, while maintaining low total collection
costs.
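The conventional baseline mentioned in the abstract, estimating data requirements by extrapolating from neural scaling laws, can be sketched roughly as follows. This is a minimal, hypothetical illustration: the saturating power-law form, the helper names `scaling_law` and `estimate_data_requirement`, and the example numbers are assumptions for the sketch, not details taken from the paper.
```python
# Hypothetical sketch of the scaling-law extrapolation baseline: fit a
# saturating power law to a few (dataset size, validation score) pairs
# observed on subsets of the current data, then extrapolate how much data
# is needed to reach a target score. Names and numbers are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    # Score rises with dataset size n and saturates toward c.
    return c - a * np.power(n, -b)

def estimate_data_requirement(sizes, scores, target_score):
    """Extrapolate the dataset size needed to reach `target_score`."""
    (a, b, c), _ = curve_fit(
        scaling_law, sizes, scores, p0=(1.0, 0.5, 1.0), maxfev=10000
    )
    if target_score >= c:
        return np.inf  # the fitted curve never reaches the target
    # Invert c - a * n**(-b) = target_score for n.
    return (a / (c - target_score)) ** (1.0 / b)

# Example: validation accuracies measured after training on growing subsets.
sizes = np.array([1_000, 2_000, 4_000, 8_000], dtype=float)
scores = np.array([0.61, 0.68, 0.74, 0.78])
print(estimate_data_requirement(sizes, scores, target_score=0.85))
```
A designer would compare the extrapolated requirement against the current dataset size and collect the difference. As the abstract notes, such point estimates can over- or under-shoot, incurring either unnecessary present costs or future costs and delays; the cost-and-penalty formulation solved by LOC is intended to control that risk.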
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- How Much Data are Enough? Investigating Dataset Requirements for Patch-Based Brain MRI Segmentation Tasks [74.21484375019334]
Training deep neural networks reliably requires access to large-scale datasets.
To mitigate both the time and financial costs associated with model development, a clear understanding of the amount of data required to train a satisfactory model is crucial.
This paper proposes a strategic framework for estimating the amount of annotated data required to train patch-based segmentation networks.
arXiv Detail & Related papers (2024-04-04T13:55:06Z)
- Building Manufacturing Deep Learning Models with Minimal and Imbalanced Training Data Using Domain Adaptation and Data Augmentation [15.333573151694576]
We propose a novel domain adaptation (DA) approach to address the problem of labeled training data scarcity for a target learning task.
Our approach works for scenarios where the source dataset and the dataset available for the target learning task have the same or different feature spaces.
We evaluate our combined approach using image data for wafer defect prediction.
arXiv Detail & Related papers (2023-05-31T21:45:34Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve performance on low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- Designing Data: Proactive Data Collection and Iteration for Machine Learning [12.295169687537395]
Lack of diversity in data collection has caused significant failures in machine learning (ML) applications.
New methods to track and manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real-world variability.
arXiv Detail & Related papers (2023-01-24T21:40:29Z)
- How Much More Data Do I Need? Estimating Requirements for Downstream Tasks [99.44608160188905]
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
arXiv Detail & Related papers (2022-07-04T21:16:05Z)
- Training Over-parameterized Models with Non-decomposable Objectives [46.62273918807789]
We propose new cost-sensitive losses that extend the classical idea of logit adjustment to handle more general cost matrices.
Our losses are calibrated, and can be further improved with distilled labels from a teacher model.
arXiv Detail & Related papers (2021-07-09T19:29:33Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)