How Much More Data Do I Need? Estimating Requirements for Downstream
Tasks
- URL: http://arxiv.org/abs/2207.01725v1
- Date: Mon, 4 Jul 2022 21:16:05 GMT
- Authors: Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion,
Jose M. Alvarez, Zhiding Yu, Sanja Fidler, Marc T. Law
- Abstract summary: Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
- Score: 99.44608160188905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given a small training data set and a learning algorithm, how much more data
is necessary to reach a target validation or test performance? This question is
of critical importance in applications such as autonomous driving or medical
imaging where collecting data is expensive and time-consuming. Overestimating
or underestimating data requirements incurs substantial costs that could be
avoided with an adequate budget. Prior work on neural scaling laws suggests that
a power-law function can fit the validation performance curve and be extrapolated
to larger data set sizes. We find that this does not immediately translate
to the more difficult downstream task of estimating the required data set size
to meet a target performance. In this work, we consider a broad class of
computer vision tasks and systematically investigate a family of functions that
generalize the power-law function to allow for better estimation of data
requirements. Finally, we show that incorporating a tuned correction factor and
collecting over multiple rounds significantly improves the performance of the
data estimators. Using our guidelines, practitioners can accurately estimate
data requirements of machine learning systems to gain savings in both
development time and data acquisition costs.
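To make this concrete, the following is a minimal sketch of the basic recipe the abstract describes, not the paper's exact estimator: fit a saturating power-law curve to (data set size, validation score) pairs measured on small training subsets, then invert the fit to predict the size needed for a target score. The functional form, helper names, and synthetic numbers are illustrative assumptions.

```python
# Minimal sketch: fit a saturating power law v(n) = c - a * n**(-b) to
# (data set size, validation score) pairs, then invert it to estimate the
# size needed for a target score. Illustrative only: the paper studies a
# family of such functions plus corrections, not this exact recipe.
import numpy as np
from scipy.optimize import curve_fit

def perf_curve(n, a, b, c):
    """Validation score approaches the ceiling c as the data set grows."""
    return c - a * n ** (-b)

# Hypothetical measurements from training on small subsets; in practice
# each point costs one training run.
sizes = np.array([1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)
scores = np.array([0.62, 0.70, 0.76, 0.81, 0.85])

(a, b, c), _ = curve_fit(perf_curve, sizes, scores, p0=[1.0, 0.5, 0.9],
                         maxfev=10_000)

target = 0.90
if target >= c:
    print(f"Target {target:.2f} exceeds the fitted ceiling c = {c:.3f}.")
else:
    # Solve c - a * n**(-b) = target for n.
    n_required = (a / (c - target)) ** (1.0 / b)
    print(f"Estimated requirement: ~{n_required:,.0f} examples")
```

The abstract's caveat that power-law extrapolation does not immediately yield good size estimates is why the paper pairs such fits with a tuned correction factor and with re-estimation over multiple collection rounds.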
Related papers
- How Much Data are Enough? Investigating Dataset Requirements for Patch-Based Brain MRI Segmentation Tasks [74.21484375019334]
Training deep neural networks reliably requires access to large-scale datasets.
To mitigate both the time and financial costs associated with model development, a clear understanding of the amount of data required to train a satisfactory model is crucial.
This paper proposes a strategic framework for estimating the amount of annotated data required to train patch-based segmentation networks.
arXiv Detail & Related papers (2024-04-04T13:55:06Z)
- Certain and Approximately Certain Models for Statistical Learning [4.318959672085627]
We show that it is possible to learn accurate models directly from data with missing values for certain training data and target models.
We build efficient algorithms with theoretical guarantees to check this necessity and return accurate models in cases where imputation is unnecessary.
arXiv Detail & Related papers (2024-02-27T22:49:33Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
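As a rough illustration of the gradient-similarity idea named in the LESS entry above (a sketch under simplifying assumptions, not the authors' implementation): score each candidate training example by the alignment between its low-rank, randomly projected gradient and the gradients of a few target-task examples, then keep the top 5%. The `grad_features` helper and the random "gradients" are hypothetical stand-ins for real per-example gradient extraction.

```python
# Minimal sketch of the gradient-similarity idea (not the authors'
# implementation): rank candidate instruction examples by how well their
# randomly projected (low-rank) gradients align with gradients from a few
# target-task examples, then keep the top 5%.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_proj = 2_000, 128    # full vs. projected gradient dimensions
projection = rng.standard_normal((d_model, d_proj)) / np.sqrt(d_proj)

def grad_features(raw_grads: np.ndarray) -> np.ndarray:
    """Project per-example gradients to low dimension and L2-normalize."""
    z = raw_grads @ projection
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Hypothetical stand-ins for real per-example gradients: 1,000 candidate
# training examples and 8 examples from the intended downstream task.
train_feats = grad_features(rng.standard_normal((1_000, d_model)))
target_feats = grad_features(rng.standard_normal((8, d_model)))

# Influence score: best cosine similarity to any target-task example.
scores = (train_feats @ target_feats.T).max(axis=1)
selected = np.argsort(scores)[-len(scores) // 20:]  # top 5% by score
print(f"Selected {len(selected)} of {len(scores)} candidates")
```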
- Machine Learning Force Fields with Data Cost Aware Training [94.78998399180519]
Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation.
Even for the most data-efficient MLFFs, reaching chemical accuracy can require hundreds of frames of force and energy labels.
We propose a multi-stage computational framework -- ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of cheap inaccurate data and expensive accurate data.
arXiv Detail & Related papers (2023-06-05T04:34:54Z)
- Optimizing Data Collection for Machine Learning [87.37252958806856]
Modern deep learning systems require huge data sets to achieve impressive performance.
Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay.
We propose a new paradigm that models data collection as a formal optimal data collection problem.
arXiv Detail & Related papers (2022-10-03T21:19:05Z)
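The entry above casts collection as a formal decision problem. Below is a minimal sketch of that framing under assumed costs, with a hypothetical distribution standing in for uncertainty about the true requirement; this is an illustration of the trade-off, not the paper's actual formulation.

```python
# Minimal sketch (an assumed framing, not the paper's formulation): choose a
# collection amount that balances the known cost of acquiring data against
# the expected penalty of still falling short of the true requirement.
import numpy as np

rng = np.random.default_rng(1)

cost_per_example = 0.05      # assumed acquisition cost per labeled example
shortfall_penalty = 5_000.0  # assumed future cost of missing the target

# Hypothetical belief over the true requirement, e.g. centered on a point
# estimate of 40k examples from a fitted data-requirement model.
required = rng.lognormal(mean=np.log(40_000), sigma=0.3, size=100_000)

candidates = np.arange(10_000, 100_001, 5_000)
expected_cost = [cost_per_example * n
                 + shortfall_penalty * float((required > n).mean())
                 for n in candidates]
best = candidates[int(np.argmin(expected_cost))]
print(f"Collect ~{best:,} examples "
      f"(expected cost {min(expected_cost):,.0f})")
```

The shape of the trade-off is the point: a higher shortfall penalty or wider uncertainty pushes the cost-minimizing amount above the point estimate of the requirement.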
- Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods [29.141145775835106]
Given a fixed FLOP budget, what are the best datasets, models, and (self-supervised) training methods for obtaining high accuracy on representative visual tasks?
We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and supervised).
Our results call into question the commonly-held assumption that self-supervised methods inherently scale to large, uncurated data.
arXiv Detail & Related papers (2022-09-30T17:04:55Z)
- Training from Zero: Radio Frequency Machine Learning Data Quantity Forecasting [0.0]
The data used during training in any given application space is directly tied to the performance of the system once deployed.
One of the underlying rules of thumb in the machine learning space is that more data leads to better models.
This work examines a modulation classification problem in the Radio Frequency domain.
arXiv Detail & Related papers (2022-05-07T18:45:06Z)
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.