Data Budgeting for Machine Learning
- URL: http://arxiv.org/abs/2210.00987v1
- Date: Mon, 3 Oct 2022 14:53:17 GMT
- Title: Data Budgeting for Machine Learning
- Authors: Xinyi Zhao, Weixin Liang and James Zou
- Abstract summary: We study the data budgeting problem and formulate it as two sub-problems.
We propose a learning method to solve data budgeting problems.
Our empirical evaluation shows that it is possible to perform data budgeting given a small pilot study dataset with as few as $50$ data points.
- Score: 17.524791147624086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data is the fuel powering AI and creates tremendous value for many domains.
However, collecting datasets for AI is a time-consuming, expensive, and
complicated endeavor. For practitioners, data investment remains to be a leap
of faith in practice. In this work, we study the data budgeting problem and
formulate it as two sub-problems: predicting (1) what is the saturating
performance if given enough data, and (2) how many data points are needed to
reach near the saturating performance. Different from traditional
dataset-independent methods like PowerLaw, we proposed a learning method to
solve data budgeting problems. To support and systematically evaluate the
learning-based method for data budgeting, we curate a large collection of 383
tabular ML datasets, along with their data vs performance curves. Our empirical
evaluation shows that it is possible to perform data budgeting given a small
pilot study dataset with as few as $50$ data points.
Related papers
- Compute-Constrained Data Selection [77.06528009072967]
We formalize the problem of data selection with a cost-aware utility function, and model the problem as trading off initial-selection cost for training gain.
We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute.
arXiv Detail & Related papers (2024-10-21T17:11:21Z) - Neural Dynamic Data Valuation [4.286118155737111]
We propose a novel data valuation method from the perspective of optimal control, named the neural dynamic data valuation (NDDV)
Our method has solid theoretical interpretations to accurately identify the data valuation via the sensitivity of the data optimal control state.
In addition, we implement a data re-weighting strategy to capture the unique features of data points, ensuring fairness through the interaction between data points and the mean-field states.
arXiv Detail & Related papers (2024-04-30T13:39:26Z) - Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z) - On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z) - Addressing Budget Allocation and Revenue Allocation in Data Market
Environments Using an Adaptive Sampling Algorithm [14.206050847214652]
We introduce a new algorithm to solve budget allocation and revenue allocation problems simultaneously in linear time.
The new algorithm employs an adaptive sampling process that selects data from those providers who are contributing the most to the model.
We provide theoretical guarantees for the algorithm that show the budget is used efficiently and the properties of revenue allocation are similar to Shapley's.
arXiv Detail & Related papers (2023-06-05T02:28:19Z) - LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
arXiv Detail & Related papers (2023-04-28T19:05:16Z) - Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value [17.340091573913316]
We propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate.
Data-OOB takes less than 2.25 hours on a single CPU processor when there are $106$ samples to evaluate and the input dimension is 100.
We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points.
arXiv Detail & Related papers (2023-04-16T08:03:58Z) - Optimizing Data Collection for Machine Learning [87.37252958806856]
Modern deep learning systems require huge data sets to achieve impressive performance.
Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay.
We propose a new paradigm for modeling the data collection as a formal optimal data collection problem.
arXiv Detail & Related papers (2022-10-03T21:19:05Z) - How Much More Data Do I Need? Estimating Requirements for Downstream
Tasks [99.44608160188905]
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
arXiv Detail & Related papers (2022-07-04T21:16:05Z) - Data Collection and Quality Challenges in Deep Learning: A Data-Centric
AI Perspective [16.480530590466472]
Data-centric AI practices are now becoming mainstream.
Many datasets in the real world are small, dirty, biased, and even poisoned.
For data quality, we study data validation and data cleaning techniques.
arXiv Detail & Related papers (2021-12-13T03:57:36Z) - How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.