Time and the Value of Data
- URL: http://arxiv.org/abs/2203.09118v1
- Date: Thu, 17 Mar 2022 06:53:46 GMT
- Title: Time and the Value of Data
- Authors: Ehsan Valavi, Joel Hestness, Newsha Ardalani, Marco Iansiti
- Abstract summary: Managers often believe that collecting more data will continually improve the accuracy of their machine learning models.
We argue that when data lose relevance over time, it may be optimal to collect a limited amount of recent data instead of keeping around an infinite supply of older (less relevant) data.
- Score: 0.3010893618491329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Managers often believe that collecting more data will continually improve the
accuracy of their machine learning models. However, we argue in this paper that
when data lose relevance over time, it may be optimal to collect a limited
amount of recent data instead of keeping around an infinite supply of older
(less relevant) data. In addition, we argue that increasing the stock of data
by including older datasets may, in fact, damage the model's accuracy.
As expected, the model's accuracy improves as the flow of data (defined as the
data collection rate) increases; however, this requires other tradeoffs in
terms of refreshing or retraining machine learning models more frequently.
Using these results, we investigate how the business value created by machine
learning models scales with data and when the stock of data establishes a
sustainable competitive advantage. We argue that data's time-dependency weakens
the barrier to entry that the stock of data creates. As a result, a competing
firm equipped with a limited (yet sufficient) amount of recent data can develop
more accurate models. This result, coupled with the fact that older datasets
may deteriorate a model's accuracy, suggests that the business value created
does not scale with the stock of available data unless the firm offloads less
relevant data from its data repository. Consequently, a firm's growth policy should
incorporate a balance between the stock of historical data and the flow of new
data.
We complement our theoretical results with an experiment in which we
empirically measure the loss in accuracy of a next-word prediction model
trained on datasets from various time periods. Our empirical measurements
confirm the economic significance of the value decline over time. For example,
100MB of text data, after seven years, becomes as valuable as 50MB of current
data for the next word prediction task.
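The seven-year, 50% figure implies a rough half-life for the value of text
data. As a minimal sketch (assuming, purely for illustration, that value
decays exponentially with that half-life), the "current-data equivalent" of a
firm's data stock can be computed from its collection history:

```python
# Effective size of a data stock when value decays exponentially.
# The seven-year half-life matches the paper's empirical finding that
# 100MB of seven-year-old text is worth about 50MB of current data;
# the exponential form itself is an illustrative assumption.

HALF_LIFE_YEARS = 7.0

def effective_size(vintages):
    """vintages: iterable of (size_mb, age_years) pairs in the stock."""
    return sum(size * 0.5 ** (age / HALF_LIFE_YEARS) for size, age in vintages)

# A firm collecting 100MB/year for 10 years holds 1000MB of raw data...
stock = [(100.0, age) for age in range(10)]
print(f"raw stock:       {sum(s for s, _ in stock):.0f} MB")
# ...but far less in current-data equivalents (about 667MB here).
print(f"effective stock: {effective_size(stock):.0f} MB")
```

Under these illustrative assumptions the effective stock plateaus at roughly
flow x half-life / ln 2 (about ten years' worth of flow), which is the sense
in which value scales with the flow of data rather than the stock.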
Related papers
- Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality [67.67387254989018]
We study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication.
We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude.
arXiv Detail & Related papers (2025-03-10T21:51:17Z) - DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.
Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them via model retraining.
Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
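A minimal sketch of this idea with assumed details: the subset featurization
(a binary membership vector), the kernel, and the utility numbers below are
illustrative placeholders, not the paper's actual choices.

```python
# Predict data-subset utilities with GP regression instead of retraining.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
n_points = 12  # size of the full dataset being valued

def featurize(subset):
    """Binary membership vector over the n_points examples (illustrative)."""
    x = np.zeros(n_points)
    x[list(subset)] = 1.0
    return x

# Utilities measured the expensive way (model retraining) for a few subsets;
# the utility numbers are made-up placeholders.
evaluated = {frozenset(rng.choice(n_points, size=k, replace=False)): u
             for k, u in [(3, 0.61), (6, 0.72), (9, 0.80), (12, 0.83)]}
X = np.array([featurize(s) for s in evaluated])
y = np.array(list(evaluated.values()))

gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0),
                              normalize_y=True).fit(X, y)

# Predict the utility of an unseen subset instead of retraining on it.
query = frozenset(range(5))
mean, std = gp.predict(featurize(query)[None, :], return_std=True)
print(f"predicted utility: {mean[0]:.3f} +/- {std[0]:.3f}")
```

The GP's predictive variance is a natural bonus here: subsets the model is
unsure about could be routed back to actual retraining.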
arXiv Detail & Related papers (2025-02-22T08:53:39Z) - Towards Data-Efficient Pretraining for Atomic Property Prediction [51.660835328611626]
We show that pretraining on a task-relevant dataset can match or surpass large-scale pretraining.
We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fréchet Inception Distance.
arXiv Detail & Related papers (2025-02-16T11:46:23Z) - Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World [19.266191284270793]
Generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models.
Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data.
We report experiments on three ways of using data (training workflows) across three generative-model task settings.
arXiv Detail & Related papers (2024-10-22T05:49:24Z) - The Data Addition Dilemma [4.869513274920574]
In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources.
But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings?
We identify this situation as the Data Addition Dilemma, demonstrating that adding training data in this multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes, and reduced worst-subgroup performance.
arXiv Detail & Related papers (2024-08-08T01:42:31Z) - F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data [65.6499834212641]
We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm.
By considering domain similarities through task-specific metadata, our model improves generalization, with excess risk decreasing as the number of training tasks increases.
Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.
arXiv Detail & Related papers (2024-06-23T21:28:50Z) - TimeGPT in Load Forecasting: A Large Time Series Model Perspective [38.92798207166188]
Machine learning models have made significant progress in load forecasting, but their forecast accuracy is limited in cases where historical load data is scarce.
This paper aims to discuss the potential of large time series models in load forecasting with scarce historical data.
arXiv Detail & Related papers (2024-04-07T09:05:09Z) - Quilt: Robust Data Segment Selection against Concept Drifts [30.62320149405819]
Continuous machine learning pipelines are common in industrial settings where models are periodically trained on data streams.
Concept drifts may occur in data streams when the joint distribution of the data X and label y, P(X, y), changes over time, possibly degrading model accuracy.
Existing concept drift adaptation approaches mostly focus on updating the model to the new data and tend to discard the drifted historical data.
We propose Quilt, a data-centric framework for identifying and selecting data segments that maximize model accuracy.
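As a toy illustration of the data-centric framing (a naive greedy stand-in,
not Quilt's actual selection algorithm): keep a historical segment only when
adding it helps accuracy on recent validation data.

```python
# Greedy segment selection against a recent validation set (illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def select_segments(segments, X_val, y_val):
    """segments: list of (X, y) arrays, oldest first. Returns kept indices."""
    kept, best_acc = [], 0.0
    for i in range(len(segments)):
        trial = kept + [i]
        X = np.vstack([segments[j][0] for j in trial])
        y = np.concatenate([segments[j][1] for j in trial])
        if len(np.unique(y)) < 2:   # need both classes before we can fit
            kept = trial
            continue
        model = LogisticRegression(max_iter=1000).fit(X, y)
        acc = accuracy_score(y_val, model.predict(X_val))
        if acc >= best_acc:         # keep segments that do not hurt accuracy
            kept, best_acc = trial, acc
    return kept
```

Each candidate set here costs a full retraining; doing this selection
efficiently over streaming segments is the harder problem the paper targets.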
arXiv Detail & Related papers (2023-12-15T11:10:34Z) - Scaling Laws Do Not Scale [54.72120385955072]
Recent work has argued that as the size of a dataset increases, the performance of a model trained on that dataset will increase.
We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output.
Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations.
arXiv Detail & Related papers (2023-07-05T15:32:21Z) - Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
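The fitted law treats repeated tokens as contributing exponentially
diminishing amounts of "effective data". A sketch of that functional form
(the saturation constant r_star below is a placeholder, not the paper's
fitted value):

```python
# Effective data under repetition: repeats beyond the first pass saturate.
import math

def effective_data(unique_tokens, epochs, r_star=15.0):
    """D' = U + U * r_star * (1 - exp(-R / r_star)), with R = epochs - 1
    repetitions of U unique tokens; r_star is an assumed constant."""
    repeats = epochs - 1
    return unique_tokens * (1 + r_star * (1 - math.exp(-repeats / r_star)))

for ep in (1, 2, 4, 16, 64):
    print(f"{ep:>3} epochs -> {effective_data(100, ep):7.1f}B effective "
          f"tokens (raw: {100 * ep}B)")
```

With this shape a fourth epoch is still worth most of its raw token count
while a sixty-fourth is worth almost nothing, consistent with the finding
that up to 4 epochs of repetition behave nearly like unique data.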
arXiv Detail & Related papers (2023-05-25T17:18:55Z) - An Investigation of Smart Contract for Collaborative Machine Learning
Model Training [3.5679973993372642]
Collaborative machine learning (CML) has penetrated various fields in the era of big data.
As training ML models requires a massive amount of good-quality data, it is necessary to address concerns about data privacy.
Based on blockchain, smart contracts enable the automatic execution of data preservation and validation.
arXiv Detail & Related papers (2022-09-12T04:25:01Z) - How Much More Data Do I Need? Estimating Requirements for Downstream
Tasks [99.44608160188905]
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
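A minimal sketch of the learning-curve extrapolation underlying such
estimates (the paper's estimators add corrections on top of naive fits like
this one; all numbers are illustrative): fit a saturating power law to a few
pilot runs, then invert it for the target accuracy.

```python
# Fit acc(n) = a - b * n^(-c) to pilot runs and solve for the target n.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a - b * n ** (-c)   # accuracy saturates at the ceiling `a`

# Illustrative pilot measurements: (train size, validation accuracy).
sizes = np.array([1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)
accs = np.array([0.62, 0.68, 0.73, 0.77, 0.80])

(a, b, c), _ = curve_fit(power_law, sizes, accs, p0=(0.9, 5.0, 0.5),
                         maxfev=10_000)

target = 0.85
if target < a:
    n_needed = (b / (a - target)) ** (1 / c)   # invert the fitted curve
    print(f"estimated examples for {target:.0%} accuracy: {n_needed:,.0f}")
else:
    print(f"target {target:.0%} exceeds the fitted ceiling {a:.0%}")
```

Small errors in the fitted exponent translate into large errors in the
estimate, which is exactly where the over- and under-estimation costs enter.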
arXiv Detail & Related papers (2022-07-04T21:16:05Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
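For reference, a minimal two-parameter-logistic (2PL) sketch of the kind of
IRT model involved (a plain gradient-ascent fit; the paper's methodology is
more careful):

```python
# 2PL item response theory: P(model i gets item j right)
#   = sigmoid(a_j * (theta_i - b_j)),
# with model ability theta_i, item difficulty b_j, discrimination a_j.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_2pl(correct, iters=2000, lr=0.05):
    """correct: (n_models, n_items) 0/1 matrix of per-example outcomes."""
    n_models, n_items = correct.shape
    theta = np.zeros(n_models)   # model abilities
    b = np.zeros(n_items)        # item difficulties
    a = np.ones(n_items)         # item discriminations
    for _ in range(iters):
        p = sigmoid(a * (theta[:, None] - b))
        err = correct - p        # drives gradient ascent on log-likelihood
        theta += lr * (err * a).sum(axis=1) / n_items
        b -= lr * (err * a).sum(axis=0) / n_models
        a += lr * (err * (theta[:, None] - b)).sum(axis=0) / n_models
    return theta, a, b
```

Items with large fitted discrimination a_j separate strong from weak models
most sharply, which is the property used to compare test sets.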
arXiv Detail & Related papers (2021-06-01T22:33:53Z) - Data Appraisal Without Data Sharing [28.41079503636652]
We develop methods that do not require data sharing by using secure multi-party computation.
Our experiments show that influence functions provide an appealing trade-off between high-quality appraisal and required computation.
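A sketch of the influence-function score in the clear, for a logistic model
(the paper's contribution is computing such appraisals under secure
multi-party computation, which this plain-NumPy version does not attempt):

```python
# Appraise candidate points by their estimated effect on validation loss:
#   score(z) = -grad_val^T H^{-1} grad_loss(z)  (influence of upweighting z).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def appraise(w, X_train, X_val, y_val, X_cand, y_cand, damping=1e-3):
    """w: trained logistic-regression weights. Returns one score per
    candidate; negative scores predict the point lowers validation loss."""
    n, d = X_train.shape
    p = sigmoid(X_train @ w)
    # Hessian of the mean training loss (labels cancel out) plus damping.
    H = (X_train * (p * (1 - p))[:, None]).T @ X_train / n + damping * np.eye(d)
    # Mean gradient of the validation loss.
    g_val = X_val.T @ (sigmoid(X_val @ w) - y_val) / len(y_val)
    # Per-candidate loss gradients, scored against H^{-1} g_val.
    g_cand = (sigmoid(X_cand @ w) - y_cand)[:, None] * X_cand
    return -g_cand @ np.linalg.solve(H, g_val)
```

More negative means more valuable; running this inside MPC is what lets the
buyer price the seller's points without either side revealing its data.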
arXiv Detail & Related papers (2020-12-11T15:45:19Z) - DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a
Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related-domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.