On Anytime Learning at Macroscale
- URL: http://arxiv.org/abs/2106.09563v1
- Date: Thu, 17 Jun 2021 14:45:22 GMT
- Title: On Anytime Learning at Macroscale
- Authors: Lucas Caccia, Jing Xu, Myle Ott, Marc'Aurelio Ranzato, Ludovic Denoyer
- Abstract summary: In many practical applications, data does not arrive all at once, but in batches over time.
A greedy predictor could produce non-trivial predictions by immediately training on batches as soon as they become available, but it may also make sub-optimal use of future data.
A tardy predictor could wait for a long time to aggregate several batches into a larger dataset, but ultimately deliver much better performance.
- Score: 33.674452784463774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classical machine learning frameworks assume access to a possibly large
dataset in order to train a predictive model. In many practical applications
however, data does not arrive all at once, but in batches over time. This
creates a natural trade-off between accuracy of a model and time to obtain such
a model. A greedy predictor could produce non-trivial predictions by immediately
training on batches as soon as they become available, but it may also make
sub-optimal use of future data. On the other hand, a tardy predictor could wait
for a long time to aggregate several batches into a larger dataset, but
ultimately deliver much better performance. In this work, we consider
such a streaming learning setting, which we dub {\em anytime learning at
macroscale} (ALMA). It is an instance of anytime learning applied not at the
level of a single chunk of data, but at the level of the entire sequence of
large batches. We first formalize this learning setting, we then introduce
metrics to assess how well learners perform on the given task for a given
memory and compute budget, and finally we test several baseline approaches on
standard benchmarks repurposed for anytime learning at macroscale. The general
finding is that bigger models always generalize better. In particular, it is
important to grow model capacity over time if the initial model is relatively
small. Moreover, updating the model at an intermediate rate strikes the best
trade-off between accuracy and time to obtain a useful predictor.
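To make the greedy-versus-tardy trade-off concrete, here is a minimal Python sketch of an ALMA-style evaluation loop. It is an illustration under stated assumptions, not the paper's code: the mega-batch stream, the `update_every` knob, and the scikit-learn classifier are hypothetical stand-ins for the growing neural models and metrics studied in the paper. Setting `update_every=1` corresponds to the greedy learner, a value as large as the stream to the tardy learner, and intermediate values to the intermediate update rate the paper finds works best.

```python
# Hypothetical sketch of an ALMA-style streaming evaluation loop (not the
# authors' code). Mega-batches arrive over time; the learner updates only
# every `update_every` arrivals and is evaluated after every arrival, so the
# resulting curve reflects both accuracy and time-to-useful-predictor.
from typing import Iterable, List, Tuple

import numpy as np
from sklearn.linear_model import SGDClassifier


def alma_run(stream: Iterable[Tuple[np.ndarray, np.ndarray]],
             test_x: np.ndarray,
             test_y: np.ndarray,
             update_every: int = 1) -> List[float]:
    """Return the test error measured after each mega-batch arrival."""
    model = SGDClassifier()
    classes = np.unique(test_y)
    buffer_x, buffer_y, errors = [], [], []
    for t, (x, y) in enumerate(stream, start=1):
        buffer_x.append(x)
        buffer_y.append(y)
        if t % update_every == 0:
            # Greedy (update_every=1): train immediately on each mega-batch.
            # Tardy (large update_every): wait and train on a bigger dataset.
            model.partial_fit(np.concatenate(buffer_x),
                              np.concatenate(buffer_y),
                              classes=classes)
            buffer_x, buffer_y = [], []
        # Evaluate at every arrival, trained or not: anytime learning scores
        # the whole error-versus-time curve, not only the final model.
        if hasattr(model, "coef_"):
            errors.append(1.0 - model.score(test_x, test_y))
        else:
            errors.append(1.0)  # no update yet: worst-case placeholder error
    return errors
```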
Related papers
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z) - Few-Shot Load Forecasting Under Data Scarcity in Smart Grids: A Meta-Learning Approach [0.18641315013048293]
This paper proposes adapting an established model-agnostic meta-learning algorithm for short-term load forecasting.
The proposed method can rapidly adapt and generalize within any unknown load time series of arbitrary length.
The proposed model is evaluated using a dataset of historical load consumption data from real-world consumers.
arXiv Detail & Related papers (2024-06-09T18:59:08Z) - Contrastive Difference Predictive Coding [79.74052624853303]
We introduce a temporal difference version of contrastive predictive coding that stitches together pieces of different time series data to decrease the amount of data required to learn predictions of future events.
We apply this representation learning method to derive an off-policy algorithm for goal-conditioned RL.
arXiv Detail & Related papers (2023-10-31T03:16:32Z) - Pushing the Limits of Pre-training for Time Series Forecasting in the
CloudOps Domain [54.67888148566323]
We introduce three large-scale time series forecasting datasets from the cloud operations domain.
We show that the pre-trained method is a strong zero-shot baseline and that it benefits from further scaling, both in model and dataset size.
Accompanying these datasets and results is a suite of comprehensive benchmark results comparing classical and deep learning baselines to our pre-trained method.
arXiv Detail & Related papers (2023-10-08T08:09:51Z) - Geometry-Aware Adaptation for Pretrained Models [15.715395029966812]
We propose a drop-in replacement of the standard prediction rule, swapping argmax with the Fréchet mean (a sketch of this prediction rule appears after this list).
Our proposed approach, Loki, gains up to 29.7% relative improvement over SimCLR on ImageNet.
When no metric over the label space is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models.
arXiv Detail & Related papers (2023-07-23T04:48:41Z) - Instance-Conditional Timescales of Decay for Non-Stationary Learning [11.90763787610444]
Slow concept drift is a ubiquitous, yet under-studied problem in machine learning systems.
We propose an optimization-driven approach towards balancing instance importance over large training windows.
Experiments on a large real-world dataset of 39M photos over a 9 year period show up to 15% relative gains in accuracy.
arXiv Detail & Related papers (2022-12-12T14:16:26Z) - Synthetic Model Combination: An Instance-wise Approach to Unsupervised
Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
We are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z) - Multi-Objective Model Selection for Time Series Forecasting [9.473440847947492]
We present a benchmark, evaluating 7 classical and 6 deep learning forecasting methods on 44 datasets.
We leverage the benchmark evaluations to learn good defaults that consider multiple objectives such as accuracy and latency.
By learning a mapping from forecasting models to performance metrics, we show that our method PARETOSELECT is able to accurately select models.
arXiv Detail & Related papers (2022-02-17T07:40:15Z) - Model-based micro-data reinforcement learning: what are the crucial
model properties and which model to choose? [0.2836066255205732]
We contribute to micro-data model-based reinforcement learning (MBRL) by rigorously comparing popular generative models.
We find that on an environment that requires multimodal posterior predictives, mixture density nets outperform all other models by a large margin.
We also find that deterministic models are on par; in fact, they consistently (although not significantly) outperform their probabilistic counterparts.
arXiv Detail & Related papers (2021-07-24T11:38:25Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model under test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - Evaluating Prediction-Time Batch Normalization for Robustness under
Covariate Shift [81.74795324629712]
The method, which we call prediction-time batch normalization, significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
arXiv Detail & Related papers (2020-06-19T05:08:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.