How Important is the Train-Validation Split in Meta-Learning?
- URL: http://arxiv.org/abs/2010.05843v2
- Date: Tue, 9 Feb 2021 21:07:48 GMT
- Title: How Important is the Train-Validation Split in Meta-Learning?
- Authors: Yu Bai, Minshuo Chen, Pan Zhou, Tuo Zhao, Jason D. Lee, Sham Kakade,
Huan Wang, Caiming Xiong
- Abstract summary: A common practice in meta-learning is to perform a train-validation split (\emph{train-val method}) where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split.
Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice.
We show that the train-train method can indeed outperform the train-val method, on both simulations and real meta-learning tasks.
- Score: 155.5088631672781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Meta-learning aims to perform fast adaptation on a new task through learning
a "prior" from multiple existing tasks. A common practice in meta-learning is
to perform a train-validation split (\emph{train-val method}) where the prior
adapts to the task on one split of the data, and the resulting predictor is
evaluated on another split. Despite its prevalence, the importance of the
train-validation split is not well understood either in theory or in practice,
particularly in comparison to the more direct \emph{train-train method}, which
uses all the per-task data for both training and evaluation.
We provide a detailed theoretical study on whether and when the
train-validation split is helpful in the linear centroid meta-learning problem.
In the agnostic case, we show that the expected loss of the train-val method is
minimized at the optimal prior for meta testing, and this is not the case for
the train-train method in general without structural assumptions on the data.
In contrast, in the realizable case where the data are generated from linear
models, we show that both the train-val and train-train losses are minimized at
the optimal prior in expectation. Further, perhaps surprisingly, our main
result shows that the train-train method achieves a \emph{strictly better}
excess loss in this realizable case, even when the regularization parameter and
split ratio are optimally tuned for both methods. Our results highlight that
sample splitting may not always be preferable, especially when the data is
realizable by the model. We validate our theories by experimentally showing
that the train-train method can indeed outperform the train-val method, on both
simulations and real meta-learning tasks.
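To make the two objectives concrete, below is a minimal numpy sketch (not the authors' code) of the train-val and train-train meta-objectives for a fixed prior, assuming a linear-centroid setup in which each task adapts its weights by ridge regression biased toward the prior w0; the helper names, the regularization value, and the toy task distribution are illustrative assumptions.

```python
# Sketch of the two per-task meta-objectives, assuming ridge adaptation toward a prior w0.
import numpy as np

def ridge_adapt(w0, X, y, lam):
    """Adapt the prior w0 on task data (X, y) via ridge regression biased toward w0."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n + lam * w0
    return np.linalg.solve(A, b)

def sq_loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

def train_val_loss(w0, X, y, lam, n_train):
    """Train-val: adapt on one split of the task data, evaluate on the held-out split."""
    w_hat = ridge_adapt(w0, X[:n_train], y[:n_train], lam)
    return sq_loss(w_hat, X[n_train:], y[n_train:])

def train_train_loss(w0, X, y, lam):
    """Train-train: adapt on all per-task data and evaluate on the same data."""
    w_hat = ridge_adapt(w0, X, y, lam)
    return sq_loss(w_hat, X, y)

# Toy realizable tasks: w_t ~ N(w*, I), y = X w_t + noise.
rng = np.random.default_rng(0)
d, n, n_tasks, lam = 5, 20, 200, 0.5
w_star = rng.normal(size=d)
w0 = np.zeros(d)
tv, tt = 0.0, 0.0
for _ in range(n_tasks):
    w_t = w_star + rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w_t + 0.1 * rng.normal(size=n)
    tv += train_val_loss(w0, X, y, lam, n_train=n // 2)
    tt += train_train_loss(w0, X, y, lam)
print("avg train-val loss:", tv / n_tasks, "avg train-train loss:", tt / n_tasks)
```

In actual meta-training one would minimize the average of these per-task losses over w0, tuning the regularization parameter and the split ratio, which is the comparison the theory addresses.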
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Boosting Meta-Training with Base Class Information for Few-Shot Learning [35.144099160883606]
We propose an end-to-end training paradigm consisting of two alternating loops.
In the outer loop, we calculate cross entropy loss on the entire training set while updating only the final linear layer.
This training paradigm not only converges quickly but also outperforms existing baselines, indicating that information from the overall training set and the meta-learning training paradigm could mutually reinforce one another.
arXiv Detail & Related papers (2024-03-06T05:13:23Z)
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) that does not require prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
- Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
- Intersection of Parallels as an Early Stopping Criterion [64.8387564654474]
We propose a method to spot an early stopping point in the training iterations without the need for a validation set.
For a wide range of learning rates, our method, called Cosine-Distance Criterion (CDC), leads to better generalization on average than all the methods that we compare against.
arXiv Detail & Related papers (2022-08-19T19:42:41Z)
- Learning from Data with Noisy Labels Using Temporal Self-Ensemble [11.245833546360386]
Deep neural networks (DNNs) have an enormous capacity to memorize noisy labels.
Current state-of-the-art methods present a co-training scheme that trains dual networks using samples associated with small losses.
We propose a simple yet effective robust training scheme that operates by training only a single network.
arXiv Detail & Related papers (2022-07-21T08:16:31Z)
- You Only Need End-to-End Training for Long-Tailed Recognition [8.789819609485225]
Cross-entropy loss tends to produce highly correlated features on imbalanced data.
We propose two novel modules, Block-based Relatively Balanced Batch Sampler (B3RS) and Batch Embedded Training (BET).
Experimental results on the long-tailed classification benchmarks, CIFAR-LT and ImageNet-LT, demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2021-12-11T11:44:09Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can achieve final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- A Representation Learning Perspective on the Importance of Train-Validation Splitting in Meta-Learning [14.720411598827365]
This work studies the practice of splitting data from each task into train and validation sets during meta-training.
We argue that the train-validation split encourages the learned representation to be low-rank without compromising on expressivity.
Since sample efficiency benefits from low-rankness, the splitting strategy will require very few samples to solve unseen test tasks.
arXiv Detail & Related papers (2021-06-29T17:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.