Intersection of Parallels as an Early Stopping Criterion
- URL: http://arxiv.org/abs/2208.09529v1
- Date: Fri, 19 Aug 2022 19:42:41 GMT
- Title: Intersection of Parallels as an Early Stopping Criterion
- Authors: Ali Vardasbi, Maarten de Rijke, Mostafa Dehghani
- Abstract summary: We propose a method to spot an early stopping point in the training iterations without the need for a validation set.
For a wide range of learning rates, our method, called Cosine-Distance Criterion (CDC), leads to better generalization on average than all the methods that we compare against.
- Score: 64.8387564654474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common way to avoid overfitting in supervised learning is early stopping,
where a held-out set is used for iterative evaluation during training to find a
sweet spot in the number of training steps that gives maximum generalization.
However, such a method requires a disjoint validation set, so part of the
labeled data is usually left out of the training set for this purpose, which
is not ideal when training data is scarce. Furthermore, when the training
labels are noisy, the performance of the model over a validation set may not be
an accurate proxy for generalization. In this paper, we propose a method to
spot an early stopping point in the training iterations without the need for a
validation set. We first show that in the overparameterized regime the randomly
initialized weights of a linear model converge to the same direction during
training. Using this result, we propose to train two parallel instances of a
linear model, initialized with different random seeds, and use their
intersection as a signal to detect overfitting. In order to detect
intersection, we use the cosine distance between the weights of the parallel
models during training iterations. Noticing that the final layer of a NN is a
linear map of the penultimate-layer activations to the output logits, we build on our
criterion for linear models and propose an extension to multi-layer networks,
using the new notion of counterfactual weights. We conduct experiments in two
areas where early stopping has a noticeable impact on preventing overfitting of a
NN: (i) learning from noisy labels; and (ii) learning to rank in IR. Our
experiments on four widely used datasets confirm the effectiveness of our
method for generalization. For a wide range of learning rates, our method,
called Cosine-Distance Criterion (CDC), leads to better generalization on
average than all the methods that we compare against in almost all of the
tested cases.
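To make the linear-model criterion concrete, here is a minimal Python sketch (indented below) that trains two parallel linear models differing only in their random initialization and tracks the cosine distance between their weight vectors. The synthetic data, squared loss, plain gradient descent, learning rate, and the rule of stopping once the cosine distance has stopped shrinking are illustrative assumptions, not the paper's exact setup.

    # Minimal sketch of the Cosine-Distance Criterion (CDC) idea for a linear model.
    # Assumptions (not taken from the paper): squared loss, plain gradient descent,
    # synthetic overparameterized data with noisy targets, and "intersection" read
    # as the step at which the cosine distance between the two models stops shrinking.
    import numpy as np

    def cosine_distance(u, v):
        # 1 - cosine similarity between two weight vectors
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    rng = np.random.default_rng(0)
    n, d = 50, 200                                          # overparameterized: d >> n
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)   # noisy targets

    # Two parallel instances of the same linear model, different random seeds.
    w_a = np.random.default_rng(1).normal(size=d)
    w_b = np.random.default_rng(2).normal(size=d)

    lr, tol = 1e-3, 1e-5
    prev_dist, stop_step = np.inf, None
    for step in range(1, 10001):
        for w in (w_a, w_b):                                # identical updates, different inits
            grad = X.T @ (X @ w - y) / n                    # gradient of 0.5 * MSE
            w -= lr * grad                                  # in-place gradient step
        dist = cosine_distance(w_a, w_b)
        if prev_dist - dist < tol:                          # distance has stopped shrinking:
            stop_step = step                                # treat as the "intersection"
            break
        prev_dist = dist

    print("suggested early-stopping step:", stop_step)

Because both instances see identical data and identical updates, the only difference between them is the random seed, so the cosine distance measures how far each model still carries its initialization; the intuition from the abstract is that once the two weight directions have met, continued training is more likely to be overfitting.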
Related papers
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple
Logits Retargeting Approach [102.0769560460338]
We develop a simple Logits Retargeting approach (LORT) that does not require prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z) - Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data [42.870635753205185]
Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors.
We show that both of these phenomena (benign overfitting and grokking) provably occur in two-layer ReLU networks trained by GD on XOR cluster data.
At a later training step, the network achieves near-optimal test accuracy while still fitting the random labels in the training data, exhibiting a "grokking" phenomenon.
arXiv Detail & Related papers (2023-10-04T02:50:34Z) - RanPAC: Random Projections and Pre-trained Models for Continual Learning [59.07316955610658]
Continual learning (CL) aims to learn different tasks (such as classification) in a non-stationary data stream without forgetting old ones.
We propose a concise and effective approach for CL with pre-trained models.
arXiv Detail & Related papers (2023-07-05T12:49:02Z) - Learning from Data with Noisy Labels Using Temporal Self-Ensemble [11.245833546360386]
Deep neural networks (DNNs) have an enormous capacity to memorize noisy labels.
Current state-of-the-art methods present a co-training scheme that trains dual networks using samples associated with small losses.
We propose a simple yet effective robust training scheme that operates by training only a single network.
arXiv Detail & Related papers (2022-07-21T08:16:31Z) - Effective and Efficient Training for Sequential Recommendation using
Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective.
We show that models enhanced with our method can achieve performance exceeding or very close to that of state-of-the-art BERT4Rec.
arXiv Detail & Related papers (2022-07-06T13:06:31Z) - Out-of-Scope Intent Detection with Self-Supervision and Discriminative
Training [20.242645823965145]
Out-of-scope intent detection is of practical importance in task-oriented dialogue systems.
We propose a method to train an out-of-scope intent classifier in a fully end-to-end manner by simulating the test scenario in training.
We evaluate our method extensively on four benchmark dialogue datasets and observe significant improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2021-06-16T08:17:18Z) - How Important is the Train-Validation Split in Meta-Learning? [155.5088631672781]
A common practice in meta-learning is to perform a train-validation split (the train-val method) where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split.
Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice.
We show that the train-train method, which adapts on all of the data without holding out a validation split, can indeed outperform the train-val method on both simulations and real meta-learning tasks.
arXiv Detail & Related papers (2020-10-12T16:48:42Z) - Training Sparse Neural Networks using Compressed Sensing [13.84396596420605]
We develop and test a novel method based on compressed sensing which combines the pruning and training into a single step.
Specifically, we utilize an adaptively weighted $\ell_1$ penalty on the weights during training, which we combine with a generalization of the regularized dual averaging (RDA) algorithm in order to train sparse neural networks.
arXiv Detail & Related papers (2020-08-21T19:35:54Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense
Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.