Disparity Between Batches as a Signal for Early Stopping
- URL: http://arxiv.org/abs/2107.06665v1
- Date: Wed, 14 Jul 2021 12:59:01 GMT
- Title: Disparity Between Batches as a Signal for Early Stopping
- Authors: Mahsa Forouzesh and Patrick Thiran
- Abstract summary: We propose a metric for evaluating the generalization ability of deep neural networks trained with mini-batch gradient descent.
Our metric, called gradient disparity, is the $\ell_2$ norm distance between the gradient vectors of two mini-batches drawn from the training set.
- Score: 7.614628596146601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a metric for evaluating the generalization ability of deep neural
networks trained with mini-batch gradient descent. Our metric, called gradient
disparity, is the $\ell_2$ norm distance between the gradient vectors of two
mini-batches drawn from the training set. It is derived from a probabilistic
upper bound on the difference between the classification errors over a given
mini-batch, when the network is trained on this mini-batch and when the network
is trained on another mini-batch of points sampled from the same dataset. We
empirically show that gradient disparity is a very promising early-stopping
criterion (i) when data is limited, as it uses all the samples for training and
(ii) when available data has noisy labels, as it signals overfitting better
than the validation data. Furthermore, we show in a wide range of experimental
settings that gradient disparity is strongly related to the generalization
error between the training and test sets, and that it is also very informative
about the level of label noise.
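To make the metric concrete, below is a minimal, self-contained sketch (assuming PyTorch) of computing gradient disparity between two training mini-batches and using it as an early-stopping signal. The toy model, synthetic data, and patience rule are illustrative assumptions, not the paper's exact experimental protocol; only the metric itself, the $\ell_2$ distance between the two mini-batch gradient vectors, follows the abstract.

```python
# Minimal sketch (assumed PyTorch setup): gradient disparity as an early-stopping signal.
# The model, data, batch size, and patience threshold are placeholders for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny synthetic classification problem (placeholder for real data).
X = torch.randn(512, 20)
y = torch.randint(0, 2, (512,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def flat_grad(loss):
    """Gradient of `loss` w.r.t. all model parameters, flattened into one vector."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])


def gradient_disparity(batch1, batch2):
    """l2 distance between the gradient vectors of two mini-batches."""
    (x1, y1), (x2, y2) = batch1, batch2
    g1 = flat_grad(criterion(model(x1), y1))
    g2 = flat_grad(criterion(model(x2), y2))
    return torch.norm(g1 - g2).item()


best, patience, bad = float("inf"), 3, 0
for epoch in range(100):
    # One epoch of mini-batch SGD on all training samples (no validation split needed).
    perm = torch.randperm(len(X))
    for i in range(0, len(X), 64):
        idx = perm[i:i + 64]
        optimizer.zero_grad()
        criterion(model(X[idx]), y[idx]).backward()
        optimizer.step()

    # Gradient disparity between two freshly drawn training mini-batches.
    i1, i2 = torch.randperm(len(X))[:64], torch.randperm(len(X))[:64]
    gd = gradient_disparity((X[i1], y[i1]), (X[i2], y[i2]))

    # Hypothetical early-stopping rule: stop once the disparity stops improving.
    if gd < best:
        best, bad = gd, 0
    else:
        bad += 1
        if bad >= patience:
            break
```

Because the signal is computed on training batches only, no held-out validation split is required, which is the setting the abstract highlights for limited or label-noisy data.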
Related papers
- Unifying Token and Span Level Supervisions for Few-Shot Sequence Labeling [18.24907067631541]
Few-shot sequence labeling aims to identify novel classes based on only a few labeled samples.
We propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot sequence labeling.
Our model achieves new state-of-the-art results on three benchmark datasets.
arXiv Detail & Related papers (2023-07-16T04:50:52Z)
- FewSOME: One-Class Few Shot Anomaly Detection with Siamese Networks [0.5735035463793008]
'Few Shot anOMaly detection' (FewSOME) is a deep One-Class Anomaly Detection algorithm with the ability to accurately detect anomalies.
FewSOME is aided by pretrained weights with an architecture based on Siamese Networks.
Our experiments demonstrate FewSOME performs at state-of-the-art level on benchmark datasets.
arXiv Detail & Related papers (2023-01-17T15:32:34Z)
- Learning Compact Features via In-Training Representation Alignment [19.273120635948363]
In each epoch, the true gradient of the loss function is estimated using a mini-batch sampled from the training set.
We propose In-Training Representation Alignment (ITRA) that explicitly aligns feature distributions of two different mini-batches with a matching loss.
We also provide a rigorous analysis of the desirable effects of the matching loss on feature representation learning.
arXiv Detail & Related papers (2022-11-23T22:23:22Z)
- ScoreMix: A Scalable Augmentation Strategy for Training GANs with Limited Data [93.06336507035486]
Generative Adversarial Networks (GANs) typically suffer from overfitting when limited training data is available.
We present ScoreMix, a novel and scalable data augmentation approach for various image synthesis tasks.
arXiv Detail & Related papers (2022-10-27T02:55:15Z)
- Learning from Data with Noisy Labels Using Temporal Self-Ensemble [11.245833546360386]
Deep neural networks (DNNs) have an enormous capacity to memorize noisy labels.
Current state-of-the-art methods present a co-training scheme that trains dual networks using samples associated with small losses.
We propose a simple yet effective robust training scheme that operates by training only a single network.
arXiv Detail & Related papers (2022-07-21T08:16:31Z)
- BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning [93.38239238988719]
We propose to enable deep neural networks with the ability to learn the sample relationships from each mini-batch.
BatchFormer is applied into the batch dimension of each mini-batch to implicitly explore sample relationships during training.
We perform extensive experiments on over ten datasets and the proposed method achieves significant improvements on different data scarcity applications.
arXiv Detail & Related papers (2022-03-03T05:31:33Z)
- An analysis of over-sampling labeled data in semi-supervised learning with FixMatch [66.34968300128631]
Most semi-supervised learning methods over-sample labeled data when constructing training mini-batches.
This paper studies whether this common practice improves learning and how.
We compare it to an alternative setting where each mini-batch is uniformly sampled from all the training data, labeled or not.
arXiv Detail & Related papers (2022-01-03T12:22:26Z)
- When does gradient descent with logistic loss find interpolating two-layer networks? [51.1848572349154]
We show that gradient descent drives the training loss to zero if the initial loss is small enough.
When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
arXiv Detail & Related papers (2020-12-04T05:16:51Z)
- A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of average mini-batch gradient with stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
arXiv Detail & Related papers (2020-07-09T03:23:10Z)
- DivideMix: Learning with Noisy Labels as Semi-supervised Learning [111.03364864022261]
We propose DivideMix, a framework for learning with noisy labels.
Experiments on multiple benchmark datasets demonstrate substantial improvements over state-of-the-art methods.
arXiv Detail & Related papers (2020-02-18T06:20:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.