Training trajectories, mini-batch losses and the curious role of the
learning rate
- URL: http://arxiv.org/abs/2301.02312v1
- Date: Thu, 5 Jan 2023 21:58:46 GMT
- Title: Training trajectories, mini-batch losses and the curious role of the
learning rate
- Authors: Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Nolan Miller
- Abstract summary: Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning.
We propose a simple model and a geometric interpretation that allow us to analyze the relationship between the gradients of mini-batches and the full batch.
In particular, a very low loss value can be reached in just one step of gradient descent with a large enough learning rate.
- Score: 13.848916053916618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic gradient descent plays a fundamental role in nearly all
applications of deep learning. However, its efficiency and remarkable ability to
converge to a global minimum remain shrouded in mystery. The loss function
defined on a large network with a large amount of data is known to be non-convex.
However, relatively little has been explored about the behavior of the loss
function on individual batches. Remarkably, we show that for ResNet the loss
for any fixed mini-batch, when measured along the SGD trajectory, appears to be
accurately modeled by a quadratic function. In particular, a very low loss
value can be reached in just one step of gradient descent with a large enough
learning rate. We propose a simple model and a geometric interpretation that
allow us to analyze the relationship between the gradients of stochastic
mini-batches and the full batch, and how the learning rate affects the
relationship between improvement on individual batches and on the full batch.
Our analysis allows us to discover an equivalence between iterate aggregates and
specific learning rate schedules. In particular, for Exponential Moving Average (EMA)
and Stochastic Weight Averaging we show that our proposed model matches the
observed training trajectories on ImageNet. Our theoretical model predicts that
an even simpler averaging technique, averaging just two points a few steps
apart, also significantly improves accuracy compared to the baseline. We
validated our findings on ImageNet and other datasets using the ResNet
architecture.
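As a concrete illustration of the abstract's claims, the toy sketch below models each mini-batch loss as a quadratic with a shared isotropic Hessian and its own minimum scattered around the full-batch minimum. Under that assumption, a single gradient step with learning rate 1/lam lands exactly on the chosen mini-batch's minimum, and averaging two SGD iterates a few steps apart lies closer to the full-batch minimum than the final iterate alone. All constants (dim, n_batches, lam, sigma, eta, gap) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_batches = 50, 200          # illustrative sizes, not from the paper
lam, sigma = 1.0, 1.0             # shared curvature; scatter of mini-batch minima

# Per-mini-batch minima c_i scattered around the full-batch minimum.
# Centering puts the full-batch minimum at the origin.
C = rng.normal(0.0, sigma, size=(n_batches, dim))
C -= C.mean(axis=0)

def batch_loss(w, i):
    return 0.5 * lam * np.sum((w - C[i]) ** 2)

def batch_grad(w, i):
    return lam * (w - C[i])

def full_loss(w):
    return np.mean([batch_loss(w, i) for i in range(n_batches)])

# (1) With learning rate 1/lam, one gradient step lands exactly on the chosen
#     mini-batch's minimum, so that mini-batch loss drops to (numerically) zero.
w0 = rng.normal(size=dim)
w1 = w0 - (1.0 / lam) * batch_grad(w0, 0)
print("mini-batch loss after one step:", batch_loss(w1, 0))

# (2) Constant-rate SGD bounces among the mini-batch minima; averaging two
#     iterates a few steps apart sits closer to the full-batch minimum.
eta, gap = 0.7 / lam, 5
w = rng.normal(size=dim)
iterates = []
for _ in range(500):
    i = rng.integers(n_batches)
    w = w - eta * batch_grad(w, i)
    iterates.append(w.copy())

w_last = iterates[-1]
w_avg = 0.5 * (iterates[-1] + iterates[-1 - gap])
print("full-batch loss at last iterate  :", full_loss(w_last))
print("full-batch loss at 2-point average:", full_loss(w_avg))
```

With these illustrative settings the first print is numerically zero, and the two-point average should show a visibly lower full-batch loss than the final iterate, mirroring the averaging result described in the abstract.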
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions via our training procedure, including the gradient-based optimizer and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z) - Enhancing Consistency and Mitigating Bias: A Data Replay Approach for
Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods proposes to replay the data of previously seen tasks when learning new ones.
However, this is often impractical because of memory constraints or data privacy concerns.
As an alternative, data-free replay methods synthesize samples by inverting the classification model.
arXiv Detail & Related papers (2024-01-12T12:51:12Z) - SuSana Distancia is all you need: Enforcing class separability in metric
learning via two novel distance-based loss functions for few-shot image
classification [0.9236074230806579]
We propose two loss functions that account for the importance of the embedding vectors by looking at the intra-class and inter-class distances among the few available samples.
Our results show a significant improvement in accuracy on the miniImageNet benchmark, outperforming other metric-based few-shot learning methods by a margin of 2%.
arXiv Detail & Related papers (2023-05-15T23:12:09Z) - Joint Edge-Model Sparse Learning is Provably Efficient for Graph Neural
Networks [89.28881869440433]
This paper provides the first theoretical characterization of joint edge-model sparse learning for graph neural networks (GNNs).
It proves analytically that both sampling important nodes and pruning the lowest-magnitude neurons can reduce the sample complexity and improve convergence without compromising test accuracy.
arXiv Detail & Related papers (2023-02-06T16:54:20Z) - Learning Compact Features via In-Training Representation Alignment [19.273120635948363]
In each iteration of SGD, the true gradient of the loss function is estimated using a mini-batch sampled from the training set.
We propose In-Training Representation Alignment (ITRA) that explicitly aligns feature distributions of two different mini-batches with a matching loss.
We also provide a rigorous analysis of the desirable effects of the matching loss on feature representation learning.
arXiv Detail & Related papers (2022-11-23T22:23:22Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost (a minimal sketch of this per-sample weighting appears after this list).
arXiv Detail & Related papers (2020-12-13T03:41:52Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - The Impact of the Mini-batch Size on the Variance of Gradients in
Stochastic Gradient Descent [28.148743710421932]
The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models.
We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks.
arXiv Detail & Related papers (2020-04-27T20:06:11Z) - Improve SGD Training via Aligning Mini-batches [22.58823484394866]
In-Training Distribution Matching (ITDM) is proposed to improve deep neural network (DNN) training and reduce overfitting.
Specifically, ITDM regularizes the feature extractor by matching the moments of the feature distributions of different mini-batches in each iteration of SGD (a minimal moment-matching sketch appears after this list).
arXiv Detail & Related papers (2020-02-23T15:10:59Z)
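The Attentional-Biased Stochastic Gradient Descent (ABSGD) entry above describes assigning an individual importance weight to each sample in the mini-batch before the momentum-SGD update. The sketch below is a hedged illustration of that general idea; the softmax-of-per-sample-losses weighting, the temperature tau, and the helper name weighted_sgd_step are assumptions for illustration, not necessarily the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def weighted_sgd_step(model, optimizer, x, y, tau=1.0):
    """One optimizer step where each sample's loss is re-weighted within the mini-batch."""
    optimizer.zero_grad()
    logits = model(x)
    per_sample = F.cross_entropy(logits, y, reduction="none")   # shape [batch]
    # Harder (higher-loss) samples receive larger weights; weights sum to 1.
    weights = torch.softmax(per_sample.detach() / tau, dim=0)
    loss = (weights * per_sample).sum()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (model and optimizer are assumptions):
#   optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
#   weighted_sgd_step(model, optimizer, x_batch, y_batch, tau=1.0)
```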
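The ITDM and In-Training Representation Alignment (ITRA) entries above both describe regularizing the feature extractor by matching the feature distributions of different mini-batches. The sketch below matches only the first two moments (per-dimension mean and variance); this simplification, the weight alpha, and the helper names are illustrative assumptions rather than the papers' exact matching losses.

```python
import torch
import torch.nn.functional as F

def moment_matching_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Penalize gaps between the per-dimension mean and variance of two feature batches."""
    mean_gap = (feat_a.mean(dim=0) - feat_b.mean(dim=0)).pow(2).sum()
    var_gap = (feat_a.var(dim=0) - feat_b.var(dim=0)).pow(2).sum()
    return mean_gap + var_gap

def train_step(encoder, head, optimizer, x1, y1, x2, y2, alpha=0.1):
    """One training step on two mini-batches, adding the matching term to the task loss."""
    optimizer.zero_grad()
    f1, f2 = encoder(x1), encoder(x2)
    task = F.cross_entropy(head(f1), y1) + F.cross_entropy(head(f2), y2)
    loss = task + alpha * moment_matching_loss(f1, f2)
    loss.backward()
    optimizer.step()
    return loss.item()
```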