The Impact of the Mini-batch Size on the Variance of Gradients in
Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2004.13146v1
- Date: Mon, 27 Apr 2020 20:06:11 GMT
- Title: The Impact of the Mini-batch Size on the Variance of Gradients in
Stochastic Gradient Descent
- Authors: Xin Qian, Diego Klabjan
- Abstract summary: The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models.
We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks.
- Score: 28.148743710421932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The mini-batch stochastic gradient descent (SGD) algorithm is widely used in
training machine learning models, in particular deep learning models. We study
SGD dynamics under linear regression and two-layer linear networks, with an
easy extension to deeper linear networks, by focusing on the variance of the
gradients, which is the first study of this nature. In the linear regression
case, we show that in each iteration the norm of the gradient is a decreasing
function of the mini-batch size $b$ and thus the variance of the stochastic
gradient estimator is a decreasing function of $b$. For deep neural networks
with $L_2$ loss we show that the variance of the gradient is a polynomial in
$1/b$. The results support the intuition, common among researchers, that smaller batch sizes yield lower loss function values. The
proof techniques exhibit a relationship between stochastic gradient estimators
and initial weights, which is useful for further research on the dynamics of
SGD. We empirically provide further insights into our results on various datasets
and commonly used deep network structures.
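The monotone-in-$b$ behavior of the gradient variance is easy to probe numerically. The sketch below is a minimal illustration, not code from the paper: it assumes synthetic Gaussian data, a fixed weight vector, and mini-batches sampled with replacement, and estimates the total variance (trace of the covariance) of the mini-batch gradient for linear regression at several batch sizes.

    # Minimal illustrative sketch (assumptions: synthetic Gaussian data, a fixed
    # weight vector, sampling with replacement -- not the paper's exact setting).
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 2000, 20                                  # sample count, feature dimension
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.5 * rng.normal(size=n)
    w = rng.normal(size=d)                           # fixed (e.g., initial) weights

    def minibatch_grad(b):
        """Gradient of the mean squared error on a random mini-batch of size b."""
        idx = rng.choice(n, size=b, replace=True)
        Xb, yb = X[idx], y[idx]
        return 2.0 / b * Xb.T @ (Xb @ w - yb)

    for b in [1, 4, 16, 64, 256]:
        grads = np.stack([minibatch_grad(b) for _ in range(2000)])
        total_var = grads.var(axis=0).sum()          # trace of the gradient covariance
        print(f"b={b:4d}  total gradient variance ~ {total_var:.4f}")

Under with-replacement sampling the printed total variance shrinks roughly like $1/b$; the paper analyzes the without-replacement estimator, where the variance is likewise a decreasing function of $b$.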
Related papers
- Discrete error dynamics of mini-batch gradient descent for least squares regression [4.159762735751163]
We study the dynamics of mini-batch gradient descent for least squares regression when sampling without replacement.
We also study discretization effects that a continuous-time gradient flow analysis cannot detect, and show that mini-batch gradient descent converges to a step-size-dependent solution.
arXiv Detail & Related papers (2024-06-06T02:26:14Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Training trajectories, mini-batch losses and the curious role of the
learning rate [13.848916053916618]
Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning.
We propose a simple model and a geometric interpretation that allow us to analyze the relationship between the gradients of mini-batches and the full-batch gradient.
In particular, a very low loss value can be reached in just one step of gradient descent with a large enough learning rate.
arXiv Detail & Related papers (2023-01-05T21:58:46Z) - Learning Compact Features via In-Training Representation Alignment [19.273120635948363]
In each iteration, the true gradient of the loss function is estimated using a mini-batch sampled from the training set.
We propose In-Training Representation Alignment (ITRA), which explicitly aligns the feature distributions of two different mini-batches with a matching loss; an illustrative sketch of one possible matching loss appears after this list.
We also provide a rigorous analysis of the desirable effects of the matching loss on feature representation learning.
arXiv Detail & Related papers (2022-11-23T22:23:22Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - R2-AD2: Detecting Anomalies by Analysing the Raw Gradient [0.6299766708197883]
We propose a novel semi-supervised anomaly detection method called R2-AD2.
By analysing the temporal distribution of the gradient over multiple training steps, we reliably detect point anomalies.
R2-AD2 works in a purely data-driven way, thus is readily applicable in a variety of important use cases of anomaly detection.
arXiv Detail & Related papers (2022-06-21T11:13:33Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Large Scale Private Learning via Low-rank Reparametrization [77.38947817228656]
We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks.
We are the first to apply differential privacy to the BERT model, achieving an average accuracy of $83.9\%$ on four downstream tasks.
arXiv Detail & Related papers (2021-06-17T10:14:43Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
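Regarding the In-Training Representation Alignment entry above: a matching loss that aligns the feature distributions of two mini-batches can, for instance, be instantiated as a kernel maximum mean discrepancy (MMD). The sketch below is an illustrative assumption of one such loss, not necessarily the specific loss used in that paper.

    # Illustrative sketch: an MMD-style matching loss between the features of two
    # mini-batches. This is one plausible instance of a "matching loss"; the exact
    # loss used by ITRA may differ.
    import numpy as np

    def rbf_kernel(A, B, sigma=1.0):
        """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    def mmd_matching_loss(feat_a, feat_b, sigma=1.0):
        """Biased estimate of the squared MMD between two sets of features."""
        k_aa = rbf_kernel(feat_a, feat_a, sigma).mean()
        k_bb = rbf_kernel(feat_b, feat_b, sigma).mean()
        k_ab = rbf_kernel(feat_a, feat_b, sigma).mean()
        return k_aa + k_bb - 2.0 * k_ab

    # Example: features of two mini-batches produced by the same network layer.
    rng = np.random.default_rng(1)
    feat_a = rng.normal(size=(32, 64))               # 32 samples, 64-dim features
    feat_b = rng.normal(loc=0.3, size=(32, 64))      # slightly shifted distribution
    print("matching loss:", mmd_matching_loss(feat_a, feat_b))
    # In training, this term would be added with a weight to the task loss.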
This list is automatically generated from the titles and abstracts of the papers on this site.