Discrete error dynamics of mini-batch gradient descent for least squares regression
- URL: http://arxiv.org/abs/2406.03696v1
- Date: Thu, 6 Jun 2024 02:26:14 GMT
- Title: Discrete error dynamics of mini-batch gradient descent for least squares regression
- Authors: Jackie Lok, Rishi Sonthalia, Elizaveta Rebrova,
- Abstract summary: We study the discrete dynamics of mini-batch gradient descent for least squares regression when sampling without replacement.
We also study discretization effects that a continuous-time gradient flow analysis cannot detect, and show that mini-batch gradient descent converges to a step-size dependent solution.
- Score: 4.159762735751163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the discrete dynamics of mini-batch gradient descent for least squares regression when sampling without replacement. We show that the dynamics and generalization error of mini-batch gradient descent depends on a sample cross-covariance matrix $Z$ between the original features $X$ and a set of new features $\widetilde{X}$, in which each feature is modified by the mini-batches that appear before it during the learning process in an averaged way. Using this representation, we rigorously establish that the dynamics of mini-batch and full-batch gradient descent agree up to leading order with respect to the step size using the linear scaling rule. We also study discretization effects that a continuous-time gradient flow analysis cannot detect, and show that mini-batch gradient descent converges to a step-size dependent solution, in contrast with full-batch gradient descent. Finally, we investigate the effects of batching, assuming a random matrix model, by using tools from free probability theory to numerically compute the spectrum of $Z$.
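As a rough illustration of the setup in the abstract (not the authors' code), the sketch below runs mini-batch gradient descent for least squares with sampling without replacement alongside full-batch gradient descent; the problem sizes, step sizes, and the fixed batching order are assumptions made for the example.

```python
# Minimal sketch (not the authors' code): mini-batch gradient descent for least
# squares with sampling without replacement, compared against full-batch GD.
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 120, 30, 20                        # samples, features, mini-batch size
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least-squares solution
perm = rng.permutation(n)                    # fixed pass order, reused every epoch

def full_batch_gd(eta, epochs=8000):
    w = np.zeros(d)
    for _ in range(epochs):
        w -= eta * X.T @ (X @ w - y) / n
    return w

def mini_batch_gd(eta, epochs=8000):
    # Sampling without replacement: the fixed permutation is partitioned into
    # mini-batches, so each epoch applies the same composition of affine maps.
    w = np.zeros(d)
    for _ in range(epochs):
        for s in range(0, n, b):
            idx = perm[s:s + b]
            w -= eta * X[idx].T @ (X[idx] @ w - y[idx]) / b
    return w

for eta in (0.02, 0.01):
    gap_fb = np.linalg.norm(full_batch_gd(eta) - w_ls)
    gap_mb = np.linalg.norm(mini_batch_gd(eta) - w_ls)
    print(f"eta={eta}: ||w_full - w_ls|| = {gap_fb:.2e}, "
          f"||w_mini - w_ls|| = {gap_mb:.2e}")
```

For any sufficiently small step size the full-batch iterates approach the least-squares solution, whereas the fixed point of the mini-batch epoch map in general depends on the step size, which is the discretization effect highlighted in the abstract.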
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks may deviate from its initialization by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
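For reference, here is a minimal sketch of the simultaneous gradient descent-ascent algorithm named in (i), applied to a toy two-dimensional smooth minimax objective; the objective, step size, and iteration count are illustrative and unrelated to the paper's mean-field, infinite-dimensional setting.

```python
# Toy sketch of simultaneous gradient descent-ascent (GDA) on
# f(x, y) = x*y + 0.1*x**2 - 0.1*y**2: descend in x, ascend in y.
eta = 0.05
x, y = 1.0, 1.0
for _ in range(2000):
    gx = y + 0.2 * x                  # df/dx
    gy = x - 0.2 * y                  # df/dy
    x, y = x - eta * gx, y + eta * gy
print(x, y)                           # approaches the saddle point (0, 0)
```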
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Aiming towards the minimizers: fast convergence of SGD for overparametrized problems [25.077446336619378]
We propose a regularity regime which endows the stochastic gradient method with the same worst-case complexity as the full-batch gradient method.
In contrast, existing guarantees require the stochastic gradient method to take small steps, resulting in a much slower linear rate of convergence.
We demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.
arXiv Detail & Related papers (2023-06-05T05:21:01Z) - Learning Compact Features via In-Training Representation Alignment [19.273120635948363]
In each iteration, the true gradient of the loss function is estimated using a mini-batch sampled from the training set.
We propose In-Training Representation Alignment (ITRA), which explicitly aligns the feature distributions of two different mini-batches with a matching loss.
We also provide a rigorous analysis of the desirable effects of the matching loss on feature representation learning.
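A toy sketch of one possible matching loss between the features of two mini-batches, here an RBF-kernel maximum mean discrepancy (MMD); the kernel choice, bandwidth, and array shapes are assumptions for illustration and may differ from the loss used in ITRA.

```python
# Toy sketch (assumptions, not the ITRA implementation): compare the feature
# distributions of two mini-batches with an RBF-kernel MMD matching loss.
import numpy as np

def rbf(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd_matching_loss(feat_a, feat_b, sigma=1.0):
    # Biased estimate of MMD^2 between the two mini-batches of features.
    return (rbf(feat_a, feat_a, sigma).mean()
            + rbf(feat_b, feat_b, sigma).mean()
            - 2 * rbf(feat_a, feat_b, sigma).mean())

rng = np.random.default_rng(0)
feat_a = rng.standard_normal((32, 16))        # features of mini-batch A
feat_b = rng.standard_normal((32, 16)) + 0.5  # features of mini-batch B (shifted)
print(mmd_matching_loss(feat_a, feat_b))      # would be added to the task loss
```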
arXiv Detail & Related papers (2022-11-23T22:23:22Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
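A rough sketch, under simplifying assumptions, of the ingredients named above: a Frank-Wolfe (conditional gradient) update whose direction is computed from a stochastic average gradient (SAG) table refreshed with a single sample per iteration. The least-squares objective, the l1-ball constraint, and the step-size rule are illustrative choices, not the paper's algorithm.

```python
# Sketch (illustrative, not the paper's method): Frank-Wolfe / conditional
# gradient over an l1-ball, with the search direction given by a stochastic
# average gradient (SAG) estimator that touches one sample per iteration.
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 200, 50, 5.0                        # samples, dimension, l1-ball radius
A = rng.standard_normal((n, d))
y = A @ (rng.standard_normal(d) * (rng.random(d) < 0.1)) + 0.01 * rng.standard_normal(n)

w = np.zeros(d)
grad_table = np.zeros((n, d))                 # last seen per-sample gradients
grad_sum = np.zeros(d)                        # running sum of the table
for t in range(20000):
    i = rng.integers(n)                       # one sample per iteration
    g_i = A[i] * (A[i] @ w - y[i])            # gradient of 0.5*(a_i.w - y_i)^2
    grad_sum += g_i - grad_table[i]           # keep the running sum in sync
    grad_table[i] = g_i
    g = grad_sum / n                          # SAG estimate of the full gradient
    j = np.argmax(np.abs(g))                  # linear minimization oracle (l1 ball)
    v = np.zeros(d)
    v[j] = -r * np.sign(g[j])
    gamma = 2.0 / (t + 2)                     # standard Frank-Wolfe step size
    w = (1 - gamma) * w + gamma * v
print("final objective:", 0.5 * np.mean((A @ w - y) ** 2))
```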
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of the average mini-batch gradient with stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
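A hedged sketch of the stratified-sampling idea (not the paper's Gradient Clustering procedure): samples are grouped into strata by per-sample gradient norm at a fixed parameter point, and each mini-batch draws equally from every stratum, which typically shrinks the spread of the averaged mini-batch gradient around the full gradient. The data model and the stratification rule are assumptions made for the example.

```python
# Hedged sketch: stratified mini-batch sampling versus uniform sampling,
# comparing the squared deviation of the mini-batch gradient from the full
# gradient at a fixed parameter point (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n, d, b, k = 1024, 20, 64, 8                 # samples, dim, batch size, strata
X = rng.standard_normal((n, d)) * rng.gamma(2.0, 1.0, size=(n, 1))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)
w = np.zeros(d)
per_sample = X * (X @ w - y)[:, None]        # per-sample gradients at w
full_grad = per_sample.mean(axis=0)

order = np.argsort(np.linalg.norm(per_sample, axis=1))
strata = np.array_split(order, k)            # k strata of similar gradient norm

def batch_grad(idx):
    return per_sample[idx].mean(axis=0)

def spread(sampler, trials=2000):
    return np.mean([np.sum((batch_grad(sampler()) - full_grad) ** 2)
                    for _ in range(trials)])

uniform = lambda: rng.choice(n, size=b, replace=False)
stratified = lambda: np.concatenate(
    [rng.choice(s, size=b // k, replace=False) for s in strata])
print("uniform   :", spread(uniform))
print("stratified:", spread(stratified))
```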
arXiv Detail & Related papers (2020-07-09T03:23:10Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated, and the gradient must instead be estimated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z) - The Impact of the Mini-batch Size on the Variance of Gradients in
Stochastic Gradient Descent [28.148743710421932]
The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models.
We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks.
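A quick empirical sketch of the quantity this entry studies, the variance of the mini-batch gradient as a function of the batch size, on a synthetic linear regression problem; the problem dimensions, batch sizes, and number of trials are assumptions.

```python
# Rough sketch (not the paper's analysis): empirical scaling of the mini-batch
# gradient variance with the batch size, for a fixed linear regression problem
# and a fixed parameter vector.
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)
w = np.zeros(d)
full_grad = X.T @ (X @ w - y) / n

for b in (8, 32, 128, 512):
    devs = []
    for _ in range(300):
        idx = rng.choice(n, size=b, replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / b
        devs.append(np.sum((g - full_grad) ** 2))
    print(f"batch size {b:3d}: mean squared deviation ~ {np.mean(devs):.3f}")
```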
arXiv Detail & Related papers (2020-04-27T20:06:11Z) - The Implicit Regularization of Stochastic Gradient Flow for Least
Squares [24.976079444818552]
We study the implicit regularization of mini-batch gradient descent, when applied to the fundamental problem of least squares regression.
We leverage a continuous-time stochastic differential equation having the same moments as stochastic gradient descent, which we call stochastic gradient flow.
We give a bound on the excess risk of stochastic gradient flow at time $t$, over ridge regression with tuning parameter $\lambda = 1/t$.
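To make the quoted correspondence concrete, a small numerical sketch (not the paper's risk bound) compares the deterministic least-squares gradient flow at time $t$, the noiseless analogue of the stochastic gradient flow above, with ridge regression at tuning parameter $\lambda = 1/t$; the data and the time grid are illustrative.

```python
# Illustrative sketch: the least-squares gradient flow at time t is close to
# ridge regression with regularization lambda = 1/t.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 30
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)
S = X.T @ X / n                              # sample covariance
b = X.T @ y / n
evals, evecs = np.linalg.eigh(S)             # spectral decomposition of S

def gradient_flow(t):
    # Closed form of dw/dt = -(S w - b), w(0) = 0.
    return evecs @ (((1 - np.exp(-t * evals)) / evals) * (evecs.T @ b))

def ridge(lam):
    return np.linalg.solve(S + lam * np.eye(d), b)

for t in (1.0, 10.0, 100.0):
    gap = np.linalg.norm(gradient_flow(t) - ridge(1.0 / t))
    print(f"t = {t:6.1f}   ||w_flow(t) - w_ridge(1/t)|| = {gap:.3f}")
```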
arXiv Detail & Related papers (2020-03-17T16:37:25Z) - Amortized variance reduction for doubly stochastic objectives [17.064916635597417]
Approximate inference in complex probabilistic models requires optimisation of doubly stochastic objective functions.
Current approaches do not take into account how mini-batch stochasticity affects sampling stochasticity, resulting in sub-optimal variance reduction.
We propose a new approach in which we use a recognition network to cheaply approximate the optimal control variate for each mini-batch, with no additional gradient computations.
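A toy sketch of the underlying control-variate principle (the recognition network that learns the control variate per mini-batch is not reproduced; the toy estimator and control variate below are assumptions): subtracting a correlated term with known expectation keeps the estimator unbiased while reducing its variance.

```python
# Toy sketch: variance reduction of a doubly stochastic estimator (random data
# index i and random noise eps) with a control variate whose expectation over
# eps is known in closed form.
import numpy as np

rng = np.random.default_rng(0)
n = 100
theta = rng.standard_normal(n)               # per-datapoint parameters (toy)

def estimator(i, eps):                       # noisy per-sample quantity (toy)
    return np.sin(theta[i]) + theta[i] * eps + 0.3 * eps ** 2

def control_variate(i, eps):                 # correlated with the estimator,
    return theta[i] * eps                    # and E_eps[control_variate] = 0

plain, controlled = [], []
for _ in range(20000):
    i, eps = rng.integers(n), rng.standard_normal()
    val = estimator(i, eps)
    plain.append(val)
    controlled.append(val - control_variate(i, eps))  # still unbiased
print("variance without control variate:", np.var(plain))
print("variance with control variate   :", np.var(controlled))
```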
arXiv Detail & Related papers (2020-03-09T13:23:14Z)