The Effect of SGD Batch Size on Autoencoder Learning: Sparsity,
Sharpness, and Feature Learning
- URL: http://arxiv.org/abs/2308.03215v1
- Date: Sun, 6 Aug 2023 21:54:07 GMT
- Title: The Effect of SGD Batch Size on Autoencoder Learning: Sparsity,
Sharpness, and Feature Learning
- Authors: Nikhil Ghosh, Spencer Frei, Wooseok Ha, and Bin Yu
- Abstract summary: We investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder.
For any batch size smaller than the number of samples, SGD finds a global minimum which is sparse and nearly orthogonal to its initialization.
- Score: 14.004531386769328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we investigate the dynamics of stochastic gradient descent
(SGD) when training a single-neuron autoencoder with linear or ReLU activation
on orthogonal data. We show that for this non-convex problem, randomly
initialized SGD with a constant step size successfully finds a global minimum
for any batch size choice. However, the particular global minimum found depends
upon the batch size. In the full-batch setting, we show that the solution is
dense (i.e., not sparse) and is highly aligned with its initialized direction,
showing that relatively little feature learning occurs. On the other hand, for
any batch size strictly smaller than the number of samples, SGD finds a global
minimum which is sparse and nearly orthogonal to its initialization, showing
that the randomness of stochastic gradients induces a qualitatively different
type of "feature selection" in this setting. Moreover, if we measure the
sharpness of the minimum by the trace of the Hessian, the minima found with
full batch gradient descent are flatter than those found with strictly smaller
batch sizes, in contrast to previous works which suggest that large batches
lead to sharper minima. To prove convergence of SGD with a constant step size,
we introduce a powerful tool from the theory of non-homogeneous random walks
which may be of independent interest.
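To make the setting above concrete, here is a minimal simulation sketch of the experiment the abstract describes. It is not the paper's code: it assumes a tied-weight single-neuron ReLU autoencoder x_hat = relu(w . x) * w, orthonormal data given by standard basis vectors, and illustrative choices of step size, iteration count, and initialization scale, any of which may differ from the paper's exact setup. The script compares a full-batch run with a batch-size-1 run on the three quantities discussed above: sparsity of the learned weight, alignment with the initialization, and sharpness measured by the trace of the Hessian.

```python
# Minimal sketch (assumptions noted above), comparing full-batch gradient
# descent with batch-size-1 SGD for a single-neuron autoencoder.
import numpy as np

rng = np.random.default_rng(0)
n = d = 8                    # n orthogonal samples in R^d (standard basis here)
X = np.eye(d)                # rows are the data points x_1, ..., x_n
eta, steps = 0.05, 30000     # constant step size and number of iterations

def loss(w, Xb):
    """Mean squared reconstruction error of x_hat = relu(w . x) * w on a batch."""
    a = np.maximum(Xb @ w, 0.0)          # per-sample ReLU activations
    resid = a[:, None] * w - Xb          # reconstruction minus data
    return 0.5 * np.mean(np.sum(resid ** 2, axis=1))

def grad(w, Xb):
    """Batch gradient of the loss above with respect to w."""
    z = Xb @ w
    a = np.maximum(z, 0.0)
    resid = a[:, None] * w - Xb
    gate = (z > 0.0).astype(float)       # ReLU derivative
    g = (gate * (resid @ w))[:, None] * Xb + a[:, None] * resid
    return g.mean(axis=0)

def hessian_trace(w, Xd, h=1e-3):
    """Coordinate-wise finite-difference estimate of tr(Hessian) of the full-batch loss."""
    base = loss(w, Xd)
    tr = 0.0
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = h
        tr += (loss(w + e, Xd) - 2.0 * base + loss(w - e, Xd)) / h ** 2
    return tr

def train(batch_size, w0):
    """Constant-step-size SGD; batch_size = n recovers full-batch gradient descent."""
    w = w0.copy()
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        w -= eta * grad(w, X[idx])
    return w

w0 = 0.3 * rng.standard_normal(d)        # small random initialization
for bs in (n, 1):                        # full batch vs. single-sample batches
    w = train(bs, w0)
    sparsity = np.mean(np.abs(w) < 1e-2)                 # fraction of near-zero coordinates
    align = abs(w @ w0) / (np.linalg.norm(w) * np.linalg.norm(w0) + 1e-12)
    print(f"batch={bs}: loss={loss(w, X):.4f}  frac(|w_j|<1e-2)={sparsity:.2f}  "
          f"|cos(w, w0)|={align:.2f}  tr(H)~{hessian_trace(w, X):.3f}")
```

Per the abstract, one would expect the full-batch solution to come out dense, aligned with the initialization, and flatter (smaller Hessian trace), and the batch-size-1 solution to come out sparse, nearly orthogonal to the initialization, and sharper. The finite-difference Hessian trace is only a rough proxy near ReLU kinks; an automatic-differentiation Hessian would be a drop-in alternative.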
Related papers
- Error dynamics of mini-batch gradient descent with random reshuffling for least squares regression [4.159762735751163]
We study the dynamics of mini-batch gradient descent with random reshuffling for least squares regression.
We show that the training and generalization errors depend on a sample cross-covariance matrix $Z$.
arXiv Detail & Related papers (2024-06-06T02:26:14Z)
- Relationship between Batch Size and Number of Steps Needed for Nonconvex Optimization of Stochastic Gradient Descent using Armijo Line Search [0.8158530638728501]
We show that SGD performs better than other deep learning optimizers when it uses the Armijo line search.
The results indicate that the number of stochastic first-order oracle (SFO) calls needed can be estimated as the batch size grows.
arXiv Detail & Related papers (2023-07-25T21:59:17Z)
- Critical Batch Size Minimizes Stochastic First-Order Oracle Complexity of Deep Learning Optimizer using Hyperparameters Close to One [0.0]
We show that deep learning optimizers using small constant learning rates, hyperparameters close to one, and large batch sizes can find the model parameters of deep neural networks that minimize the loss functions.
Results indicate that Adam using a small constant learning rate, hyperparameters close to one, and the critical batch size minimizing SFO complexity converges faster than Momentum and stochastic gradient descent (SGD).
arXiv Detail & Related papers (2022-08-21T06:11:23Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster, but only up to an error neighborhood of the solution.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
- Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
- A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)
- Choosing the Sample with Lowest Loss makes SGD Robust [19.08973384659313]
We propose a simple variant of the vanilla stochastic gradient descent (SGD) method: in each step, the update uses the candidate sample with the smallest current loss.
This represents a new algorithm that is however effectively minimizing a non-convex surrogate objective.
Our theoretical analysis of this idea for ML problems is backed up with small-scale neural network experiments.
arXiv Detail & Related papers (2020-01-10T05:39:17Z)
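The lowest-loss selection rule summarized in the entry above lends itself to a short sketch. This is not the authors' implementation: the toy regression task with corrupted labels, the candidate-set size k, the step size, and the iteration count are illustrative assumptions; vanilla SGD corresponds to k = 1.

```python
# Minimal sketch of "min-k-loss" SGD: at each step, draw k candidate samples,
# keep the one with the smallest current loss, and take a plain SGD step on it.
import numpy as np

rng = np.random.default_rng(1)

# Toy linear regression with a fraction of grossly corrupted (outlier) labels.
n, d = 500, 5
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)
outliers = rng.random(n) < 0.1
y[outliers] += 20.0 * rng.standard_normal(outliers.sum())

def min_k_loss_sgd(k, eta=0.01, steps=20000):
    """SGD that updates on the lowest-loss sample among k random candidates."""
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=k, replace=False)      # k candidate samples
        losses = 0.5 * (X[idx] @ w - y[idx]) ** 2       # current per-sample losses
        i = idx[np.argmin(losses)]                      # keep the lowest-loss one
        w -= eta * (X[i] @ w - y[i]) * X[i]             # ordinary SGD step on it
    return w

for k in (1, 4):   # k = 1 is vanilla SGD; k > 1 is the lowest-loss variant
    w_hat = min_k_loss_sgd(k)
    print(f"k={k}: parameter error ||w_hat - w_true|| = {np.linalg.norm(w_hat - w_true):.3f}")
```

On data with heavy label corruption, updating on the lowest-loss candidate tends to skip the outliers, which is the robustness effect the entry refers to.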