The Effect of SGD Batch Size on Autoencoder Learning: Sparsity,
Sharpness, and Feature Learning
- URL: http://arxiv.org/abs/2308.03215v1
- Date: Sun, 6 Aug 2023 21:54:07 GMT
- Title: The Effect of SGD Batch Size on Autoencoder Learning: Sparsity,
Sharpness, and Feature Learning
- Authors: Nikhil Ghosh, Spencer Frei, Wooseok Ha, and Bin Yu
- Abstract summary: We investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder.
For any batch size smaller than the number of samples, SGD finds a global minimum which is sparse and nearly orthogonal to its initialization.
- Score: 14.004531386769328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we investigate the dynamics of stochastic gradient descent
(SGD) when training a single-neuron autoencoder with linear or ReLU activation
on orthogonal data. We show that for this non-convex problem, randomly
initialized SGD with a constant step size successfully finds a global minimum
for any batch size choice. However, the particular global minimum found depends
upon the batch size. In the full-batch setting, we show that the solution is
dense (i.e., not sparse) and is highly aligned with its initialized direction,
showing that relatively little feature learning occurs. On the other hand, for
any batch size strictly smaller than the number of samples, SGD finds a global
minimum which is sparse and nearly orthogonal to its initialization, showing
that the randomness of stochastic gradients induces a qualitatively different
type of "feature selection" in this setting. Moreover, if we measure the
sharpness of the minimum by the trace of the Hessian, the minima found with
full batch gradient descent are flatter than those found with strictly smaller
batch sizes, in contrast to previous works which suggest that large batches
lead to sharper minima. To prove convergence of SGD with a constant step size,
we introduce a powerful tool from the theory of non-homogeneous random walks
which may be of independent interest.
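To make the setting above concrete, here is a minimal simulation sketch of the experiment the abstract describes. It is not the paper's code: it assumes a tied-weight single-neuron ReLU autoencoder x_hat = relu(w . x) * w, orthonormal data given by standard basis vectors, and illustrative choices of step size, iteration count, and initialization scale, any of which may differ from the paper's exact setup. The script compares a full-batch run with a batch-size-1 run on the three quantities discussed above: sparsity of the learned weight, alignment with the initialization, and sharpness measured by the trace of the Hessian.

```python
# Minimal sketch (assumptions noted above), comparing full-batch gradient
# descent with batch-size-1 SGD for a single-neuron autoencoder.
import numpy as np

rng = np.random.default_rng(0)
n = d = 8                    # n orthogonal samples in R^d (standard basis here)
X = np.eye(d)                # rows are the data points x_1, ..., x_n
eta, steps = 0.05, 30000     # constant step size and number of iterations

def loss(w, Xb):
    """Mean squared reconstruction error of x_hat = relu(w . x) * w on a batch."""
    a = np.maximum(Xb @ w, 0.0)          # per-sample ReLU activations
    resid = a[:, None] * w - Xb          # reconstruction minus data
    return 0.5 * np.mean(np.sum(resid ** 2, axis=1))

def grad(w, Xb):
    """Batch gradient of the loss above with respect to w."""
    z = Xb @ w
    a = np.maximum(z, 0.0)
    resid = a[:, None] * w - Xb
    gate = (z > 0.0).astype(float)       # ReLU derivative
    g = (gate * (resid @ w))[:, None] * Xb + a[:, None] * resid
    return g.mean(axis=0)

def hessian_trace(w, Xd, h=1e-3):
    """Coordinate-wise finite-difference estimate of tr(Hessian) of the full-batch loss."""
    base = loss(w, Xd)
    tr = 0.0
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = h
        tr += (loss(w + e, Xd) - 2.0 * base + loss(w - e, Xd)) / h ** 2
    return tr

def train(batch_size, w0):
    """Constant-step-size SGD; batch_size = n recovers full-batch gradient descent."""
    w = w0.copy()
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        w -= eta * grad(w, X[idx])
    return w

w0 = 0.3 * rng.standard_normal(d)        # small random initialization
for bs in (n, 1):                        # full batch vs. single-sample batches
    w = train(bs, w0)
    sparsity = np.mean(np.abs(w) < 1e-2)                 # fraction of near-zero coordinates
    align = abs(w @ w0) / (np.linalg.norm(w) * np.linalg.norm(w0) + 1e-12)
    print(f"batch={bs}: loss={loss(w, X):.4f}  frac(|w_j|<1e-2)={sparsity:.2f}  "
          f"|cos(w, w0)|={align:.2f}  tr(H)~{hessian_trace(w, X):.3f}")
```

Per the abstract, one would expect the full-batch solution to come out dense, aligned with the initialization, and flatter (smaller Hessian trace), and the batch-size-1 solution to come out sparse, nearly orthogonal to the initialization, and sharper. The finite-difference Hessian trace is only a rough proxy near ReLU kinks; an automatic-differentiation Hessian would be a drop-in alternative.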
Related papers
- Error dynamics of mini-batch gradient descent with random reshuffling for least squares regression [4.159762735751163]
We study the dynamics of mini-batch gradient descent with random reshuffling for least squares regression.
We show that the training and generalization errors depend on a sample cross-covariance matrix $Z$.
arXiv Detail & Related papers (2024-06-06T02:26:14Z)
- Relationship between Batch Size and Number of Steps Needed for Nonconvex Optimization of Stochastic Gradient Descent using Armijo Line Search [0.8158530638728501]
We show that SGD performs better than other deep learning optimizers when it uses the Armijo line search.
The results indicate that the number of stochastic first-order oracle (SFO) calls needed can be estimated as the batch size grows.
arXiv Detail & Related papers (2023-07-25T21:59:17Z)
- Critical Batch Size Minimizes Stochastic First-Order Oracle Complexity of Deep Learning Optimizer using Hyperparameters Close to One [0.0]
We show that deep learning optimizers using small constant learning rates, hyperparameters close to one, and large batch sizes can find the model parameters of deep neural networks that minimize the loss functions.
Results indicate that Adam using a small constant learning rate, hyperparameters close to one, and the critical batch size minimizing SFO complexity converges faster than Momentum and stochastic gradient descent (SGD).
arXiv Detail & Related papers (2022-08-21T06:11:23Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster, but only up to an error neighborhood of the solution.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
- Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
- A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)
- Choosing the Sample with Lowest Loss makes SGD Robust [19.08973384659313]
We propose a simple variant of the vanilla stochastic gradient descent (SGD) method: in each step, the update uses the candidate sample with the smallest current loss.
This represents a new algorithm that is however effectively minimizing a non-convex surrogate objective.
Our theoretical analysis of this idea for ML problems is backed up with small-scale neural network experiments.
arXiv Detail & Related papers (2020-01-10T05:39:17Z)
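The lowest-loss selection rule summarized in the entry above lends itself to a short sketch. This is not the authors' implementation: the toy regression task with corrupted labels, the candidate-set size k, the step size, and the iteration count are illustrative assumptions; vanilla SGD corresponds to k = 1.

```python
# Minimal sketch of "min-k-loss" SGD: at each step, draw k candidate samples,
# keep the one with the smallest current loss, and take a plain SGD step on it.
import numpy as np

rng = np.random.default_rng(1)

# Toy linear regression with a fraction of grossly corrupted (outlier) labels.
n, d = 500, 5
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)
outliers = rng.random(n) < 0.1
y[outliers] += 20.0 * rng.standard_normal(outliers.sum())

def min_k_loss_sgd(k, eta=0.01, steps=20000):
    """SGD that updates on the lowest-loss sample among k random candidates."""
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=k, replace=False)      # k candidate samples
        losses = 0.5 * (X[idx] @ w - y[idx]) ** 2       # current per-sample losses
        i = idx[np.argmin(losses)]                      # keep the lowest-loss one
        w -= eta * (X[i] @ w - y[i]) * X[i]             # ordinary SGD step on it
    return w

for k in (1, 4):   # k = 1 is vanilla SGD; k > 1 is the lowest-loss variant
    w_hat = min_k_loss_sgd(k)
    print(f"k={k}: parameter error ||w_hat - w_true|| = {np.linalg.norm(w_hat - w_true):.3f}")
```

On data with heavy label corruption, updating on the lowest-loss candidate tends to skip the outliers, which is the robustness effect the entry refers to.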