Implicit Sparse Regularization: The Impact of Depth and Early Stopping
- URL: http://arxiv.org/abs/2108.05574v1
- Date: Thu, 12 Aug 2021 07:43:29 GMT
- Title: Implicit Sparse Regularization: The Impact of Depth and Early Stopping
- Authors: Jiangyuan Li, Thanh V. Nguyen, Chinmay Hegde and Raymond K. W. Wong
- Abstract summary: We show that early stopping is crucial for gradient descent to converge to a sparse model.
We characterize the impact of depth and early stopping and show that for a general depth parameter N, gradient descent with early stopping achieves minimax optimal sparse recovery.
- Score: 35.4113861165802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the implicit bias of gradient descent for sparse
regression. We extend results on regression with quadratic parametrization,
which amounts to depth-2 diagonal linear networks, to more general depth-N
networks, under more realistic settings of noise and correlated designs. We
show that early stopping is crucial for gradient descent to converge to a
sparse model, a phenomenon that we call implicit sparse regularization. This
result is in sharp contrast to known results for noiseless and
uncorrelated-design cases. We characterize the impact of depth and early
stopping and show that for a general depth parameter N, gradient descent with
early stopping achieves minimax optimal sparse recovery with sufficiently small
initialization and step size. In particular, we show that increasing depth
enlarges the scale of working initialization and the early-stopping window,
which leads to more stable gradient paths for sparse recovery.
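Below is a minimal sketch, not the authors' code, of the phenomenon described in the abstract: gradient descent on a depth-N diagonal parametrization beta = u**N - v**N for noisy sparse regression, tracking the recovery error along the iterates. The exact parametrization, problem sizes, initialization scale alpha, step size eta, noise level, and iteration budget are illustrative assumptions rather than values from the paper.
```python
# Minimal sketch (not the authors' code): gradient descent on a depth-N
# "diagonal" parametrization beta = u**N - v**N for noisy sparse regression,
# illustrating why early stopping matters under noise.
import numpy as np

rng = np.random.default_rng(0)
n, d, s, N = 100, 200, 5, 3            # samples, dimension, sparsity, depth
alpha, eta, T = 1e-2, 1e-2, 50_000     # init scale, step size, iterations

beta_star = np.zeros(d)
beta_star[:s] = 1.0                     # true s-sparse signal
X = rng.standard_normal((n, d)) / np.sqrt(n)
y = X @ beta_star + 0.1 * rng.standard_normal(n)   # noisy observations

u = alpha * np.ones(d)                  # small, identical initialization
v = alpha * np.ones(d)

errors = []
for t in range(T):
    beta = u**N - v**N
    grad_beta = X.T @ (X @ beta - y)        # gradient of 0.5 * ||y - X beta||^2
    u -= eta * N * u**(N - 1) * grad_beta   # chain rule through  u**N
    v += eta * N * v**(N - 1) * grad_beta   # chain rule through -v**N
    errors.append(np.linalg.norm(u**N - v**N - beta_star))

best_t = int(np.argmin(errors))
print(f"best recovery error {errors[best_t]:.3f} at iteration {best_t}")
print(f"final recovery error {errors[-1]:.3f} after {T} iterations")
# With these settings the error typically falls, plateaus, and then rises as
# the iterates begin to fit the noise, so the iterate chosen by early stopping
# (the minimum above) is better than the one obtained by running to the end.
```
Per the abstract, increasing the depth N (together with the initialization scale) should enlarge the range of workable initializations and widen the window between the iteration at which the signal is recovered and the iteration at which noise starts being fit; this can be explored by varying N and alpha in the sketch.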
Related papers
- Understanding the training of infinitely deep and wide ResNets with Conditional Optimal Transport [26.47265060394168]
We show that the gradient flow for infinitely deep and wide ResNets converges to a global minimizer of the training objective, provided the initialization lies within a suitable distance of one.
This is done by relying on the theory of gradient flows in metric spaces.
arXiv Detail & Related papers (2024-03-19T16:34:31Z) - Implicit regularization in AI meets generalized hardness of
approximation in optimization -- Sharp results for diagonal linear networks [0.0]
We show sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks.
We link this to the phenomenon of phase transitions in generalized hardness of approximation.
Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem.
arXiv Detail & Related papers (2023-07-13T13:27:51Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights (see the sketch after this list).
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Continuous vs. Discrete Optimization of Deep Neural Networks [15.508460240818575]
We show that over deep neural networks with homogeneous activations, gradient flow trajectories enjoy favorable curvature.
This finding allows us to translate an analysis of gradient flow over deep linear neural networks into a guarantee that gradient descent efficiently converges to a global minimum.
We hypothesize that the theory of gradient flows will be central to unraveling mysteries behind deep learning.
arXiv Detail & Related papers (2021-07-14T10:59:57Z) - Vanishing Curvature and the Power of Adaptive Methods in Randomly
Initialized Deep Networks [30.467121747150816]
This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in deep, randomly initialized neural networks.
We first show that vanishing gradients cannot be circumvented when the network width scales with less than O(depth); a small numerical illustration in this spirit appears after this list.
arXiv Detail & Related papers (2021-06-07T16:29:59Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate more stable and better-performing training of deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - The Break-Even Point on Optimization Trajectories of Deep Neural
Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of deep neural networks.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
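For Scaling Forward Gradient With Local Losses above, the variance claim can be checked in a toy setting. The sketch below, which is not the paper's implementation, compares forward-gradient estimates for a single linear layer with a local quadratic loss when the random perturbation lives in weight space versus in activation space. The layer sizes and trial count are illustrative assumptions, and the directional derivatives are computed analytically here, whereas in practice they come from forward-mode AD without a backward pass.
```python
# Toy comparison (not the paper's code): variance of forward-gradient
# estimates with weight-space vs. activation-space perturbations for a
# single linear layer y = W x under a local quadratic loss 0.5*||y - t||^2.
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 32, 512, 2000          # output dim, input dim, Monte Carlo trials

W = rng.standard_normal((m, n)) / np.sqrt(n)
x = rng.standard_normal(n)
t = rng.standard_normal(m)            # local target

y = W @ x
g_y = y - t                            # true gradient of the loss w.r.t. y
G = np.outer(g_y, x)                   # true gradient of the loss w.r.t. W

err_w, err_a = 0.0, 0.0
for _ in range(trials):
    # Weight perturbation: random direction V in R^{m x n}.
    V = rng.standard_normal((m, n))
    G_hat_w = np.sum(G * V) * V        # (dL along V) * V, unbiased for G
    err_w += np.sum((G_hat_w - G) ** 2)

    # Activity perturbation: random direction u in R^m on the activation y.
    u = rng.standard_normal(m)
    g_y_hat = (g_y @ u) * u            # (dL along u) * u, unbiased for g_y
    G_hat_a = np.outer(g_y_hat, x)     # map activation gradient back to weights
    err_a += np.sum((G_hat_a - G) ** 2)

print(f"mean squared error, weight perturbation:   {err_w / trials:.3e}")
print(f"mean squared error, activity perturbation: {err_a / trials:.3e}")
# The weight-space error grows with the number of weights (m*n); the
# activity-space error grows only with the number of activations (m).
```
Because the perturbed space has m rather than m*n dimensions, the activity-perturbed estimator's error is smaller by roughly a factor of the input width, which is the variance reduction the summary above refers to.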
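For Vanishing Curvature and the Power of Adaptive Methods in Randomly Initialized Deep Networks above, here is a simple numerical illustration in the spirit of that observation rather than the paper's exact setting: the median first-layer gradient norm of a deep ReLU network at He initialization, for a narrow and a wide network of the same depth. The depth, widths, input dimension, and trial count are illustrative assumptions.
```python
# Illustration (not the paper's construction): first-layer gradient norms of
# deep ReLU networks at random He initialization, narrow vs. wide width.
import numpy as np

rng = np.random.default_rng(0)

def first_layer_grad_norm(depth, width, n_in=64):
    """Gradient norm of a scalar output w.r.t. the first pre-activation z_1."""
    x = rng.standard_normal(n_in)
    Ws = [rng.standard_normal((width, n_in)) * np.sqrt(2.0 / n_in)]
    Ws += [rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
           for _ in range(depth - 1)]
    w_out = rng.standard_normal(width) * np.sqrt(2.0 / width)

    # Forward pass with ReLU, caching the pre-activations.
    pre, h = [], x
    for W in Ws:
        z = W @ h
        pre.append(z)
        h = np.maximum(z, 0.0)

    # Backward pass by hand, from the output w_out . h_L down to z_1.
    g = w_out
    for k in range(depth, 0, -1):
        g = g * (pre[k - 1] > 0)       # through the ReLU at layer k
        if k > 1:
            g = Ws[k - 1].T @ g        # through W_k to layer k-1
    return np.linalg.norm(g)

depth, trials = 64, 200
for width in (16, 256):
    norms = [first_layer_grad_norm(depth, width) for _ in range(trials)]
    print(f"width {width:3d}, depth {depth}: "
          f"median first-layer grad norm {np.median(norms):.3e}")
# The narrow network's typical gradient is markedly smaller than the wide
# network's at the same depth, consistent with gradients vanishing when the
# width grows more slowly than the depth.
```
At fixed depth, the contrast between the two widths is in line with the width-versus-depth condition highlighted in the summary above.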