Explicit Regularization via Regularizer Mirror Descent
- URL: http://arxiv.org/abs/2202.10788v1
- Date: Tue, 22 Feb 2022 10:21:44 GMT
- Title: Explicit Regularization via Regularizer Mirror Descent
- Authors: Navid Azizan, Sahin Lale, and Babak Hassibi
- Abstract summary: We propose a new method for training deep neural networks (DNNs) with regularization, called regularizer mirror descent (RMD).
RMD simultaneously interpolates the training data and minimizes a certain potential function of the weights.
Our results suggest that the generalization performance of RMD is remarkably robust and significantly better than that of both stochastic gradient descent (SGD) and weight decay.
- Score: 32.0512015286512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite perfectly interpolating the training data, deep neural networks
(DNNs) can often generalize fairly well, in part due to the "implicit
regularization" induced by the learning algorithm. Nonetheless, various forms
of regularization, such as "explicit regularization" (via weight decay), are
often used to avoid overfitting, especially when the data is corrupted. There
are several challenges with explicit regularization, most notably unclear
convergence properties. Inspired by convergence properties of stochastic mirror
descent (SMD) algorithms, we propose a new method for training DNNs with
regularization, called regularizer mirror descent (RMD). In highly
overparameterized DNNs, SMD simultaneously interpolates the training data and
minimizes a certain potential function of the weights. RMD starts with a
standard cost which is the sum of the training loss and a convex regularizer of
the weights. Reinterpreting this cost as the potential of an "augmented"
overparameterized network and applying SMD yields RMD. As a result, RMD
inherits the properties of SMD and provably converges to a point "close" to the
minimizer of this cost. RMD is computationally comparable to stochastic
gradient descent (SGD) and weight decay, and is parallelizable in the same
manner. Our experimental results on training sets with various levels of
corruption suggest that the generalization performance of RMD is remarkably
robust and significantly better than both SGD and weight decay, which
implicitly and explicitly regularize the $\ell_2$ norm of the weights. RMD can
also be used to regularize the weights to a desired weight vector, which is
particularly relevant for continual learning.
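As a rough illustration of the mirror-descent machinery that RMD builds on (and not the authors' implementation, which folds the convex regularizer into the potential of an "augmented" network), the sketch below runs plain stochastic mirror descent with a q-norm potential on an overparameterized linear model. The potential, step size, and synthetic data are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of stochastic mirror descent (SMD) with the q-norm potential
# psi(w) = (1/q) * sum_i |w_i|^q on an overparameterized linear model.
# NOT the RMD algorithm itself: RMD additionally folds a convex regularizer
# into the potential via an augmented network. Hyperparameters are illustrative.

def mirror_map(w, q):
    """Gradient of the q-norm potential, applied coordinate-wise."""
    return np.sign(w) * np.abs(w) ** (q - 1)

def inverse_mirror_map(z, q):
    """Inverse of the coordinate-wise mirror map (dual -> primal)."""
    return np.sign(z) * np.abs(z) ** (1.0 / (q - 1))

def smd(X, y, q=3.0, lr=1e-3, epochs=200):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = (xi @ w - yi) * xi             # gradient of 0.5 * (x.w - y)^2
            z = mirror_map(w, q) - lr * grad      # SMD step in the dual space
            w = inverse_mirror_map(z, q)          # map back to weight space
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 200))                # more parameters than samples
    y = rng.normal(size=50)
    w = smd(X, y)
    print("training residual:", np.linalg.norm(X @ w - y))
```

For q = 2 this update reduces to ordinary SGD; other potentials bias SMD toward different interpolating solutions, which is the property RMD exploits to impose an explicit regularizer while retaining SMD-style convergence guarantees.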
Related papers
- Deep Learning Weight Pruning with RMT-SVD: Increasing Accuracy and Reducing Overfitting [0.0]
The spectrum of the weight layers of a deep neural network (DNN) can be studied and understood using techniques from random matrix theory (RMT).
In this work, these RMT techniques will be used to determine which and how many singular values should be removed from the weight layers of a DNN during training, via singular value decomposition (SVD).
We show the results on a simple DNN model trained on MNIST.
arXiv Detail & Related papers (2023-03-15T23:19:45Z)
- The Generalization Error of Stochastic Mirror Descent on Over-Parametrized Linear Models [37.6314945221565]
Deep networks are known to generalize well to unseen data.
Regularization properties ensure interpolating solutions with "good" properties are found.
We present simulation results that validate the theory and introduce two data models.
arXiv Detail & Related papers (2023-02-18T22:23:42Z)
- Compound Batch Normalization for Long-tailed Image Classification [77.42829178064807]
We propose a compound batch normalization method based on a Gaussian mixture.
It can model the feature space more comprehensively and reduce the dominance of head classes.
The proposed method outperforms existing methods on long-tailed image classification.
arXiv Detail & Related papers (2022-12-02T07:31:39Z)
- Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z)
- Implicit Regularization Properties of Variance Reduced Stochastic Mirror Descent [7.00422423634143]
We prove that the discrete VRSMD estimator sequence converges to the minimum mirror interpolant in linear regression.
We derive a model estimation accuracy result in the setting when the true model is sparse.
arXiv Detail & Related papers (2022-04-29T19:37:24Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- Preprint: Norm Loss: An efficient yet effective regularization method for deep neural networks [7.214681039134488]
We propose a weight soft-regularization method based on the oblique manifold.
We evaluate our method on the popular CIFAR-10, CIFAR-100 and ImageNet 2012 datasets.
arXiv Detail & Related papers (2021-03-11T10:24:49Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes.
arXiv Detail & Related papers (2021-02-26T18:58:29Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
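The two momentum entries above concern SGD with heavy-ball momentum (SGDM). For reference, here is a minimal sketch of that update on a least-squares objective; the step size, momentum value, and synthetic data are illustrative assumptions, and the early-momentum variant (SGDEM) analyzed in those papers is not reproduced here.

```python
import numpy as np

# Minimal sketch of SGD with heavy-ball momentum (SGDM) on least squares.
# Illustrative hyperparameters only; not the setting analyzed in the papers above.

def sgdm(X, y, lr=0.01, momentum=0.9, epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)                          # velocity (momentum buffer)
    for _ in range(epochs):
        for i in rng.permutation(len(y)):         # one pass over shuffled samples
            grad = (X[i] @ w - y[i]) * X[i]       # gradient of 0.5 * (x.w - y)^2
            v = momentum * v - lr * grad          # heavy-ball velocity update
            w = w + v                             # take the momentum step
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 20))
    w_true = rng.normal(size=20)
    y = X @ w_true + 0.1 * rng.normal(size=500)
    print("parameter error:", np.linalg.norm(sgdm(X, y) - w_true))
```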
This list is automatically generated from the titles and abstracts of the papers in this site.