Neglected Hessian component explains mysteries in Sharpness regularization
- URL: http://arxiv.org/abs/2401.10809v2
- Date: Wed, 24 Jan 2024 19:09:06 GMT
- Title: Neglected Hessian component explains mysteries in Sharpness regularization
- Authors: Yann N. Dauphin, Atish Agarwala, Hossein Mobahi
- Abstract summary: We show that differences can be explained by the structure of the Hessian of the loss.
We find that regularizing feature exploitation but not feature exploration yields performance similar to gradient penalties.
- Score: 19.882170571967368
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that methods like SAM which either explicitly or
implicitly penalize second order information can improve generalization in deep
learning. Seemingly similar methods like weight noise and gradient penalties
often fail to provide such benefits. We show that these differences can be
explained by the structure of the Hessian of the loss. First, we show that a
common decomposition of the Hessian can be quantitatively interpreted as
separating the feature exploitation from feature exploration. The feature
exploration, which can be described by the Nonlinear Modeling Error matrix
(NME), is commonly neglected in the literature since it vanishes at
interpolation. Our work shows that the NME is in fact important as it can
explain why gradient penalties are sensitive to the choice of activation
function. Using this insight we design interventions to improve performance. We
also provide evidence that challenges the long held equivalence of weight noise
and gradient penalties. This equivalence relies on the assumption that the NME
can be ignored, which we find does not hold for modern networks since they
involve significant feature learning. We find that regularizing feature
exploitation but not feature exploration yields performance similar to gradient
penalties.
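The decomposition the abstract refers to splits the Hessian of the loss into a Gauss-Newton term (feature exploitation) and the Nonlinear Modeling Error matrix (feature exploration), which is proportional to the residual and therefore vanishes at interpolation. A minimal numeric sketch of this identity for squared loss is below; the two-parameter tanh model and the finite-difference derivatives are illustrative assumptions, not code from the paper.

```python
import numpy as np

# Toy model f(theta) with squared loss L = 0.5 * (f(theta) - y)^2.
# The loss Hessian splits exactly as  H = GN + NME, where
#   GN  = grad(f) grad(f)^T        (Gauss-Newton, "feature exploitation")
#   NME = (f - y) * Hess(f)        ("feature exploration", zero at interpolation)
x, y = 0.7, 0.3

def f(theta):
    return np.tanh(theta[0] * x) * theta[1]

def loss(theta):
    return 0.5 * (f(theta) - y) ** 2

def grad(fn, theta, eps=1e-5):
    # Central finite-difference gradient.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (fn(theta + e) - fn(theta - e)) / (2 * eps)
    return g

def hess(fn, theta, eps=1e-4):
    # Central finite-difference Hessian built from gradient evaluations.
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros_like(theta); e[i] = eps
        H[:, i] = (grad(fn, theta + e) - grad(fn, theta - e)) / (2 * eps)
    return H

theta = np.array([0.4, -1.2])     # away from interpolation, so f(theta) != y
H_full = hess(loss, theta)
gf = grad(f, theta)
GN = np.outer(gf, gf)             # Gauss-Newton part
NME = (f(theta) - y) * hess(f, theta)  # residual-weighted model curvature
print(np.allclose(H_full, GN + NME, atol=1e-4))
```

Because the NME carries the second derivative of the model, its contribution depends on the activation function's curvature, which is consistent with the abstract's point that gradient penalties are sensitive to the choice of activation.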
Related papers
- Multiple Descents in Unsupervised Learning: The Role of Noise, Domain Shift and Anomalies [14.399035468023161]
We study the presence of double descent in unsupervised learning, an area that has received little attention and is not yet fully understood.
We use synthetic and real data and identify model-wise, epoch-wise, and sample-wise double descent for various applications.
arXiv Detail & Related papers (2024-06-17T16:24:23Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Learning sparse features can lead to overfitting in neural networks [9.2104922520782]
We show that feature learning can perform worse than lazy training.
Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth.
arXiv Detail & Related papers (2022-06-24T14:26:33Z) - On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that a phenomenon can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
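The claimed effect can be seen in a small finite-dimensional stand-in for the separable Hilbert space setting (the matrix, learning rate, and step count below are illustrative assumptions, not taken from the paper): gradient descent from zero on a quadratic applies the spectral filter (1 - (1 - lr*lam)^t)/lam to each eigenmode, so with early stopping the learning rate determines which part of the spectrum the solution captures.

```python
import numpy as np

# Minimize 0.5 * theta^T A theta - b^T theta by gradient descent from zero.
# After t steps, theta_t = A^{-1} (I - (I - lr*A)^t) b, i.e. eigenmode i of A
# is shrunk by the spectral filter (1 - (1 - lr*lam_i)^t) / lam_i.
rng = np.random.default_rng(0)
lams = np.array([4.0, 1.0, 0.1])                # eigenvalues spanning the spectrum
Q = np.linalg.qr(rng.standard_normal((3, 3)))[0]
A = Q @ np.diag(lams) @ Q.T
b = rng.standard_normal(3)

def gd(lr, steps):
    theta = np.zeros(3)
    for _ in range(steps):
        theta -= lr * (A @ theta - b)           # gradient step on the quadratic
    return theta

t, lr = 20, 0.2                                  # early stopping after t steps
theta_t = gd(lr, t)
filt = (1 - (1 - lr * lams) ** t) / lams         # per-eigenmode spectral filter
pred = Q @ (filt * (Q.T @ b))
print(np.allclose(theta_t, pred))
```

With early stopping, large-eigenvalue modes (filter near 1/lam) are essentially fit while small-eigenvalue modes are still suppressed; raising the learning rate shifts which modes have converged at a fixed stopping time.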
arXiv Detail & Related papers (2022-02-28T13:01:04Z) - Classification and Adversarial examples in an Overparameterized Linear Model: A Signal Processing Perspective [10.515544361834241]
State-of-the-art deep learning classifiers are highly susceptible to infinitesimal adversarial perturbations.
We find that the learned model is susceptible to adversaries in an intermediate regime where classification generalizes but regression does not.
Despite the adversarial susceptibility, we find that classification with these features can be easier than the more commonly studied "independent feature" models.
arXiv Detail & Related papers (2021-09-27T17:35:42Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Can contrastive learning avoid shortcut solutions? [88.249082564465]
Implicit feature modification (IFM) is a method for altering positive and negative samples in order to guide contrastive models towards capturing a wider variety of predictive features.
IFM reduces feature suppression, and as a result improves performance on vision and medical imaging tasks.
arXiv Detail & Related papers (2021-06-21T16:22:43Z) - Disentangling Action Sequences: Discovering Correlated Samples [6.179793031975444]
We demonstrate that the data itself, rather than the factors, plays a crucial role in disentanglement, and that the disentangled representations align the latent variables with the action sequences.
We propose a novel framework, fractional variational autoencoder (FVAE) to disentangle the action sequences with different significance step-by-step.
Experimental results on dSprites and 3D Chairs show that FVAE improves the stability of disentanglement.
arXiv Detail & Related papers (2020-10-17T07:37:50Z) - Extreme Memorization via Scale of Initialization [72.78162454173803]
We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD.
We find that the extent and manner in which generalization ability is affected depends on the activation and loss function used.
In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function.
arXiv Detail & Related papers (2020-08-31T04:53:11Z) - DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.