Neglected Hessian component explains mysteries in Sharpness regularization
- URL: http://arxiv.org/abs/2401.10809v2
- Date: Wed, 24 Jan 2024 19:09:06 GMT
- Title: Neglected Hessian component explains mysteries in Sharpness regularization
- Authors: Yann N. Dauphin, Atish Agarwala, Hossein Mobahi
- Abstract summary: We show that differences can be explained by the structure of the Hessian of the loss.
We find that regularizing feature exploitation but not feature exploration yields performance similar to gradient penalties.
- Score: 19.882170571967368
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that methods like SAM which either explicitly or
implicitly penalize second order information can improve generalization in deep
learning. Seemingly similar methods like weight noise and gradient penalties
often fail to provide such benefits. We show that these differences can be
explained by the structure of the Hessian of the loss. First, we show that a
common decomposition of the Hessian can be quantitatively interpreted as
separating the feature exploitation from feature exploration. The feature
exploration, which can be described by the Nonlinear Modeling Error matrix
(NME), is commonly neglected in the literature since it vanishes at
interpolation. Our work shows that the NME is in fact important as it can
explain why gradient penalties are sensitive to the choice of activation
function. Using this insight we design interventions to improve performance. We
also provide evidence that challenges the long held equivalence of weight noise
and gradient penalties. This equivalence relies on the assumption that the NME
can be ignored, which we find does not hold for modern networks since they
involve significant feature learning. We find that regularizing feature
exploitation but not feature exploration yields performance similar to gradient
penalties.
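The decomposition the abstract refers to splits the Hessian of the loss into a Gauss-Newton term (feature exploitation) and the Nonlinear Modeling Error matrix (feature exploration), which is proportional to the residual and therefore vanishes at interpolation. A minimal numeric sketch of this identity for squared loss is below; the two-parameter tanh model and the finite-difference derivatives are illustrative assumptions, not code from the paper.

```python
import numpy as np

# Toy model f(theta) with squared loss L = 0.5 * (f(theta) - y)^2.
# The loss Hessian splits exactly as  H = GN + NME, where
#   GN  = grad(f) grad(f)^T        (Gauss-Newton, "feature exploitation")
#   NME = (f - y) * Hess(f)        ("feature exploration", zero at interpolation)
x, y = 0.7, 0.3

def f(theta):
    return np.tanh(theta[0] * x) * theta[1]

def loss(theta):
    return 0.5 * (f(theta) - y) ** 2

def grad(fn, theta, eps=1e-5):
    # Central finite-difference gradient.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (fn(theta + e) - fn(theta - e)) / (2 * eps)
    return g

def hess(fn, theta, eps=1e-4):
    # Central finite-difference Hessian built from gradient evaluations.
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros_like(theta); e[i] = eps
        H[:, i] = (grad(fn, theta + e) - grad(fn, theta - e)) / (2 * eps)
    return H

theta = np.array([0.4, -1.2])     # away from interpolation, so f(theta) != y
H_full = hess(loss, theta)
gf = grad(f, theta)
GN = np.outer(gf, gf)             # Gauss-Newton part
NME = (f(theta) - y) * hess(f, theta)  # residual-weighted model curvature
print(np.allclose(H_full, GN + NME, atol=1e-4))
```

Because the NME carries the second derivative of the model, its contribution depends on the activation function's curvature, which is consistent with the abstract's point that gradient penalties are sensitive to the choice of activation.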
Related papers
- Multiple Descents in Unsupervised Learning: The Role of Noise, Domain Shift and Anomalies [14.399035468023161]
We study the presence of double descent in unsupervised learning, an area that has received little attention and is not yet fully understood.
We use synthetic and real data and identify model-wise, epoch-wise, and sample-wise double descent for various applications.
arXiv Detail & Related papers (2024-06-17T16:24:23Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Learning sparse features can lead to overfitting in neural networks [9.2104922520782]
We show that feature learning can perform worse than lazy training.
Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth.
arXiv Detail & Related papers (2022-06-24T14:26:33Z) - On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that a phenomenon can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
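The claimed effect can be seen in a small finite-dimensional stand-in for the separable Hilbert space setting (the matrix, learning rate, and step count below are illustrative assumptions, not taken from the paper): gradient descent from zero on a quadratic applies the spectral filter (1 - (1 - lr*lam)^t)/lam to each eigenmode, so with early stopping the learning rate determines which part of the spectrum the solution captures.

```python
import numpy as np

# Minimize 0.5 * theta^T A theta - b^T theta by gradient descent from zero.
# After t steps, theta_t = A^{-1} (I - (I - lr*A)^t) b, i.e. eigenmode i of A
# is shrunk by the spectral filter (1 - (1 - lr*lam_i)^t) / lam_i.
rng = np.random.default_rng(0)
lams = np.array([4.0, 1.0, 0.1])                # eigenvalues spanning the spectrum
Q = np.linalg.qr(rng.standard_normal((3, 3)))[0]
A = Q @ np.diag(lams) @ Q.T
b = rng.standard_normal(3)

def gd(lr, steps):
    theta = np.zeros(3)
    for _ in range(steps):
        theta -= lr * (A @ theta - b)           # gradient step on the quadratic
    return theta

t, lr = 20, 0.2                                  # early stopping after t steps
theta_t = gd(lr, t)
filt = (1 - (1 - lr * lams) ** t) / lams         # per-eigenmode spectral filter
pred = Q @ (filt * (Q.T @ b))
print(np.allclose(theta_t, pred))
```

With early stopping, large-eigenvalue modes (filter near 1/lam) are essentially fit while small-eigenvalue modes are still suppressed; raising the learning rate shifts which modes have converged at a fixed stopping time.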
arXiv Detail & Related papers (2022-02-28T13:01:04Z) - Classification and Adversarial examples in an Overparameterized Linear Model: A Signal Processing Perspective [10.515544361834241]
State-of-the-art deep learning classifiers are highly susceptible to infinitesimal adversarial perturbations.
We find that the learned model is susceptible to adversaries in an intermediate regime where classification generalizes but regression does not.
Despite the adversarial susceptibility, we find that classification with these features can be easier than the more commonly studied "independent feature" models.
arXiv Detail & Related papers (2021-09-27T17:35:42Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Can contrastive learning avoid shortcut solutions? [88.249082564465]
Implicit feature modification (IFM) is a method for altering positive and negative samples in order to guide contrastive models towards capturing a wider variety of predictive features.
IFM reduces feature suppression, and as a result improves performance on vision and medical imaging tasks.
arXiv Detail & Related papers (2021-06-21T16:22:43Z) - Disentangling Action Sequences: Discovering Correlated Samples [6.179793031975444]
We demonstrate that the data itself, rather than the factors, plays a crucial role in disentanglement, and that the disentangled representations align the latent variables with the action sequences.
We propose a novel framework, fractional variational autoencoder (FVAE) to disentangle the action sequences with different significance step-by-step.
Experimental results on dSprites and 3D Chairs show that FVAE improves the stability of disentanglement.
arXiv Detail & Related papers (2020-10-17T07:37:50Z) - Extreme Memorization via Scale of Initialization [72.78162454173803]
We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD.
We find that the extent and manner in which generalization ability is affected depends on the activation and loss function used.
In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function.
arXiv Detail & Related papers (2020-08-31T04:53:11Z) - DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.