Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach
- URL: http://arxiv.org/abs/2306.08553v4
- Date: Mon, 23 Sep 2024 16:52:43 GMT
- Title: Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach
- Authors: Hongyang R. Zhang, Dongyue Li, Haotian Ju
- Abstract summary: We present an algorithm that effectively regularizes the Hessian of the loss, leading to regions with flat loss surfaces.
Our approach improves generalization when pretraining multimodal CLIP models and in chain-of-thought fine-tuning.
- Score: 18.009376840944284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The training of over-parameterized neural networks has received much study in recent literature. An important consideration is the regularization of over-parameterized networks due to their highly nonconvex and nonlinear geometry. In this paper, we study noise injection algorithms, which can regularize the Hessian of the loss, leading to regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, naively implementing the noise injection via adding noise to the weight matrices before backpropagation presents limited empirical improvements. To address this limitation, we design a two-point estimate of the Hessian penalty, which injects noise into the weight matrices along both positive and negative directions of the random noise. In particular, this two-point estimate eliminates the variance of the first-order Taylor's expansion term on the Hessian. We show a PAC-Bayes generalization bound that depends on the trace of the Hessian (and the radius of the weight space), which can be measured from data. We conduct a detailed experimental study to validate our approach and show that it can effectively regularize the Hessian and improve generalization. First, our algorithm can outperform prior approaches on sharpness-reduced training, delivering up to a 2.4% test accuracy increase for fine-tuning ResNets on six image classification datasets. Moreover, the trace of the Hessian reduces by 15.8%, and the largest eigenvalue is reduced by 9.7% with our approach. We also find that the regularization of the Hessian can be combined with weight decay and data augmentation, leading to stronger regularization. Second, our approach remains effective for improving generalization in pretraining multimodal CLIP models and chain-of-thought fine-tuning.
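To make the two-point noise injection described in the abstract concrete, below is a minimal PyTorch-style sketch of one training step. It is an illustrative reading of the abstract only, not the authors' released implementation: the function name `two_point_noise_step`, the `batch["x"]`/`batch["y"]` accessors, and the default `sigma` are assumptions introduced here.

```python
# Minimal sketch (assumptions noted above): perturb the weights by +u and -u
# with u ~ N(0, sigma^2 I), average the two losses/gradients, then restore the
# original weights. Averaging the two perturbed losses cancels the first-order
# Taylor term, leaving roughly loss(w) + (sigma^2 / 2) * trace(Hessian).
import torch

def two_point_noise_step(model, loss_fn, batch, optimizer, sigma=0.01):
    params = [p for p in model.parameters() if p.requires_grad]

    # One isotropic Gaussian perturbation per weight tensor.
    noise = [sigma * torch.randn_like(p) for p in params]

    optimizer.zero_grad()

    # Forward/backward at w + u (gradients accumulate into p.grad).
    with torch.no_grad():
        for p, u in zip(params, noise):
            p.add_(u)
    loss_plus = loss_fn(model(batch["x"]), batch["y"])
    (0.5 * loss_plus).backward()

    # Forward/backward at w - u (move by -2u from w + u).
    with torch.no_grad():
        for p, u in zip(params, noise):
            p.sub_(2.0 * u)
    loss_minus = loss_fn(model(batch["x"]), batch["y"])
    (0.5 * loss_minus).backward()

    # Restore the original weights before applying the averaged gradient.
    with torch.no_grad():
        for p, u in zip(params, noise):
            p.add_(u)

    optimizer.step()
    return 0.5 * (loss_plus.item() + loss_minus.item())
```

The averaged gradient 0.5 * (grad L(w + u) + grad L(w - u)) is what distinguishes this from naive one-sided noise injection: per the abstract, the two-point estimate removes the variance contributed by the first-order Taylor term, which is why it yields a better-behaved Hessian penalty in practice.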
Related papers
- Unrolled denoising networks provably learn optimal Bayesian inference [54.79172096306631]
We prove the first rigorous learning guarantees for neural networks based on unrolling approximate message passing (AMP)
For compressed sensing, we prove that when trained on data drawn from a product prior, the layers of the network converge to the same denoisers used in Bayes AMP.
arXiv Detail & Related papers (2024-09-19T17:56:16Z) - Epistemic Uncertainty and Observation Noise with the Neural Tangent Kernel [12.464924018243988]
Recent work has shown that training wide neural networks with gradient descent is formally equivalent to computing the mean of the posterior distribution in a Gaussian Process.
We show how to deal with non-zero aleatoric noise and derive an estimator for the posterior covariance.
arXiv Detail & Related papers (2024-09-06T00:34:44Z) - The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Robust Fine-Tuning of Deep Neural Networks with Hessian-based Generalization Guarantees [20.2407347618552]
We study the generalization properties of fine-tuning to understand the problem of overfitting.
We present an algorithm and a generalization error guarantee for this algorithm under a class conditional independent noise model.
arXiv Detail & Related papers (2022-06-06T14:52:46Z) - Error-Correcting Neural Networks for Two-Dimensional Curvature Computation in the Level-Set Method [0.0]
We present an error-neural-modeling-based strategy for approximating two-dimensional curvature in the level-set method.
Our main contribution is a redesigned hybrid solver that relies on numerical schemes to enable machine-learning operations on demand.
arXiv Detail & Related papers (2022-01-22T05:14:40Z) - High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide small objective residual with high probability.
Existing methods for non-smooth convex optimization have complexity bounds with dependence on confidence level.
We propose novel stepsize rules for two methods with gradient clipping.
arXiv Detail & Related papers (2021-06-10T17:54:21Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and the number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z) - Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training [2.9649783577150837]
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory.
We derive analytical expressions for the maximal descent and adaptive training regimens for smooth, non-Newton deep neural networks.
We validate our claims using VGG/ResNet architectures and the ImageNet dataset.
arXiv Detail & Related papers (2020-06-16T11:55:45Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate more stable and better-performing training of deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)