Effective Gradient Sample Size via Variation Estimation for Accelerating Sharpness-aware Minimization
- URL: http://arxiv.org/abs/2403.08821v1
- Date: Sat, 24 Feb 2024 05:48:05 GMT
- Title: Effective Gradient Sample Size via Variation Estimation for Accelerating Sharpness-aware Minimization
- Authors: Jiaxin Deng, Junbiao Pang, Baochang Zhang, Tian Wang
- Abstract summary: Sharpness-aware Minimization (SAM) has been proposed recently to improve model generalization ability.
SAM calculates the gradient twice in each optimization step, thereby doubling the computation costs.
We propose a simple yet efficient sampling method to significantly accelerate SAM.
- Score: 19.469113881229646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sharpness-aware Minimization (SAM) has been proposed recently to improve model generalization ability. However, SAM calculates the gradient twice in each optimization step, thereby doubling the computation costs compared to stochastic gradient descent (SGD). In this paper, we propose a simple yet efficient sampling method to significantly accelerate SAM. Concretely, we discover that the gradient of SAM is a combination of the gradient of SGD and the Projection of the Second-order gradient matrix onto the First-order gradient (PSF). PSF exhibits a gradually increasing frequency of change during the training process. To leverage this observation, we propose an adaptive sampling method based on the variation of PSF, and we reuse the sampled PSF for non-sampling iterations. Extensive empirical results illustrate that the proposed method achieves state-of-the-art accuracies comparable to SAM on diverse network architectures.
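The abstract describes the mechanism only at a high level, so the following PyTorch-style sketch illustrates one way to read it: a SAM step computes the SGD gradient, perturbs the weights along it, and recomputes the gradient at the perturbed point; the difference between the two gradients plays the role of the PSF term, which can be cached and reused on iterations where the second pass is skipped. This is a minimal illustrative sketch under assumptions of my own, not the authors' implementation: the function name, the `state` cache, the `resample` flag, and the value of `rho` are hypothetical, and the paper's adaptive criterion for deciding when to resample the PSF is not reproduced.

```python
import torch

def sam_step_with_psf_reuse(model, loss_fn, batch, opt, state, rho=0.05, resample=True):
    """One training step under the reading g_SAM = g_SGD + PSF.

    When `resample` is True, the usual two-pass SAM gradient is computed and
    PSF = g_SAM - g_SGD is cached in `state`; when False, the cached PSF is
    reused so only one forward/backward pass is needed (illustrative only).
    The first call must use resample=True so that a PSF is available.
    """
    x, y = batch
    params = [p for p in model.parameters() if p.requires_grad]

    # First pass: plain SGD gradient g_SGD on the current minibatch.
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    g_sgd = [p.grad.detach().clone() for p in params]

    if resample:
        # SAM perturbation: w <- w + rho * g_SGD / ||g_SGD||.
        grad_norm = torch.sqrt(sum((g ** 2).sum() for g in g_sgd))
        scale = rho / (grad_norm.item() + 1e-12)
        with torch.no_grad():
            for p, g in zip(params, g_sgd):
                p.add_(g, alpha=scale)
        # Second pass: gradient at the perturbed point, i.e. g_SAM.
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        with torch.no_grad():
            for p, g in zip(params, g_sgd):
                p.sub_(g, alpha=scale)  # undo the perturbation
            # Cache PSF = g_SAM - g_SGD for reuse in later iterations.
            state["psf"] = [p.grad.detach().clone() - g for p, g in zip(params, g_sgd)]

    # Update with g_SGD + PSF (exactly the SAM gradient when resample=True).
    with torch.no_grad():
        for p, g, psf in zip(params, g_sgd, state["psf"]):
            p.grad.copy_(g + psf)
    opt.step()
```

In the paper, the resampling decision is driven by the estimated variation of the PSF; for experimentation, any schedule (for example, resampling every k iterations) could be plugged into the `resample` argument.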
Related papers
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and converges faster than standard ZO approaches (a generic zeroth-order gradient estimate is sketched after this list).
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization [17.670203551488218]
We propose Asymptotic Unbiased Sampling to accelerate Sharpness-Aware Minimization (AUSAM)
AUSAM maintains the model's generalization capacity while significantly enhancing computational efficiency.
As a plug-and-play, architecture-agnostic method, our approach consistently accelerates SAM across a range of tasks and networks.
arXiv Detail & Related papers (2024-06-12T08:47:44Z) - Friendly Sharpness-Aware Minimization [62.57515991835801]
Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both training loss and loss sharpness.
We investigate the key role of batch-specific gradient noise within the adversarial perturbation, i.e., the current minibatch gradient.
By decomposing the adversarial perturbation into full-gradient and gradient-noise components, we discover that relying solely on the full gradient degrades generalization, while excluding it leads to improved performance.
arXiv Detail & Related papers (2024-03-19T01:39:33Z) - Data Pruning via Moving-one-Sample-out [61.45441981346064]
We propose a novel data-pruning approach called moving-one-sample-out (MoSo)
MoSo aims to identify and remove the least informative samples from the training set.
Experimental results demonstrate that MoSo effectively mitigates severe performance degradation at high pruning ratios.
arXiv Detail & Related papers (2023-10-23T08:00:03Z) - Quantum Shadow Gradient Descent for Variational Quantum Algorithms [14.286227676294034]
Gradient-based methods have been proposed for training variational quantum circuits in quantum neural networks (QNNs).
The task of gradient estimation has proven to be challenging due to distinctive quantum features such as state collapse and measurement incompatibility.
We develop a novel procedure called quantum shadow gradient descent that uses a single sample per iteration to estimate all components of the gradient.
arXiv Detail & Related papers (2023-10-10T18:45:43Z) - AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks [76.90477930208982]
Sharpness-aware minimization (SAM) has been extensively explored as it can generalize better when training deep neural networks.
Integrating SAM with adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored.
We conduct experiments on several NLP tasks, which show that AdaSAM achieves superior performance compared with SGD, AMSGrad, and SAM.
arXiv Detail & Related papers (2023-03-01T15:12:42Z) - Optimizing DDPM Sampling with Shortcut Fine-Tuning [16.137936204766692]
Shortcut Fine-Tuning (SFT) is a new approach for addressing the challenge of fast sampling of pretrained Denoising Diffusion Probabilistic Models (DDPMs)
SFT advocates for the fine-tuning of DDPM samplers through the direct minimization of Integral Probability Metrics (IPM)
Inspired by a control perspective, we propose a new algorithm SFT-PG: Shortcut Fine-Tuning with Policy Gradient.
arXiv Detail & Related papers (2023-01-31T01:37:48Z) - Preferential Subsampling for Stochastic Gradient Langevin Dynamics [3.158346511479111]
Stochastic gradient MCMC offers an unbiased estimate of the gradient of the log-posterior computed from a small, uniformly-weighted subsample of the data.
The resulting gradient estimator may exhibit a high variance and impact sampler performance.
We demonstrate that such an approach can maintain the same level of accuracy while substantially reducing the average subsample size that is used.
arXiv Detail & Related papers (2022-10-28T14:56:18Z) - Rethinking Sharpness-Aware Minimization as Variational Inference [1.749935196721634]
Sharpness-aware minimization (SAM) aims to improve the generalisation of gradient-based learning by seeking out flat minima.
We establish connections between SAM and Mean-Field Variational Inference (MFVI) of neural network parameters.
arXiv Detail & Related papers (2022-10-19T10:35:54Z) - Towards Efficient and Scalable Sharpness-Aware Minimization [81.22779501753695]
We propose a novel algorithm LookSAM that only periodically calculates the inner gradient ascent.
LookSAM achieves similar accuracy gains to SAM while being tremendously faster.
We are the first to successfully scale up the batch size when training Vision Transformers (ViTs)
arXiv Detail & Related papers (2022-03-05T11:53:37Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
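As referenced in the zeroth-order fine-tuning entry above, the sketch below shows a generic two-point (SPSA-style) gradient estimate, illustrating why such methods need only forward passes and no backpropagation memory. It is a textbook construction written under my own assumptions, not SubZero itself: it omits SubZero's random-subspace projection, and the function name and `closure` interface are hypothetical.

```python
import torch

def zo_grad_estimate(closure, params, mu=1e-3):
    """Two-point zeroth-order gradient estimate along one random direction.

    `closure()` evaluates the loss at the current parameter values without
    computing gradients; only two forward passes are needed, so no
    activations are stored for backpropagation (illustrative sketch only).
    """
    with torch.no_grad():
        u = [torch.randn_like(p) for p in params]      # random direction
        for p, d in zip(params, u):                     # w + mu * u
            p.add_(d, alpha=mu)
        loss_plus = closure()
        for p, d in zip(params, u):                     # w - mu * u
            p.add_(d, alpha=-2.0 * mu)
        loss_minus = closure()
        for p, d in zip(params, u):                     # restore w
            p.add_(d, alpha=mu)
        scale = (loss_plus - loss_minus) / (2.0 * mu)   # directional-derivative estimate
        return [scale * d for d in u]                   # gradient estimate
```

Averaging estimates over several random directions reduces the variance of the estimator at the cost of additional forward passes.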
This list is automatically generated from the titles and abstracts of the papers on this site.