One-Step Forward and Backtrack: Overcoming Zig-Zagging in Loss-Aware
Quantization Training
- URL: http://arxiv.org/abs/2401.16760v1
- Date: Tue, 30 Jan 2024 05:42:54 GMT
- Title: One-Step Forward and Backtrack: Overcoming Zig-Zagging in Loss-Aware
Quantization Training
- Authors: Lianbo Ma, Yuee Zhou, Jianlun Ma, Guo Yu, Qing Li
- Abstract summary: Weight quantization is an effective technique to compress deep neural networks for their deployment on edge devices with limited resources.
Traditional loss-aware quantization methods commonly use the quantized gradient to replace the full-precision gradient.
This paper proposes a one-step-forward-and-backtrack scheme for loss-aware quantization to obtain a more accurate and stable gradient direction.
- Score: 12.400950982075948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weight quantization is an effective technique to compress deep neural
networks for deployment on edge devices with limited resources.
Traditional loss-aware quantization methods commonly use the quantized gradient
to replace the full-precision gradient. However, we discover that this gradient
error leads to an unexpected zig-zagging issue in the gradient descent
learning procedure, where the gradient directions rapidly oscillate or
zig-zag, and this issue seriously slows down model convergence.
Accordingly, this paper proposes a one-step-forward-and-backtrack scheme for
loss-aware quantization that obtains a more accurate and stable gradient
direction to counter this issue. During gradient descent learning, a one-step
forward search is designed to find the trial gradient of the next step, which is
adopted to adjust the gradient of the current step towards the direction of fast
convergence. After that, we backtrack to the current step and update the
full-precision and quantized weights using both the current-step gradient and the
trial gradient. A series of theoretical analyses and experiments on benchmark
deep models demonstrate the effectiveness and competitiveness of the
proposed method; in particular, our method outperforms others in
convergence performance.
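The abstract's forward-search-then-backtrack loop can be sketched in code. This is a minimal illustration, not the paper's implementation: the uniform symmetric quantizer, the fixed mixing weight `beta`, and the toy quadratic loss are all assumptions introduced here.

```python
import numpy as np

def quantize(w, num_bits=2):
    # Simple uniform symmetric quantizer (a stand-in for the paper's scheme).
    scale = np.max(np.abs(w)) / (2 ** (num_bits - 1) - 1) + 1e-12
    return np.round(w / scale) * scale

def forward_backtrack_step(w_fp, grad_fn, lr=0.1, beta=0.5, num_bits=2):
    """One-step-forward-and-backtrack update (illustrative sketch).

    w_fp    : full-precision weights
    grad_fn : returns the loss gradient evaluated at quantized weights
    beta    : mixing weight between current and trial gradients (assumed)
    """
    g_cur = grad_fn(quantize(w_fp, num_bits))       # current-step gradient
    w_trial = w_fp - lr * g_cur                     # one-step forward search
    g_trial = grad_fn(quantize(w_trial, num_bits))  # trial gradient of the next step
    g_mix = (1 - beta) * g_cur + beta * g_trial     # stabilized direction
    w_new = w_fp - lr * g_mix                       # backtrack: redo the step
    return w_new, quantize(w_new, num_bits)

# Toy quadratic loss L(w) = 0.5 * ||w||^2, so the gradient is w itself.
w = np.array([1.0, -0.8])
for _ in range(50):
    w, w_q = forward_backtrack_step(w, lambda wq: wq)
```

Averaging the current and trial gradients damps the direction changes between consecutive steps, which is the intuition behind suppressing the zig-zagging.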
Related papers
- Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients [24.973203825917906]
We show that lowering the error for large-magnitude gradients boosts the quantization performance significantly.
We also introduce an interval update algorithm that adjusts the quantization interval adaptively to maintain a small quantization error for large gradients.
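An adaptive interval update of the kind this summary describes might look like the following sketch. The quantile-tracking rule and all parameter names are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def quantize_grad(g, interval, num_bits=4):
    # Clip to [-interval, interval], then apply uniform INT4-style quantization.
    levels = 2 ** (num_bits - 1) - 1
    g_clip = np.clip(g, -interval, interval)
    step = interval / levels
    return np.round(g_clip / step) * step

def update_interval(g, interval, target_ratio=0.99, rate=0.01):
    """Adaptively move the clipping interval toward the scale of the
    large-magnitude gradients (illustrative rule, not the paper's exact one)."""
    big = np.quantile(np.abs(g), target_ratio)
    return interval + rate * (big - interval)

rng = np.random.default_rng(0)
g = rng.normal(size=1000)
interval = 1.0
for _ in range(2000):
    interval = update_interval(g, interval)
q = quantize_grad(g, interval)
```

Tracking a high quantile keeps the interval wide enough that large-magnitude gradients incur a small clipping error, while not wasting quantization levels on outliers.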
arXiv Detail & Related papers (2024-07-17T15:06:12Z) - Point Cloud Denoising via Momentum Ascent in Gradient Fields [72.93429911044903]
A gradient-based method was proposed to estimate gradient fields from noisy point clouds using neural networks.
We develop a momentum gradient ascent method that leverages the information of previous iterations in determining the trajectories of the points.
Experiments demonstrate that the proposed method outperforms state-of-the-art approaches with a variety of point clouds, noise types, and noise levels.
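The momentum ascent idea can be sketched on a toy gradient field. In the real method the field is estimated by a neural network; here a hand-written field whose density peaks at the origin stands in for it, and the step sizes are assumptions.

```python
import numpy as np

def momentum_ascent_denoise(points, grad_field, steps=50, lr=0.1, mu=0.8):
    """Move noisy points along an estimated gradient field using momentum
    ascent (sketch; the original work learns the field with a network)."""
    v = np.zeros_like(points)
    for _ in range(steps):
        g = grad_field(points)  # points climb toward high-density regions
        v = mu * v + g          # momentum accumulates past directions
        points = points + lr * v
    return points

# Toy field: density peaks at 0, so the ascent direction points to the origin.
noisy = np.array([0.5, -0.3, 0.8])
clean = momentum_ascent_denoise(noisy, lambda p: -p)
```

The momentum term reuses information from previous iterations, which is the summary's point about determining the trajectories of the points.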
arXiv Detail & Related papers (2022-02-21T10:21:40Z) - On Training Implicit Models [75.20173180996501]
We propose a novel gradient estimate for implicit models, named phantom gradient, that forgoes the costly computation of the exact gradient.
Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times.
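One way to see why a cheap implicit-gradient estimate helps: the exact gradient requires solving against (I - J), which a truncated series can approximate. The sketch below uses a truncated Neumann series on a toy linear map; the paper's phantom gradient is a related damped-unrolling estimate, not this exact formula.

```python
import numpy as np

def neumann_implicit_grad(J_vp, v, terms=20):
    """Approximate (I - J)^{-T} v with a truncated Neumann series,
    avoiding the costly exact solve used in implicit differentiation."""
    out = np.zeros_like(v)
    cur = v.copy()
    for _ in range(terms):
        out += cur
        cur = J_vp(cur)  # apply the (transposed) Jacobian once more
    return out

# Toy: J = 0.5 * I, so the exact answer is (I - J)^{-T} v = 2 * v.
v = np.array([1.0, 2.0])
approx = neumann_implicit_grad(lambda u: 0.5 * u, v)
```

Each term costs one Jacobian-vector product, so truncating the series trades accuracy for a much cheaper backward pass.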
arXiv Detail & Related papers (2021-11-09T14:40:24Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum improves optimization on vision tasks and consistently achieves state-of-the-art results on other tasks, including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Scaling transition from momentum stochastic gradient descent to plain
stochastic gradient descent [1.7874193862154875]
The momentum gradient descent uses the accumulated gradient as the updated direction of the current parameters.
Plain gradient descent, by contrast, is not corrected by the accumulated gradient.
The TSGD algorithm has faster training speed, higher accuracy and better stability.
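A scaling transition from momentum SGD to plain SGD can be sketched by blending the two update directions with a factor that decays over iterations. The schedule and hyperparameters below are assumptions for illustration, not TSGD's exact rule.

```python
import numpy as np

def transition_step(w, grad, state, t, lr=0.05, mu=0.9, decay=0.05):
    """Start like momentum SGD and decay toward plain SGD.
    alpha(t) is the transition factor (the schedule is an assumption)."""
    state["v"] = mu * state["v"] + grad               # accumulated direction
    alpha = 1.0 / (1.0 + decay * t)                   # transition factor -> 0
    direction = alpha * state["v"] + (1 - alpha) * grad
    return w - lr * direction

# Toy quadratic loss: the gradient is w itself.
w = np.array([2.0, -1.5])
state = {"v": np.zeros_like(w)}
for t in range(200):
    w = transition_step(w, w, state, t)
```

Early iterations benefit from momentum's fast progress; late iterations avoid momentum's overshooting near the optimum by falling back to the plain gradient.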
arXiv Detail & Related papers (2021-06-12T11:42:04Z) - Decreasing scaling transition from adaptive gradient descent to
stochastic gradient descent [1.7874193862154875]
We propose a decreasing scaling transition from adaptive gradient descent to gradient descent method DSTAda.
Our experimental results show that DSTAda has a faster speed, higher accuracy, and better stability and robustness.
arXiv Detail & Related papers (2021-06-12T11:28:58Z) - Channel-Directed Gradients for Optimization of Convolutional Neural
Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z) - Variance Reduction for Deep Q-Learning using Stochastic Recursive
Gradient [51.880464915253924]
Deep Q-learning algorithms often suffer from poor gradient estimations with an excessive variance.
This paper introduces the framework for updating the gradient estimates in deep Q-learning, achieving a novel algorithm called SRG-DQN.
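The stochastic recursive gradient estimator at the core of such methods (a SARAH-style update) can be sketched on a toy least-squares problem. The deep Q-learning setting is omitted; everything below is an illustrative stand-in.

```python
import numpy as np

# Toy objective: minimize the mean of 0.5 * (w - a_i)^2 over 20 samples.
rng = np.random.default_rng(0)
a = rng.normal(size=20)
grad_i = lambda w, i: w - a[i]  # per-sample gradient

w, lr = 5.0, 0.1
for epoch in range(30):
    # Outer step: anchor the estimate with a full-batch gradient.
    v = np.mean([grad_i(w, i) for i in range(20)])
    w_prev, w = w, w - lr * v
    for _ in range(10):
        # Recursive update: v_t = g_i(w_t) - g_i(w_{t-1}) + v_{t-1}.
        i = int(rng.integers(20))
        v = grad_i(w, i) - grad_i(w_prev, i) + v
        w_prev, w = w, w - lr * v
```

On this quadratic toy the per-sample curvatures are identical, so the recursive estimator is exact and the iterate converges like full-batch gradient descent; in general it merely reduces the variance of the gradient estimate.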
arXiv Detail & Related papers (2020-07-25T00:54:20Z) - Regularizing Meta-Learning via Gradient Dropout [102.29924160341572]
meta-learning models are prone to overfitting when there are no sufficient training tasks for the meta-learners to generalize.
We introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning.
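Dropping gradient entries at random can be sketched as below. This is only the core masking idea; how the original work samples the mask and where in the inner loop it is applied are not reproduced here.

```python
import numpy as np

def gradient_dropout(grad, drop_rate=0.2, rng=None):
    """Zero out a random subset of gradient entries (regularizer sketch;
    the original applies this to inner-loop gradients in meta-learning)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(grad.shape) >= drop_rate
    return grad * mask  # surviving entries keep their values

rng = np.random.default_rng(0)
g = np.ones(10000)
g_dropped = gradient_dropout(g, drop_rate=0.2, rng=rng)
```

Randomizing which coordinates receive an update perturbs the inner-loop adaptation, which discourages the meta-learner from overfitting to the limited set of training tasks.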
arXiv Detail & Related papers (2020-04-13T10:47:02Z) - Explore Aggressively, Update Conservatively: Stochastic Extragradient
Methods with Variable Stepsize Scaling [34.35013145885164]
Extragradient methods have become a staple for solving large-scale saddlepoint problems in machine learning.
We show in this paper that running vanilla extragradient with stochastic gradients may jeopardize its convergence, even in simple bilinear models.
We show that a variable stepsize scaling modification allows the method to converge even with stochastic gradients, and we derive sharp convergence rates under an error bound condition.
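The basic (deterministic) extragradient step can be sketched on the classic bilinear saddle point min_x max_y xy, where simultaneous gradient descent-ascent famously diverges but the extrapolation step restores convergence. The stepsize is an illustrative choice.

```python
def extragradient_step(x, y, lr=0.1):
    """One extragradient step for the bilinear saddle point min_x max_y x*y."""
    # Extrapolation: look one gradient step ahead.
    x_half = x - lr * y   # d/dx (x*y) = y
    y_half = y + lr * x   # d/dy (x*y) = x
    # Update the actual iterate using gradients at the extrapolated point.
    x_new = x - lr * y_half
    y_new = y + lr * x_half
    return x_new, y_new

x, y = 1.0, 1.0
for _ in range(500):
    x, y = extragradient_step(x, y)
```

Plain descent-ascent multiplies the distance to the saddle by (1 + lr^2) per step and spirals outward; the extrapolated update contracts it instead, which is why extragradient is a staple for saddle-point problems.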
arXiv Detail & Related papers (2020-03-23T10:24:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.