Training Quantised Neural Networks with STE Variants: the Additive Noise
Annealing Algorithm
- URL: http://arxiv.org/abs/2203.11323v1
- Date: Mon, 21 Mar 2022 20:14:27 GMT
- Title: Training Quantised Neural Networks with STE Variants: the Additive Noise
Annealing Algorithm
- Authors: Matteo Spallanzani, Gian Paolo Leonardi, Luca Benini
- Abstract summary: Training quantised neural networks (QNNs) is a non-differentiable problem since weights and features are output by piecewise constant functions.
The standard solution is to apply the straight-through estimator (STE), using different functions during the inference and gradient computation steps.
Several STE variants have been proposed in the literature aiming to maximise the task accuracy of the trained network.
- Score: 16.340620299847384
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Training quantised neural networks (QNNs) is a non-differentiable
optimisation problem since weights and features are output by piecewise
constant functions. The standard solution is to apply the straight-through
estimator (STE), using different functions during the inference and gradient
computation steps. Several STE variants have been proposed in the literature
aiming to maximise the task accuracy of the trained network. In this paper, we
analyse STE variants and study their impact on QNN training. We first observe
that most such variants can be modelled as stochastic regularisations of stair
functions; although this intuitive interpretation is not new, our rigorous
discussion generalises to further variants. Then, we analyse QNNs mixing
different regularisations, finding that some suitably synchronised smoothing of
each layer map is required to guarantee pointwise compositional convergence to
the target discontinuous function. Based on these theoretical insights, we
propose additive noise annealing (ANA), a new algorithm to train QNNs
encompassing standard STE and its variants as special cases. When testing ANA
on the CIFAR-10 image classification benchmark, we find that the major impact
on task accuracy is not due to the qualitative shape of the regularisations but
to the proper synchronisation of the different STE variants used in a network,
in accordance with the theoretical results.
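To make the mechanism concrete, here is a minimal sketch, assuming PyTorch, of a straight-through quantiser combined with additive-noise smoothing whose amplitude is annealed during training. The class names (`RoundSTE`, `AnnealedNoisyQuantizer`), the uniform-noise choice, and the geometric annealing schedule are illustrative placeholders, not the paper's reference implementation.

```python
# Minimal sketch (not the authors' reference code): a straight-through
# quantiser plus a toy additive-noise-annealing wrapper, assuming PyTorch.
import torch


class RoundSTE(torch.autograd.Function):
    """Round to the nearest integer in the forward pass; pass the
    gradient through unchanged in the backward pass (standard STE)."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Identity surrogate gradient: the defining trait of the STE.
        return grad_output


class AnnealedNoisyQuantizer(torch.nn.Module):
    """Illustrative ANA-style layer: during training, additive uniform
    noise stochastically smooths the stair function; the amplitude
    `sigma` is annealed towards zero so the layer converges pointwise
    to the hard quantiser. The schedule below is a placeholder
    assumption, not the schedule from the paper."""

    def __init__(self, sigma0=0.5):
        super().__init__()
        self.register_buffer("sigma", torch.tensor(sigma0))

    def forward(self, x):
        if self.training and self.sigma > 0:
            noise = (torch.rand_like(x) - 0.5) * 2.0 * self.sigma
            x = x + noise
        return RoundSTE.apply(x)

    def anneal(self, factor=0.9):
        # Shrink the noise amplitude, e.g. once per epoch.
        self.sigma.mul_(factor)


if __name__ == "__main__":
    q = AnnealedNoisyQuantizer()
    x = torch.linspace(-2, 2, 9, requires_grad=True)
    q(x).sum().backward()
    print(x.grad)  # all ones: the gradient passed straight through
```

In an ANA-style run, `anneal()` would be called on a schedule (for example once per epoch) so that the stochastic smoothing of every quantised layer shrinks in a synchronised way, matching the paper's observation that synchronisation across layers, rather than the particular shape of the regularisation, drives task accuracy.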
Related papers
- Quantification using Permutation-Invariant Networks based on Histograms [47.47360392729245]
Quantification is the supervised learning task in which a model is trained to predict the prevalence of each class in a given bag of examples.
This paper investigates the application of deep neural networks to tasks of quantification in scenarios where it is possible to apply a symmetric supervised approach.
We propose HistNetQ, a novel neural architecture that relies on a permutation-invariant representation based on histograms.
arXiv Detail & Related papers (2024-03-22T11:25:38Z)
- Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization [12.812942188697326]
Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness.
A key component of these models is to learn the score function through score matching.
Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with provable accuracy.
arXiv Detail & Related papers (2024-01-28T08:13:56Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Analyzing Convergence in Quantum Neural Networks: Deviations from Neural Tangent Kernels [20.53302002578558]
A quantum neural network (QNN) is a parameterized mapping efficiently implementable on near-term Noisy Intermediate-Scale Quantum (NISQ) computers.
Despite the existing empirical and theoretical investigations, the convergence of QNN training is not fully understood.
arXiv Detail & Related papers (2023-03-26T22:58:06Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- AskewSGD : An Annealed interval-constrained Optimisation method to train Quantized Neural Networks [12.229154524476405]
We develop a new algorithm, Annealed Skewed SGD - AskewSGD - for training deep neural networks (DNNs) with quantized weights.
Unlike algorithms with active sets and feasible directions, AskewSGD avoids projections or optimization under the entire feasible set.
Experimental results show that the AskewSGD algorithm performs better than, or on par with, state-of-the-art methods on classical benchmarks.
arXiv Detail & Related papers (2022-11-07T18:13:44Z)
- Where Should We Begin? A Low-Level Exploration of Weight Initialization Impact on Quantized Behaviour of Deep Neural Networks [93.4221402881609]
We present an in-depth, fine-grained ablation study of the effect of different weight initializations on the final distributions of weights and activations of different CNN architectures.
To the best of our knowledge, we are the first to perform such a low-level, in-depth quantitative analysis of weight initialization and its effect on quantized behaviour.
arXiv Detail & Related papers (2020-11-30T06:54:28Z)
- Analytical aspects of non-differentiable neural networks [0.0]
We discuss the expressivity of quantized neural networks and approximation techniques for non-differentiable networks.
We show that QNNs have the same expressivity as DNNs in terms of approximation of Lipschitz functions in the $L^\infty$ norm.
We also consider networks defined by means of Heaviside-type activation functions, and prove for them a pointwise approximation result by means of smooth networks.
arXiv Detail & Related papers (2020-11-03T17:20:43Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)