Reintroducing Straight-Through Estimators as Principled Methods for
Stochastic Binary Networks
- URL: http://arxiv.org/abs/2006.06880v4
- Date: Tue, 19 Oct 2021 14:45:41 GMT
- Title: Reintroducing Straight-Through Estimators as Principled Methods for
Stochastic Binary Networks
- Authors: Alexander Shekhovtsov, Viktor Yanush
- Abstract summary: Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and difficulty of optimization over discrete weights.
Many successful experimental results have been achieved with empirical straight-through (ST) approaches.
At the same time, ST methods can be truly derived as estimators in the stochastic binary network (SBN) model with Bernoulli weights.
- Score: 85.94999581306827
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Training neural networks with binary weights and activations is a challenging
problem due to the lack of gradients and difficulty of optimization over
discrete weights. Many successful experimental results have been achieved with
empirical straight-through (ST) approaches, proposing a variety of ad-hoc rules
for propagating gradients through non-differentiable activations and updating
discrete weights. At the same time, ST methods can be truly derived as
estimators in the stochastic binary network (SBN) model with Bernoulli weights.
We advance these derivations to a more complete and systematic study. We
analyze properties and estimation accuracy, obtain different forms of correct
ST estimators for activations and weights, explain existing empirical
approaches and their shortcomings, and explain how latent weights arise from
the mirror descent method when optimizing over probabilities. This allows us
to reintroduce ST methods, long known empirically, as sound approximations,
apply them with clarity, and develop further improvements.
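
For intuition, the following is a minimal sketch of the ST idea for stochastic binary units: the forward pass samples a binary value, while the backward pass propagates the gradient of its conditional mean. The class name, the specific backward rule through E[b | a] = tanh(a/2), and the latent-logit parameterization are illustrative assumptions, not the exact estimators derived in the paper.

```python
# Minimal sketch of a straight-through (ST) style estimator for a stochastic
# binary unit. Assumptions for this example (not the paper's exact rules):
# the forward pass samples b in {-1, +1} with P(b = +1) = sigmoid(a), and the
# backward pass propagates the gradient of the mean E[b | a] = tanh(a / 2).
import torch


class StochasticSignST(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a):
        p = torch.sigmoid(a)                 # P(b = +1 | a)
        b = 2.0 * torch.bernoulli(p) - 1.0   # single sample in {-1, +1}
        ctx.save_for_backward(p)
        return b

    @staticmethod
    def backward(ctx, grad_out):
        (p,) = ctx.saved_tensors
        # d E[b | a] / d a = d tanh(a/2) / d a = 2 * sigmoid'(a) = 2 p (1 - p)
        return grad_out * 2.0 * p * (1.0 - p)


# "Latent" real-valued weights: the optimizer updates the logits theta, while
# the forward pass uses sampled binary weights (the paper relates such latent
# weights to mirror descent over the weight probabilities).
theta = torch.zeros(8, requires_grad=True)   # latent weight parameters
x = torch.randn(4, 8)

w_b = StochasticSignST.apply(theta)          # sampled binary weights
h = StochasticSignST.apply(x @ w_b)          # sampled binary activations
loss = h.pow(2).mean()
loss.backward()                              # ST gradients reach theta
print(theta.grad)
```

A deterministic variant of the same pattern (sign in the forward pass, identity or clipped identity in the backward pass) recovers the common empirical ST rule.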
Related papers
- On Training Implicit Meta-Learning With Applications to Inductive
Weighing in Consistency Regularization [0.0]
Implicit meta-learning (IML) requires computing second-order gradients, in particular the Hessian.
Various approximations of the Hessian have been proposed, but a systematic comparison of their compute cost, stability, estimation accuracy, and the generalization of the solutions they find has been largely overlooked.
We show how a "Confidence Network" trained to extract domain-specific features can learn to up-weight useful images and down-weight out-of-distribution samples.
arXiv Detail & Related papers (2023-10-28T15:50:03Z) - Convergence of uncertainty estimates in Ensemble and Bayesian sparse
model discovery [4.446017969073817]
We show empirical success, in terms of accuracy and robustness to noise, of a bootstrapping-based sequential-thresholding least-squares estimator.
We show that this bootstrapping-based ensembling technique can perform a provably correct variable-selection procedure whose error rate converges at an exponential rate.
arXiv Detail & Related papers (2023-01-30T04:07:59Z) - On the Overlooked Structure of Stochastic Gradients [34.650998241703626]
We show that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and the gradient noise caused by minibatch training usually do not.
Our work challenges existing beliefs and provides novel insights into the structure of gradients in deep learning.
arXiv Detail & Related papers (2022-12-05T07:55:22Z) - Calibrated and Sharp Uncertainties in Deep Learning via Simple Density
Estimation [7.184701179854522]
This paper argues for reasoning about uncertainty in terms of these properties (calibration and sharpness) and proposes simple algorithms for enforcing them in deep learning.
Our methods focus on the strongest notion of calibration, distribution calibration, and enforce it by fitting a low-dimensional density or quantile function with a neural estimator.
Empirically, we find that our methods improve predictive uncertainties on several tasks with minimal computational and implementation overhead.
arXiv Detail & Related papers (2021-12-14T06:19:05Z) - Bias-Variance Tradeoffs in Single-Sample Binary Gradient Estimators [100.58924375509659]
The straight-through (ST) estimator has gained popularity due to its simplicity and efficiency.
Several techniques have been proposed to improve on ST while keeping the same low computational complexity.
We conduct a theoretical analysis of the bias and variance of these methods in order to understand the tradeoffs and verify the originally claimed properties; a toy numerical sketch of such a bias/variance measurement appears after this list.
arXiv Detail & Related papers (2021-10-07T15:16:07Z) - Training Generative Adversarial Networks by Solving Ordinary
Differential Equations [54.23691425062034]
We study the continuous-time dynamics induced by GAN training.
From this perspective, we hypothesise that instabilities in training GANs arise from the integration error.
We experimentally verify that well-known ODE solvers (such as Runge-Kutta) can stabilise training.
arXiv Detail & Related papers (2020-10-28T15:23:49Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of the stochasticity in its success remains unclear.
We show that multiplicative noise, which commonly arises from variance in the updates, induces heavy-tailed behaviour in the parameters.
A detailed analysis is conducted of key factors, including step size and data, with similar results observed on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
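
As mentioned in the bias-variance entry above, here is a toy numerical sketch of measuring the bias and variance of a single-sample binary gradient estimate. The quadratic loss and the particular ST rule (backpropagating dL/db at the sampled value through the sigmoid) are illustrative assumptions, not the analysis of that paper.

```python
# Toy sketch: bias and variance of a single-sample straight-through (ST)
# gradient for one Bernoulli unit b ~ Bernoulli(sigmoid(theta)) with loss
# L(b) = (b - t)^2. Loss and ST variant are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
theta, t, n = 0.3, 0.7, 200_000

p = 1.0 / (1.0 + np.exp(-theta))          # P(b = 1)
dp = p * (1.0 - p)                        # d p / d theta

# Exact gradient of E[L] = p * L(1) + (1 - p) * L(0) w.r.t. theta.
exact = ((1.0 - t) ** 2 - (0.0 - t) ** 2) * dp

# Single-sample ST estimate: take dL/db at the sampled b, then backpropagate
# through the sigmoid as if the sampling step were the identity.
b = (rng.random(n) < p).astype(float)
g_st = 2.0 * (b - t) * dp

print("exact grad:", exact)
print("ST mean   :", g_st.mean(), "(bias =", g_st.mean() - exact, ")")
print("ST var    :", g_st.var())
```

Running it shows a nonzero bias for this nonlinear loss, illustrating the kind of tradeoff such an analysis quantifies.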