Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks
- URL: http://arxiv.org/abs/2006.03143v2
- Date: Wed, 4 Nov 2020 14:23:51 GMT
- Title: Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks
- Authors: Alexander Shekhovtsov, Viktor Yanush, Boris Flach
- Abstract summary: In neural networks with binary activations and or binary weights the training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
- Score: 78.76880041670904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In neural networks with binary activations and or binary weights the training
by gradient descent is complicated as the model has piecewise constant
response. We consider stochastic binary networks, obtained by adding noises in
front of activations. The expected model response becomes a smooth function of
parameters, its gradient is well defined but it is challenging to estimate it
accurately. We propose a new method for this estimation problem combining
sampling and analytic approximation steps. The method has a significantly
reduced variance at the price of a small bias which gives a very practical
tradeoff in comparison with existing unbiased and biased estimators. We further
show that one extra linearization step leads to a deep straight-through
estimator previously known only as an ad-hoc heuristic. We experimentally show
higher accuracy in gradient estimation and demonstrate a more stable and better
performing training in deep convolutional models with both proposed methods.
Related papers
- Epistemic Uncertainty and Observation Noise with the Neural Tangent Kernel [12.464924018243988]
Recent work has shown that training wide neural networks with gradient descent is formally equivalent to computing the mean of the posterior distribution in a Gaussian Process.
We show how to deal with non-zero aleatoric noise and derive an estimator for the posterior covariance.
arXiv Detail & Related papers (2024-09-06T00:34:44Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when emphdone right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective.
We introduce a emphstochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Sampling from Gaussian Process Posteriors using Stochastic Gradient
Descent [43.097493761380186]
gradient algorithms are an efficient method of approximately solving linear systems.
We show that gradient descent produces accurate predictions, even in cases where it does not converge quickly to the optimum.
Experimentally, gradient descent achieves state-of-the-art performance on sufficiently large-scale or ill-conditioned regression tasks.
arXiv Detail & Related papers (2023-06-20T15:07:37Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that leakyally, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On the influence of roundoff errors on the convergence of the gradient
descent method with low-precision floating-point computation [0.0]
We propose a new rounding scheme that trades the zero bias property with a larger probability to preserve small gradients.
Our method yields constant rounding bias that, at each iteration, lies in a descent direction.
arXiv Detail & Related papers (2022-02-24T18:18:20Z) - Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-perfect solutions to non-optimal training problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z) - A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z) - Mean-Field Approximation to Gaussian-Softmax Integral with Application
to Uncertainty Estimation [23.38076756988258]
We propose a new single-model based approach to quantify uncertainty in deep neural networks.
We use a mean-field approximation formula to compute an analytically intractable integral.
Empirically, the proposed approach performs competitively when compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-06-13T07:32:38Z) - Reintroducing Straight-Through Estimators as Principled Methods for
Stochastic Binary Networks [85.94999581306827]
Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and difficulty of optimization over discrete weights.
Many successful experimental results have been achieved with empirical straight-through (ST) approaches.
At the same time, ST methods can be truly derived as estimators in the binary network (SBN) model with Bernoulli weights.
arXiv Detail & Related papers (2020-06-11T23:58:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.