Bridging Discrete and Backpropagation: Straight-Through and Beyond
- URL: http://arxiv.org/abs/2304.08612v3
- Date: Mon, 16 Oct 2023 16:27:02 GMT
- Title: Bridging Discrete and Backpropagation: Straight-Through and Beyond
- Authors: Liyuan Liu, Chengyu Dong, Xiaodong Liu, Bin Yu, Jianfeng Gao
- Abstract summary: We propose a novel approach to approximate the gradient of parameters involved in generating discrete latent variables.
We propose ReinMax, which achieves second-order accuracy by integrating Heun's method, a second-order numerical method for solving ODEs.
- Score: 62.46558842476455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backpropagation, the cornerstone of deep learning, is limited to computing
gradients for continuous variables. This limitation poses challenges for
problems involving discrete latent variables. To address this issue, we propose
a novel approach to approximate the gradient of parameters involved in
generating discrete latent variables. First, we examine the widely used
Straight-Through (ST) heuristic and demonstrate that it works as a first-order
approximation of the gradient. Guided by our findings, we propose ReinMax,
which achieves second-order accuracy by integrating Heun's method, a
second-order numerical method for solving ODEs. ReinMax does not require
Hessian or other second-order derivatives, thus having negligible computation
overheads. Extensive experimental results on various tasks demonstrate the
superiority of ReinMax over the state of the art. Implementations are released
at https://github.com/microsoft/ReinMax.
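The Straight-Through heuristic that the paper analyzes as a first-order gradient approximation can be written in a few lines. Below is a minimal, illustrative PyTorch sketch of that standard estimator, together with generic Euler and Heun update steps that convey the first-order versus second-order distinction the abstract draws. This is not the ReinMax estimator itself (its reference implementation is in the repository linked above), and the function names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def straight_through_sample(logits: torch.Tensor) -> torch.Tensor:
    """Standard Straight-Through (ST) estimator: a hard one-hot sample is used
    in the forward pass, while gradients flow through the softmax."""
    probs = F.softmax(logits, dim=-1)                    # relaxed distribution
    index = torch.multinomial(probs, num_samples=1)      # discrete sample per row
    hard = torch.zeros_like(probs).scatter_(-1, index, 1.0)
    # (hard - probs).detach() + probs equals `hard` numerically, but its
    # gradient w.r.t. `logits` is the softmax Jacobian (the ST surrogate).
    return (hard - probs).detach() + probs

# Generic one-step integrators for y' = f(y), included only to illustrate the
# first-order (Euler) vs. second-order (Heun) accuracy the abstract refers to.
def euler_step(f, y, h):
    return y + h * f(y)                                  # first-order accurate

def heun_step(f, y, h):
    y_pred = y + h * f(y)                                # Euler predictor
    return y + 0.5 * h * (f(y) + f(y_pred))              # trapezoidal corrector
```

Calling `straight_through_sample` on a `(batch, num_classes)` tensor of logits returns exact one-hot vectors whose backward pass reaches `logits`; ReinMax keeps this discrete forward pass but corrects the backward pass in the spirit of `heun_step`, without requiring Hessian information.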
Related papers
- Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity [50.25258834153574]
We focus on the class of (strongly) convex $(L_0,L_1)$-smooth functions and derive new convergence guarantees for several existing methods.
In particular, we derive improved convergence rates for Gradient Descent with Smoothed Gradient Clipping and for Gradient Descent with Polyak Stepsizes.
arXiv Detail & Related papers (2024-09-23T13:11:37Z) - Exact, Fast and Expressive Poisson Point Processes via Squared Neural
Families [23.337256081314518]
We introduce squared neural Poisson point processes (SNEPPPs) by parameterising the intensity function as the squared norm of a two-layer neural network.
When the hidden layer is fixed and the second layer has a single neuron, our approach resembles previous uses of squared Gaussian process or kernel methods.
We demonstrate SNEPPPs on real and synthetic benchmarks, and provide a software implementation.
arXiv Detail & Related papers (2024-02-14T22:32:00Z) - HOUDINI: Escaping from Moderately Constrained Saddles [14.277428617774875]
We show that (noisy) gradient descent methods can escape from saddle points under a logarithmic number of inequality constraints.
Our results hold for both regular and stochastic gradient descent.
arXiv Detail & Related papers (2022-05-27T03:36:27Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic Gradient Langevin Dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we study two variance-reduced variants, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - COCO Denoiser: Using Co-Coercivity for Variance Reduction in Stochastic
Convex Optimization [4.970364068620608]
We exploit convexity and L-smoothness to improve the noisy estimates output by the gradient oracle.
We show that increasing the number and proximity of the queried points leads to better gradient estimates.
We also apply COCO in vanilla settings by plugging it into existing algorithms, such as SGD, Adam, or STRSAGA.
arXiv Detail & Related papers (2021-09-07T17:21:09Z) - Robust High Dimensional Expectation Maximization Algorithm via Trimmed
Hard Thresholding [24.184520829631587]
We study the problem of estimating latent variable models with arbitrarily corrupted samples in high dimensional space.
We propose a method called Trimmed (Gradient) Expectation Maximization, which adds a gradient trimming step.
We show that the algorithm is robust to corruptions and converges geometrically to the (near-)optimal statistical rate.
arXiv Detail & Related papers (2020-10-19T15:00:35Z) - Exploiting Higher Order Smoothness in Derivative-free Optimization and
Continuous Bandits [99.70167985955352]
We study the problem of zero-order optimization of a strongly convex function.
We consider a randomized approximation of the projected gradient descent algorithm.
Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters.
arXiv Detail & Related papers (2020-06-14T10:42:23Z) - Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z) - SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for
Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic, exponentially fast decaying error bounds that apply to both the approximated kernel and the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.