GO Hessian for Expectation-Based Objectives
- URL: http://arxiv.org/abs/2006.08873v1
- Date: Tue, 16 Jun 2020 02:20:41 GMT
- Title: GO Hessian for Expectation-Based Objectives
- Authors: Yulai Cong, Miaoyun Zhao, Jianqiao Li, Junya Chen, Lawrence Carin
- Abstract summary: GO gradient was proposed recently for expectation-based objectives $\mathbb{E}_{q_{\boldsymbol{\gamma}}(\boldsymbol{y})} [f(\boldsymbol{y})]$.
Based on the GO gradient, we present for $\mathbb{E}_{q_{\boldsymbol{\gamma}}(\boldsymbol{y})} [f(\boldsymbol{y})]$ an unbiased low-variance Hessian estimator, named GO Hessian.
- Score: 73.06986780804269
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An unbiased low-variance gradient estimator, termed GO gradient, was proposed
recently for expectation-based objectives
$\mathbb{E}_{q_{\boldsymbol{\gamma}}(\boldsymbol{y})} [f(\boldsymbol{y})]$,
where the random variable (RV) $\boldsymbol{y}$ may be drawn from a stochastic
computation graph with continuous (non-reparameterizable) internal nodes and
continuous/discrete leaves. Upgrading the GO gradient, we present for
$\mathbb{E}_{q_{\boldsymbol{\gamma}}(\boldsymbol{y})}
[f(\boldsymbol{y})]$ an unbiased low-variance Hessian estimator, named GO
Hessian. Considering practical implementation, we reveal that GO Hessian is
easy to use with auto-differentiation and Hessian-vector products, enabling
efficient, cheap exploitation of curvature information over stochastic
computation graphs. As representative examples, we present the GO Hessian for
non-reparameterizable gamma and negative binomial RVs/nodes. Based on the GO
Hessian, we design a new second-order method for
$\mathbb{E}_{q_{\boldsymbol{\gamma}}(\boldsymbol{y})}
[f(\boldsymbol{y})]$, with rigorous experiments conducted to verify its
effectiveness and efficiency.
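The abstract emphasizes that the GO Hessian is easy to use through auto-differentiation and Hessian-vector products. The sketch below illustrates only that generic mechanic in Python/PyTorch: a Hessian-vector product, with respect to the distribution parameters $\boldsymbol{\gamma}$, of a Monte Carlo estimate of $\mathbb{E}_{q_{\boldsymbol{\gamma}}(\boldsymbol{y})}[f(\boldsymbol{y})]$. The Gaussian $q_{\boldsymbol{\gamma}}$, the toy integrand $f$, and the helper names are illustrative assumptions; this is not the paper's GO estimator, which specifically targets non-reparameterizable nodes such as gamma and negative binomial random variables.

```python
# Minimal sketch (assumptions, not the paper's GO Hessian): Hessian-vector
# product of a Monte Carlo estimate of E_{q_gamma(y)}[f(y)] w.r.t. gamma,
# using a reparameterizable Gaussian q_gamma as a stand-in for the
# non-reparameterizable nodes that GO gradients/Hessians actually handle.
import torch

def expected_objective(gamma, n_samples=64):
    """Monte Carlo estimate of E_{q_gamma(y)}[f(y)] with q_gamma = N(mu, sigma^2)."""
    mu, log_sigma = gamma
    eps = torch.randn(n_samples)
    y = mu + torch.exp(log_sigma) * eps      # reparameterized samples y ~ q_gamma
    f = (y - 2.0) ** 2                       # toy integrand f(y)
    return f.mean()

def hessian_vector_product(loss_fn, gamma, v):
    """Compute (d^2 L / d gamma^2) @ v by double backward, without forming the Hessian."""
    gamma = gamma.detach().requires_grad_(True)
    loss = loss_fn(gamma)
    grad, = torch.autograd.grad(loss, gamma, create_graph=True)
    hvp, = torch.autograd.grad(grad @ v, gamma)
    return hvp

gamma = torch.tensor([0.0, 0.0])             # gamma = (mu, log_sigma)
v = torch.tensor([1.0, 0.0])                 # direction for the Hessian-vector product
print(hessian_vector_product(expected_objective, gamma, v))
```

In a second-order method of the kind the abstract describes, such Hessian-vector products would typically be passed to a conjugate-gradient or trust-region solver rather than forming the Hessian explicitly.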
Related papers
- Targeted Variance Reduction: Robust Bayesian Optimization of Black-Box
Simulators with Noise Parameters [1.7404865362620803]
We propose a new Bayesian optimization method called Targeted Variance Reduction (TVR)
TVR leverages a novel joint acquisition function over $(\mathbf{x}, \boldsymbol{\theta})$, which targets variance reduction on the objective within the desired region of improvement.
We demonstrate the improved performance of TVR over the state-of-the-art in a suite of numerical experiments and an application to the robust design of automobile brake discs.
arXiv Detail & Related papers (2024-03-06T16:03:37Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Stochastic Zeroth Order Gradient and Hessian Estimators: Variance
Reduction and Refined Bias Bounds [6.137707924685666]
We study zeroth-order gradient and Hessian estimators for real-valued functions in $\mathbb{R}^n$.
We show that, by taking finite differences along random directions, the variance of finite-difference gradient estimators can be significantly reduced.
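For context, here is a minimal sketch of the generic random-direction finite-difference gradient estimator that this summary alludes to; the test function, smoothing parameter, and Gaussian directions are assumptions for illustration, not the paper's variance-reduced construction or its bias/variance analysis.

```python
# Minimal sketch (illustrative assumptions, not the paper's estimator): the
# standard random-direction finite-difference gradient estimator in R^n.
import numpy as np

def fd_gradient(f, x, mu=1e-4, num_dirs=32, seed=0):
    """Average forward differences (f(x + mu*v) - f(x)) / mu along Gaussian directions v."""
    rng = np.random.default_rng(seed)
    fx = f(x)
    g = np.zeros_like(x)
    for _ in range(num_dirs):
        v = rng.standard_normal(x.shape)     # random direction
        g += (f(x + mu * v) - fx) / mu * v   # one directional-difference term
    return g / num_dirs

# toy check on f(x) = ||x||^2, whose true gradient is 2x
x = np.array([1.0, -2.0, 0.5])
print(fd_gradient(lambda z: float(np.dot(z, z)), x))
```

Averaging over more random directions lowers the variance of the estimate at the cost of extra function evaluations.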
arXiv Detail & Related papers (2022-05-29T18:53:24Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Polyak-Ruppert Averaged Q-Learning is Statistically Efficient [90.14768299744792]
We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a $\gamma$-discounted MDP.
We establish normality for the iteration-averaged $\bar{\boldsymbol{Q}}_T$.
In short, our theoretical analysis shows averaged Q-learning is statistically efficient.
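As a rough, hedged illustration of the object being analyzed (not the paper's setting, step-size schedule, or guarantees), the following sketch runs synchronous tabular Q-learning with Polyak-Ruppert iterate averaging on a toy randomly generated MDP.

```python
# Hedged sketch: synchronous tabular Q-learning with Polyak-Ruppert (iterate)
# averaging on a toy random MDP. The MDP, step-size schedule, and horizon are
# assumptions for illustration, not the paper's setting or analysis.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, T = 5, 3, 0.9, 5000
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.random((S, A))                       # deterministic reward table r(s, a)

Q = np.zeros((S, A))
Q_bar = np.zeros((S, A))                     # running Polyak-Ruppert average of the iterates
for t in range(1, T + 1):
    # synchronous update: every (s, a) pair draws one next state from the generative model
    s_next = np.array([[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)])
    target = R + gamma * Q[s_next].max(axis=-1)
    eta = 1.0 / (1.0 + (1.0 - gamma) * t)    # one common rescaled-linear step size (an assumption)
    Q += eta * (target - Q)
    Q_bar += (Q - Q_bar) / t                 # incremental mean: Q_bar_T = (1/T) * sum_t Q_t

print(Q_bar.max(axis=1))                     # value estimates from the averaged iterate
```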
arXiv Detail & Related papers (2021-12-29T14:47:56Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Tree-Projected Gradient Descent for Estimating Gradient-Sparse
Parameters on Graphs [10.846572437131872]
We study estimation of a gradient-sparse parameter vector $\boldsymbol{\theta}^* \in \mathbb{R}^p$.
We show that, under suitable restricted strong convexity and smoothness assumptions for the loss, the resulting estimator achieves the squared-error risk $\frac{s^*}{n} \log(1+\frac{p}{s^*})$ up to a multiplicative constant that is independent of $G$.
arXiv Detail & Related papers (2020-05-31T20:08:13Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - Stochastic Recursive Gradient Descent Ascent for Stochastic
Nonconvex-Strongly-Concave Minimax Problems [36.645753881826955]
In this paper, we propose a novel method called Stochastic Recursive gradiEnt Descent Ascent (SREDA), which estimates gradients more efficiently using variance reduction.
This method achieves the best known complexity for this problem.
arXiv Detail & Related papers (2020-01-11T09:05:03Z)