Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent:
Convergence Guarantees and Empirical Benefits
- URL: http://arxiv.org/abs/2111.10461v1
- Date: Fri, 19 Nov 2021 22:28:47 GMT
- Title: Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent:
Convergence Guarantees and Empirical Benefits
- Authors: Hao Chen, Lili Zheng, Raed Al Kontar, Garvesh Raskutti
- Abstract summary: gradient descent (SGD) and its variants have established themselves as the go-to algorithms for machine learning problems.
We take a step forward by proving minibatch SGD converges to a critical point of the full log-likelihood loss function.
Our theoretical guarantees hold provided that the kernel functions exhibit exponential or eigendecay.
- Score: 21.353189917487512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic gradient descent (SGD) and its variants have established
themselves as the go-to algorithms for large-scale machine learning problems
with independent samples due to their generalization performance and intrinsic
computational advantage. However, the fact that the stochastic gradient is a
biased estimator of the full gradient with correlated samples has led to the
lack of theoretical understanding of how SGD behaves under correlated settings
and hindered its use in such cases. In this paper, we focus on hyperparameter
estimation for the Gaussian process (GP) and take a step forward towards
breaking the barrier by proving minibatch SGD converges to a critical point of
the full log-likelihood loss function, and recovers model hyperparameters with
rate $O(\frac{1}{K})$ for $K$ iterations, up to a statistical error term
depending on the minibatch size. Our theoretical guarantees hold provided that
the kernel functions exhibit exponential or polynomial eigendecay which is
satisfied by a wide range of kernels commonly used in GPs. Numerical studies on
both simulated and real datasets demonstrate that minibatch SGD has better
generalization over state-of-the-art GP methods while reducing the
computational burden and opening a new, previously unexplored, data size regime
for GPs.
Related papers
- Amortized Variational Inference for Deep Gaussian Processes [0.0]
Deep Gaussian processes (DGPs) are multilayer generalizations of Gaussian processes (GPs)
We introduce amortized variational inference for DGPs, which learns an inference function that maps each observation to variational parameters.
Our method performs similarly or better than previous approaches at less computational cost.
arXiv Detail & Related papers (2024-09-18T20:23:27Z) - Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution [6.144680854063938]
We consider a variant of the gradient descent (SGD) with a random learning rate to reveal its convergence properties.
We demonstrate that a distribution of a parameter updated by Poisson SGD converges to a stationary distribution under weak assumptions.
arXiv Detail & Related papers (2024-06-23T06:52:33Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite
Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with an average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - Non Asymptotic Bounds for Optimization via Online Multiplicative
Stochastic Gradient Descent [0.0]
The gradient noise of Gradient Descent (SGD) is considered to play a key role in its properties.
We show that noise classes that have the same mean and covariance structure of SGD via minibatching have similar properties.
We also establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
arXiv Detail & Related papers (2021-12-14T02:25:43Z) - Non-Gaussian Gaussian Processes for Few-Shot Regression [71.33730039795921]
We propose an invertible ODE-based mapping that operates on each component of the random variable vectors and shares the parameters across all of them.
NGGPs outperform the competing state-of-the-art approaches on a diversified set of benchmarks and applications.
arXiv Detail & Related papers (2021-10-26T10:45:25Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by gradient descent (SGD)
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Incremental Ensemble Gaussian Processes [53.3291389385672]
We propose an incremental ensemble (IE-) GP framework, where an EGP meta-learner employs an it ensemble of GP learners, each having a unique kernel belonging to a prescribed kernel dictionary.
With each GP expert leveraging the random feature-based approximation to perform online prediction and model update with it scalability, the EGP meta-learner capitalizes on data-adaptive weights to synthesize the per-expert predictions.
The novel IE-GP is generalized to accommodate time-varying functions by modeling structured dynamics at the EGP meta-learner and within each GP learner.
arXiv Detail & Related papers (2021-10-13T15:11:25Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular reinforcement learning (RL) problems.
"vanilla" theoretical understanding of PG trajectory is one of the most popular methods for solving RL problems.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for
Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z) - Conditional Deep Gaussian Processes: multi-fidelity kernel learning [6.599344783327053]
We propose the conditional DGP model in which the latent GPs are directly supported by the fixed lower fidelity data.
Experiments with synthetic and high dimensional data show comparable performance against other multi-fidelity regression methods.
We conclude that, with the low fidelity data and the hierarchical DGP structure, the effective kernel encodes the inductive bias for true function.
arXiv Detail & Related papers (2020-02-07T14:56:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.