On the Effectiveness of Richardson Extrapolation in Machine Learning
- URL: http://arxiv.org/abs/2002.02835v3
- Date: Fri, 17 Jul 2020 14:38:05 GMT
- Title: On the Effectiveness of Richardson Extrapolation in Machine Learning
- Authors: Francis Bach (LIENS, SIERRA)
- Abstract summary: Richardson extrapolation is a technique from numerical analysis that can improve the approximation error of an estimation method.
We show that Richardson extrapolation techniques come with no significant loss in performance, and sometimes with strong gains.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Richardson extrapolation is a classical technique from numerical analysis
that can improve the approximation error of an estimation method by combining
linearly several estimates obtained from different values of one of its
hyperparameters, without the need to know in detail the inner structure of the
original estimation method. The main goal of this paper is to study when
Richardson extrapolation can be used within machine learning, beyond the
existing applications to step-size adaptations in stochastic gradient descent.
We identify two situations where Richardson extrapolation can be useful: (1)
when the hyperparameter is the number of iterations of an existing iterative
optimization algorithm, with applications to averaged gradient descent and
Frank-Wolfe algorithms (where we obtain asymptotic rates of $O(1/k^2)$ on
polytopes, where $k$ is the number of iterations), and (2) when it is a
regularization parameter, with applications to Nesterov smoothing techniques
for minimizing non-smooth functions (where we obtain asymptotic rates close
to $O(1/k^2)$ for non-smooth functions), and ridge regression. In all these
cases, we show that extrapolation techniques come with no significant loss in
performance, and sometimes with strong gains, and we provide theoretical
justifications based on asymptotic developments for such gains, as well as
empirical illustrations on classical problems from machine learning.
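As a concrete illustration of the idea in the abstract, below is a minimal sketch of the regularization-parameter case applied to ridge regression, assuming the estimator admits a first-order bias expansion in the regularization parameter; the function names, data, and choice of $\lambda$ are illustrative assumptions, not taken from the paper.

import numpy as np

def ridge_estimate(X, y, lam):
    # Closed-form ridge regression estimator for regularization parameter lam.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def richardson_ridge(X, y, lam):
    # Richardson extrapolation over the regularization parameter:
    # if theta(lam) = theta_0 + c * lam + O(lam^2), then the combination
    # 2 * theta(lam / 2) - theta(lam) cancels the first-order term in lam.
    return 2.0 * ridge_estimate(X, y, lam / 2.0) - ridge_estimate(X, y, lam)

# Toy comparison on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
theta_true = rng.standard_normal(5)
y = X @ theta_true + 0.1 * rng.standard_normal(200)
lam = 5.0
print("plain ridge error:  ", np.linalg.norm(ridge_estimate(X, y, lam) - theta_true))
print("extrapolated error: ", np.linalg.norm(richardson_ridge(X, y, lam) - theta_true))

The iteration-count case mentioned in the abstract follows the same pattern: a linear combination of averaged iterates obtained at two different iteration counts cancels the leading term of the corresponding asymptotic expansion.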
Related papers
- Fast Nonlinear Two-Time-Scale Stochastic Approximation: Achieving $O(1/k)$ Finite-Sample Complexity [2.5382095320488665]
This paper develops a new variant of two-time-scale stochastic approximation to find the roots of two coupled nonlinear operators.
Our key idea is to leverage the classic Ruppert-Polyak averaging technique to dynamically estimate the operators through their samples.
The estimated values from these averaging steps are then used in the two-time-scale updates to find the desired solution.
arXiv Detail & Related papers (2024-01-23T13:44:15Z)
- Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have very appealing theoretical properties.
We show that TR and ARC methods can simultaneously accommodate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z)
- Langevin dynamics based algorithm e-TH$\varepsilon$O POULA for stochastic optimization problems with discontinuous stochastic gradient [6.563379950720334]
We introduce a new Langevin dynamics based algorithm, called e-TH$\varepsilon$O POULA, to solve optimization problems with discontinuous stochastic gradients.
Three key applications in finance and insurance are provided, namely, multi-period portfolio optimization, transfer learning in multi-period portfolio optimization, and insurance claim prediction.
arXiv Detail & Related papers (2022-10-24T13:10:06Z)
- Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic Gradient Langevin Dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z)
- Non-asymptotic estimates for TUSLA algorithm for non-convex learning with applications to neural networks with ReLU activation function [3.5044892799305956]
We provide a non-asymptotic analysis for the tamed unadjusted stochastic Langevin algorithm (TUSLA) introduced in Lovas et al.
In particular, we establish non-asymptotic error bounds for the TUSLA algorithm in Wasserstein-1 and Wasserstein-2 distances.
We show that the TUSLA algorithm converges rapidly to the optimal solution.
arXiv Detail & Related papers (2021-07-19T07:13:02Z)
- L2M: Practical posterior Laplace approximation with optimization-driven second moment estimation [0.0]
Uncertainty quantification for deep neural networks has recently evolved through many techniques.
We show that, under some regularity conditions, the Laplace approximation can be easily constructed using the gradient second moment.
arXiv Detail & Related papers (2021-07-09T22:14:54Z)
- Near-Optimal High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [63.304196997102494]
It is essential to theoretically guarantee that algorithms provide small objective residual with high probability.
Existing methods for non-smooth convex optimization have complexity bounds whose dependence on the confidence level is either negative-power or logarithmic.
We propose novel stepsize rules for two gradient methods with clipping.
arXiv Detail & Related papers (2021-06-10T17:54:21Z)
- Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box Optimization Framework [100.36569795440889]
This work focuses on zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design of coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z)
- SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of an optimistic adaptive gradient algorithm for non-concave min-max problems.
Our experiments empirically compare adaptive and non-adaptive gradient algorithms for GAN training.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.