Super-efficiency of automatic differentiation for functions defined as a
minimum
- URL: http://arxiv.org/abs/2002.03722v1
- Date: Mon, 10 Feb 2020 13:23:01 GMT
- Title: Super-efficiency of automatic differentiation for functions defined as a
minimum
- Authors: Pierre Ablin, Gabriel Peyré and Thomas Moreau
- Abstract summary: In min-min optimization, one has to compute the gradient of a function defined as a minimum.
We study the error made by the analytic and automatic-differentiation estimators of this gradient as a function of the optimization error.
- Score: 16.02151272607889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In min-min optimization or max-min optimization, one has to compute the
gradient of a function defined as a minimum. In most cases, the minimum has no
closed form, and an approximation is obtained via an iterative algorithm. There
are two usual ways of estimating the gradient of the function: using either an
analytic formula obtained by assuming exactness of the approximation, or
automatic differentiation through the algorithm. In this paper, we study the
asymptotic error made by these estimators as a function of the optimization
error. We find that the error of the automatic estimator is close to the square
of the error of the analytic estimator, reflecting a super-efficiency
phenomenon. The convergence of the automatic estimator greatly depends on the
convergence of the Jacobian of the algorithm. We analyze it for gradient
descent and stochastic gradient descent and derive convergence rates for the
estimators in these cases. Our analysis is backed by numerical experiments on
toy problems and on Wasserstein barycenter computation. Finally, we discuss the
computational complexity of these estimators and give practical guidelines to
choose between them.
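To make the comparison concrete, here is a minimal sketch (not the authors' code) of the two estimators in JAX, on a hypothetical quadratic inner problem; the function g, the matrix A, and the step size are illustrative assumptions, not taken from the paper.
```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # double precision for a cleaner comparison

# Hypothetical toy inner problem: g(x, z) = 0.5*||z - A x||^2 + 0.5*||z||^2.
# Its minimizer over z has the closed form z*(x) = A x / 2, so the exact
# gradient of f(x) = min_z g(x, z) is available for reference.
A = jnp.array([[2.0, 0.0], [0.0, 0.5]])

def g(x, z):
    return 0.5 * jnp.sum((z - A @ x) ** 2) + 0.5 * jnp.sum(z ** 2)

def inner_gd(x, n_iter, step=0.4):
    # n_iter steps of gradient descent on z -> g(x, z), starting from zero.
    z = jnp.zeros_like(x)
    for _ in range(n_iter):
        z = z - step * jax.grad(g, argnums=1)(x, z)
    return z

def analytic_grad(x, n_iter):
    # "Analytic" estimator: plug the approximate minimizer z_t into the
    # envelope-theorem formula grad f(x) = grad_x g(x, z*(x)), treating
    # z_t as if it were exact (no differentiation through the iterations).
    z = inner_gd(x, n_iter)
    return jax.grad(g, argnums=0)(x, z)

def autodiff_grad(x, n_iter):
    # "Automatic" estimator: differentiate through the whole iterative
    # algorithm, including the dependence of z_t on x.
    return jax.grad(lambda x_: g(x_, inner_gd(x_, n_iter)))(x)

x = jnp.array([1.0, -2.0])
true_grad = jax.grad(lambda x_: g(x_, A @ x_ / 2.0))(x)  # uses the exact minimizer

for t in (1, 3, 5, 10):
    err_analytic = float(jnp.linalg.norm(analytic_grad(x, t) - true_grad))
    err_autodiff = float(jnp.linalg.norm(autodiff_grad(x, t) - true_grad))
    # The automatic error decays roughly like the square of the analytic
    # error, which is the super-efficiency phenomenon studied in the paper.
    print(t, err_analytic, err_autodiff)
```
In this toy example the Jacobian of the gradient-descent iterates converges to the Jacobian of the exact minimizer at the same linear rate as the iterates themselves, which is what gives the automatic estimator its extra order of accuracy.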
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametrized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics (ARC) have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously provide inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - One-step differentiation of iterative algorithms [7.9495796547433395]
We study one-step differentiation, also known as Jacobian-free backpropagation, a method as easy as automatic differentiation (a minimal illustrative sketch of this idea appears after this list).
We provide a complete theoretical approximation analysis with specific examples along with its consequences in bilevel optimization.
arXiv Detail & Related papers (2023-05-23T07:32:37Z) - A Gradient Smoothed Functional Algorithm with Truncated Cauchy Random
Perturbations for Stochastic Optimization [10.820943271350442]
We present a stochastic gradient algorithm for minimizing a smooth objective function that is an expectation over noisy cost samples.
We also show that our algorithm avoids unstable equilibria, implying convergence to local minima.
arXiv Detail & Related papers (2022-07-30T18:50:36Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity and in terms of function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Gradient Free Minimax Optimization: Variance Reduction and Faster
Convergence [120.9336529957224]
In this paper, we consider the nonconvex-strongly-concave setting of the gradient-free minimax optimization problem.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z) - Exploiting Higher Order Smoothness in Derivative-free Optimization and
Continuous Bandits [99.70167985955352]
We study the problem of zero-order optimization of a strongly convex function.
We consider a randomized approximation of the projected gradient descent algorithm.
Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters.
arXiv Detail & Related papers (2020-06-14T10:42:23Z) - The estimation error of general first order methods [12.472245917779754]
We consider two families of estimation problems: high-dimensional regression and low-dimensional matrix estimation.
We derive lower bounds on the error that hold in the high-dimensional asymptotics in which both the number of observations and the number of parameters diverge.
These lower bounds are optimal in the sense that there exist algorithms whose estimation error matches the lower bounds up to asymptotically negligible terms.
arXiv Detail & Related papers (2020-02-28T18:13:47Z) - Stochastic Optimization for Regularized Wasserstein Estimators [10.194798773447879]
We introduce an algorithm to solve a regularized version of the problem of computing Wasserstein estimators, with a time per step which is sublinear in the natural dimensions of the problem.
We show that this algorithm can be extended to other tasks, including estimation of Wasserstein barycenters.
arXiv Detail & Related papers (2020-02-20T12:04:05Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: The random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
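As referenced in the one-step differentiation entry above, here is a minimal, self-contained sketch of Jacobian-free (one-step) backpropagation on the same hypothetical quadratic inner problem used after the abstract; the function g, the matrix A, and the step size are illustrative assumptions, not taken from the cited paper. Only the final inner iteration is differentiated, so the cost and memory of the backward pass do not grow with the number of inner iterations.
```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

# Same hypothetical quadratic inner problem as in the sketch after the abstract.
A = jnp.array([[2.0, 0.0], [0.0, 0.5]])
STEP = 0.4

def g(x, z):
    return 0.5 * jnp.sum((z - A @ x) ** 2) + 0.5 * jnp.sum(z ** 2)

def gd_step(x, z):
    # One gradient-descent step on z -> g(x, z).
    return z - STEP * jax.grad(g, argnums=1)(x, z)

def inner_gd(x, n_iter):
    z = jnp.zeros_like(x)
    for _ in range(n_iter):
        z = gd_step(x, z)
    return z

def one_step_grad(x, n_iter):
    # Jacobian-free ("one-step") backpropagation: treat the penultimate
    # iterate as a constant and differentiate only through the last step,
    # so the Jacobian of the full trajectory is never formed.
    z_prev = jax.lax.stop_gradient(inner_gd(x, n_iter - 1))
    return jax.grad(lambda x_: g(x_, gd_step(x_, z_prev)))(x)

x = jnp.array([1.0, -2.0])
true_grad = jax.grad(lambda x_: g(x_, A @ x_ / 2.0))(x)  # exact minimizer z*(x) = A x / 2

for t in (1, 3, 5, 10):
    print(t, float(jnp.linalg.norm(one_step_grad(x, t) - true_grad)))
```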