Debiasing Distributed Second Order Optimization with Surrogate Sketching
and Scaled Regularization
- URL: http://arxiv.org/abs/2007.01327v1
- Date: Thu, 2 Jul 2020 18:08:14 GMT
- Title: Debiasing Distributed Second Order Optimization with Surrogate Sketching
and Scaled Regularization
- Authors: Micha{\l} Derezi\'nski, Burak Bartan, Mert Pilanci and Michael W.
Mahoney
- Abstract summary: In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data.
Here, we introduce a new technique for debiasing the local estimates, which leads to both theoretical and empirical improvements in the convergence rate of distributed second order methods.
- Score: 101.5159744660701
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In distributed second order optimization, a standard strategy is to average
many local estimates, each of which is based on a small sketch or batch of the
data. However, the local estimates on each machine are typically biased,
relative to the full solution on all of the data, and this can limit the
effectiveness of averaging. Here, we introduce a new technique for debiasing
the local estimates, which leads to both theoretical and empirical improvements
in the convergence rate of distributed second order methods. Our technique has
two novel components: (1) modifying standard sketching techniques to obtain
what we call a surrogate sketch; and (2) carefully scaling the global
regularization parameter for local computations. Our surrogate sketches are
based on determinantal point processes, a family of distributions for which the
bias of an estimate of the inverse Hessian can be computed exactly. Based on
this computation, we show that when the objective being minimized is
$l_2$-regularized with parameter $\lambda$ and individual machines are each
given a sketch of size $m$, then to eliminate the bias, local estimates should
be computed using a shrunk regularization parameter given by
$\lambda^{\prime}=\lambda\cdot(1-\frac{d_{\lambda}}{m})$, where $d_{\lambda}$
is the $\lambda$-effective dimension of the Hessian (or, for quadratic
problems, the data matrix).
Related papers
- Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient [40.22217106270146]
Variance reduction techniques are designed to decrease the sampling variance, thereby accelerating convergence rates of first-order (FO) and zeroth-order (ZO) optimization methods.
In composite optimization problems, ZO methods encounter an additional variance called the coordinate-wise variance, which stems from the random estimation.
This paper proposes the Zeroth-order Proximal Double Variance Reduction (ZPDVR) method, which utilizes the averaging trick to reduce both sampling and coordinate-wise variances.
arXiv Detail & Related papers (2024-05-28T02:27:53Z) - Rate Analysis of Coupled Distributed Stochastic Approximation for Misspecified Optimization [0.552480439325792]
We consider an $n$ agents distributed optimization problem with imperfect information characterized in a parametric sense.
We propose a coupled distributed approximation algorithm, in which every agent updates the current beliefs of its unknown parameter.
We quantitatively characterize the factors that affect the algorithm performance, and prove that the mean-squared error of the decision variable is bounded by $mathcalO(frac1nk)+mathcalOleft(frac1sqrtn (1-rho_w)right)frac1k1.5
arXiv Detail & Related papers (2024-04-21T14:18:49Z) - Explicit Second-Order Min-Max Optimization Methods with Optimal Convergence Guarantee [86.05440220344755]
We propose and analyze inexact regularized Newton-type methods for finding a global saddle point of emphcon unconstrained min-max optimization problems.
We show that the proposed methods generate iterates that remain within a bounded set and that the iterations converge to an $epsilon$-saddle point within $O(epsilon-2/3)$ in terms of a restricted function.
arXiv Detail & Related papers (2022-10-23T21:24:37Z) - DRSOM: A Dimension Reduced Second-Order Method [13.778619250890406]
Under a trust-like framework, our method preserves the convergence of the second-order method while using only information in a few directions.
Theoretically, we show that the method has a local convergence and a global convergence rate of $O(epsilon-3/2)$ to satisfy the first-order and second-order conditions.
arXiv Detail & Related papers (2022-07-30T13:05:01Z) - Statistical Inference of Constrained Stochastic Optimization via Sketched Sequential Quadratic Programming [53.63469275932989]
We consider online statistical inference of constrained nonlinear optimization problems.
We apply the Sequential Quadratic Programming (StoSQP) method to solve these problems.
arXiv Detail & Related papers (2022-05-27T00:34:03Z) - Distributed Sketching for Randomized Optimization: Exact
Characterization, Concentration and Lower Bounds [54.51566432934556]
We consider distributed optimization methods for problems where forming the Hessian is computationally challenging.
We leverage randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems.
arXiv Detail & Related papers (2022-03-18T05:49:13Z) - Stochastic regularized majorization-minimization with weakly convex and
multi-convex surrogates [0.0]
We show that the first optimality gap of the proposed algorithm decays at the rate of the expected loss for various methods under nontens dependent data setting.
We obtain first convergence point for various methods under nontens dependent data setting.
arXiv Detail & Related papers (2022-01-05T15:17:35Z) - Distributed Sparse Regression via Penalization [5.990069843501885]
We study linear regression over a network of agents, modeled as an undirected graph (with no centralized node)
The estimation problem is formulated as the minimization of the sum of the local LASSO loss functions plus a quadratic penalty of the consensus constraint.
We show that the proximal-gradient algorithm applied to the penalized problem converges linearly up to a tolerance of the order of the centralized statistical error.
arXiv Detail & Related papers (2021-11-12T01:51:50Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for bound non-dimensional optimization.
We demonstrate two popular deep learning methods on the empirical advantages over standard gradient methods.
arXiv Detail & Related papers (2020-06-24T06:01:24Z) - Distributed Averaging Methods for Randomized Second Order Optimization [54.51566432934556]
We consider distributed optimization problems where forming the Hessian is computationally challenging and communication is a bottleneck.
We develop unbiased parameter averaging methods for randomized second order optimization that employ sampling and sketching of the Hessian.
We also extend the framework of second order averaging methods to introduce an unbiased distributed optimization framework for heterogeneous computing systems.
arXiv Detail & Related papers (2020-02-16T09:01:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.