Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization
- URL: http://arxiv.org/abs/2009.09835v1
- Date: Fri, 18 Sep 2020 02:18:44 GMT
- Title: Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization
- Authors: Pan Zhou, Xiaotong Yuan
- Abstract summary: We show that HSDMPG can attain optimization error $\epsilon=\mathcal{O}\big(1/\sqrt{n}\big)$, which is at the order of the intrinsic excess error of a learning model, in less than a single pass over the data.
For generic strongly convex loss functions, we prove a nearly identical complexity bound at the cost of slightly increased logarithmic factors.
- Score: 83.80460802169999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic variance-reduced gradient (SVRG) algorithms have been shown to
work favorably in solving large-scale learning problems. Despite the remarkable
success, the stochastic gradient complexity of SVRG-type algorithms usually
scales linearly with data size and thus could still be expensive for huge data.
To address this deficiency, we propose a hybrid stochastic-deterministic
minibatch proximal gradient (HSDMPG) algorithm for strongly-convex problems
that enjoys provably improved data-size-independent complexity guarantees. More
precisely, for quadratic loss $F(\theta)$ of $n$ components, we prove that
HSDMPG can attain an $\epsilon$-optimization-error
$\mathbb{E}[F(\theta)-F(\theta^*)]\leq\epsilon$ within
$\mathcal{O}\Big(\frac{\kappa^{1.5}\epsilon^{0.75}\log^{1.5}(\frac{1}{\epsilon})+1}{\epsilon}\wedge\Big(\kappa
\sqrt{n}\log^{1.5}\big(\frac{1}{\epsilon}\big)+n\log\big(\frac{1}{\epsilon}\big)\Big)\Big)$
stochastic gradient evaluations, where $\kappa$ is the condition number. For
generic strongly convex loss functions, we prove a nearly identical complexity
bound though at the cost of slightly increased logarithmic factors. For
large-scale learning problems, our complexity bounds are superior to those of
the prior state-of-the-art SVRG algorithms with or without dependence on data
size. Particularly, in the case of $\epsilon=\mathcal{O}\big(1/\sqrt{n}\big)$
which is at the order of intrinsic excess error bound of a learning model and
thus sufficient for generalization, the stochastic gradient complexity bounds
of HSDMPG for quadratic and generic loss functions are respectively
$\mathcal{O} (n^{0.875}\log^{1.5}(n))$ and $\mathcal{O}
(n^{0.875}\log^{2.25}(n))$, which to our best knowledge, for the first time
achieve optimal generalization in less than a single pass over data. Extensive
numerical results demonstrate the computational advantages of our algorithm
over the prior ones.
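The abstract states the complexity guarantees but not the update rule itself. As a rough illustration only, the following Python sketch shows one plausible form of a hybrid stochastic-deterministic minibatch proximal gradient step on an $\ell_2$-regularized quadratic loss: a gradient anchor computed deterministically on a data subsample is combined with SVRG-style minibatch corrections, and the strongly convex regularizer is handled via its proximal operator. The function names, step size, and anchor/minibatch sizes below are assumptions for illustration, not the HSDMPG schedule specified in the paper.

```python
# Hedged sketch (assumed reading, not the paper's exact HSDMPG):
# hybrid stochastic-deterministic minibatch proximal gradient on the
# l2-regularized quadratic loss
#     F(theta) = (1/2n) * ||X @ theta - y||^2 + (lam/2) * ||theta||^2.
# Each outer round computes a "deterministic" gradient anchor on a data
# subsample; each inner step corrects a stochastic minibatch gradient with
# that anchor (an SVRG-style control variate) and applies the proximal
# operator of the strongly convex regularizer.
import numpy as np


def prox_l2(v, step, lam):
    """Proximal operator of (lam/2)*||.||^2 with step size `step`."""
    return v / (1.0 + step * lam)


def quad_grad(X, y, theta):
    """Gradient of the (unregularized) quadratic loss averaged over rows."""
    return X.T @ (X @ theta - y) / X.shape[0]


def hybrid_minibatch_prox_grad(X, y, lam=1e-2, step=0.1,
                               anchor_size=1024, batch_size=32,
                               n_outer=20, n_inner=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_outer):
        # Deterministic component: gradient on a fixed anchor subsample.
        anchor = rng.choice(n, size=min(anchor_size, n), replace=False)
        theta_anchor = theta.copy()
        g_anchor = quad_grad(X[anchor], y[anchor], theta_anchor)
        for _ in range(n_inner):
            # Stochastic component: minibatch gradient with control variate.
            batch = rng.choice(n, size=batch_size, replace=False)
            g = (quad_grad(X[batch], y[batch], theta)
                 - quad_grad(X[batch], y[batch], theta_anchor)
                 + g_anchor)
            # Proximal step handles the strongly convex l2 regularizer.
            theta = prox_l2(theta - step * g, step, lam)
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 200_000, 50
    X = rng.standard_normal((n, d))
    theta_true = rng.standard_normal(d)
    y = X @ theta_true + 0.1 * rng.standard_normal(n)
    theta_hat = hybrid_minibatch_prox_grad(X, y)
    print("parameter error:", np.linalg.norm(theta_hat - theta_true))
```

With these illustrative settings, the 20 outer rounds evaluate roughly $20\times(1024 + 50\times 32)\approx 5.2\times 10^4$ per-example gradients against $n = 2\times 10^5$ examples, i.e. well under one pass, which is the regime the paper's $\mathcal{O}(n^{0.875}\log^{1.5}(n))$ bound speaks to; the actual HSDMPG step sizes and minibatch schedule differ and are given in the paper.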
Related papers
- Obtaining Lower Query Complexities through Lightweight Zeroth-Order Proximal Gradient Algorithms [65.42376001308064]
We propose two variance-reduced zeroth-order (ZO) gradient estimators for composite optimization problems.
We improve the state-of-the-art function query complexities from $\mathcal{O}\big(\min\big(\frac{dn^{1/2}}{\epsilon^2}, \frac{d}{\epsilon^3}\big)\big)$ to $\tilde{\mathcal{O}}\big(\frac{d}{\epsilon^2}\big)$.
arXiv Detail & Related papers (2024-10-03T15:04:01Z) - Gradient Compressed Sensing: A Query-Efficient Gradient Estimator for High-Dimensional Zeroth-Order Optimization [48.84672493756553]
We propose a query-efficient and accurate estimator for gradients that uses only $O\big(s\log\frac{d}{s}\big)$ queries per step.
Our proposed GraCe generalizes the Indyk--Price--Woodruff (IPW) algorithm in compressed sensing from linear measurements to nonlinear functions.
arXiv Detail & Related papers (2024-05-27T03:52:53Z) - The Sample Complexity Of ERMs In Stochastic Convex Optimization [13.896417716930687]
We show that in fact $\tilde{O}\big(\frac{d}{\epsilon}+\frac{1}{\epsilon^2}\big)$ data points are also sufficient.
We further generalize the result and show that a similar upper bound holds for all convex bodies.
arXiv Detail & Related papers (2023-11-09T14:29:25Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - A One-Sample Decentralized Proximal Algorithm for Non-Convex Stochastic
Composite Optimization [10.762749887051546]
We propose two single-time-scale algorithms: Prox-DASA and Prox-DASA-GT.
Unlike prior work, our algorithms achieve comparable complexity without requiring large batch sizes, more complex per-iteration operations, or stronger assumptions.
arXiv Detail & Related papers (2023-02-20T05:16:18Z) - ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - Robust Sparse Mean Estimation via Sum of Squares [42.526664955704746]
We study the problem of high-dimensional sparse mean estimation in the presence of an $epsilon$-fraction of adversarial outliers.
Our algorithms follow the Sum-of-Squares based proofs-to-algorithms approach.
arXiv Detail & Related papers (2022-06-07T16:49:54Z) - Optimal Gradient Sliding and its Application to Distributed Optimization
Under Similarity [121.83085611327654]
We study structured convex optimization problems with additive objective $r:=p+q$, where $r$ is $\mu$-strongly convex.
We propose a method to solve such problems via master-to-agents communication and local gradient calls.
The resulting complexity is much sharper than the $\mathcal{O}(\sqrt{L_q/\mu})$ rate of prior methods.
arXiv Detail & Related papers (2022-05-30T14:28:02Z) - Gradient-Based Empirical Risk Minimization using Local Polynomial
Regression [39.29885444997579]
A major goal of this literature has been to compare different algorithms, such as gradient descent (GD) or stochastic gradient descent (SGD).
We demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of both GD and SGD.
arXiv Detail & Related papers (2020-11-04T20:10:31Z)