Gradient Descent for Low-Rank Functions
- URL: http://arxiv.org/abs/2206.08257v1
- Date: Thu, 16 Jun 2022 15:58:05 GMT
- Title: Gradient Descent for Low-Rank Functions
- Authors: Romain Cosson, Ali Jadbabaie, Anuran Makur, Amirhossein Reisizadeh,
Devavrat Shah
- Abstract summary: In machine learning tasks, e.g., training deep neural networks, the loss function varies significantly in only a few directions of the input.
Our proposed \emph{Low-Rank Gradient Descent} (LRGD) algorithm finds an $\epsilon$-approximate stationary point of a $p$-dimensional function by first identifying $r \leq p$ significant directions and then computing directional derivatives only along those directions; when $r \ll p$, its oracle complexities improve on the $\mathcal{O}(p\log(1/\epsilon))$ and $\mathcal{O}(p/\epsilon^2)$ complexities of GD.
- Score: 36.56489593549855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several recent empirical studies demonstrate that important machine learning
tasks, e.g., training deep neural networks, exhibit low-rank structure, where
the loss function varies significantly in only a few directions of the input
space. In this paper, we leverage such low-rank structure to reduce the high
computational cost of canonical gradient-based methods such as gradient descent
(GD). Our proposed \emph{Low-Rank Gradient Descent} (LRGD) algorithm finds an
$\epsilon$-approximate stationary point of a $p$-dimensional function by first
identifying $r \leq p$ significant directions, and then estimating the true
$p$-dimensional gradient at every iteration by computing directional
derivatives only along those $r$ directions. We establish that the "directional
oracle complexities" of LRGD for strongly convex and non-convex objective
functions are $\mathcal{O}(r \log(1/\epsilon) + rp)$ and
$\mathcal{O}(r/\epsilon^2 + rp)$, respectively. When $r \ll p$, these
complexities are smaller than the known complexities of $\mathcal{O}(p
\log(1/\epsilon))$ and $\mathcal{O}(p/\epsilon^2)$ of GD in the strongly
convex and non-convex settings, respectively. Thus, LRGD significantly reduces
the computational cost of gradient-based methods for sufficiently low-rank
functions. In the course of our analysis, we also formally define and
characterize the classes of exact and approximately low-rank functions.
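
The abstract describes two phases: a one-time identification of $r$ significant directions, followed by gradient steps that use only $r$ directional derivatives per iteration. Below is a minimal, hedged Python sketch of that idea; the central-difference directional-derivative oracle, the SVD-based direction-identification heuristic, and the function names (`identify_directions`, `lrgd`) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def directional_derivative(f, x, u, h=1e-6):
    """Central-difference estimate of the derivative of f at x along unit vector u."""
    return (f(x + h * u) - f(x - h * u)) / (2.0 * h)

def identify_directions(f, x0, r, num_samples=20, h=1e-5, seed=0):
    """Illustrative heuristic (assumption): sample finite-difference gradients
    near x0 and keep the top-r right singular vectors as the p x r direction matrix."""
    rng = np.random.default_rng(seed)
    p = x0.size
    basis = np.eye(p)
    grads = np.empty((num_samples, p))
    for i in range(num_samples):
        x = x0 + 0.1 * rng.standard_normal(p)
        grads[i] = [directional_derivative(f, x, basis[j], h) for j in range(p)]
    _, _, vt = np.linalg.svd(grads, full_matrices=False)
    return vt[:r].T  # columns span the (approximate) active subspace

def lrgd(f, x0, r, step=0.05, iters=300):
    """Gradient descent using an r-directional estimate of the p-dimensional gradient."""
    U = identify_directions(f, x0, r)          # one-time identification cost
    x = x0.astype(float).copy()
    for _ in range(iters):
        # r directional derivatives per iteration instead of p coordinate derivatives
        d = np.array([directional_derivative(f, x, U[:, j]) for j in range(r)])
        x = x - step * (U @ d)                 # step restricted to span(U)
    return x

# Toy check: a quadratic in R^50 that varies only along 2 coordinates.
p, r = 50, 2
H = np.zeros((p, p)); H[0, 0], H[1, 1] = 1.0, 10.0
f = lambda x: 0.5 * x @ H @ x
x_final = lrgd(f, np.ones(p), r)
print("objective after LRGD:", f(x_final))
```

On this toy objective, $r = 2$ directional derivatives per step replace a full 50-dimensional gradient, mirroring the $\mathcal{O}(r\log(1/\epsilon) + rp)$ versus $\mathcal{O}(p\log(1/\epsilon))$ comparison in the abstract.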
Related papers
- Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle\right)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary link function with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d \cdots$
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Geometric structure of Deep Learning networks and construction of global ${\mathcal L}^2$ minimizers [1.189367612437469]
We explicitly determine local and global minimizers of the ${\mathcal L}^2$ cost function in underparametrized Deep Learning (DL) networks.
arXiv Detail & Related papers (2023-09-19T14:20:55Z) - Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization [1.189367612437469]
We consider shallow neural networks with one hidden layer, a ReLU activation function, and an ${\mathcal L}^2$ Schatten class (or Hilbert-Schmidt) cost function.
We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$, where $\delta_P$ measures the signal-to-noise ratio of training inputs.
In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q \leq M$ by a
arXiv Detail & Related papers (2023-09-19T07:12:41Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - On Acceleration of Gradient-Based Empirical Risk Minimization using
Local Polynomial Regression [0.491574468325115]
We study the acceleration of the Local Polynomial Interpolation-based Gradient Descent method (LPI-GD), recently proposed for the approximate solution of empirical risk minimization (ERM) problems.
We focus on loss functions that are strongly convex and smooth with condition number $\sigma$.
We propose two accelerated methods for the problem based on LPI-GD and show an oracle complexity of $\tilde{O}\left(\sqrt{\sigma}\, m d \log(1/\varepsilon)\right)$.
arXiv Detail & Related papers (2022-04-16T02:39:45Z) - Gradient-Based Empirical Risk Minimization using Local Polynomial
Regression [39.29885444997579]
A major goal of this literature has been to compare different algorithms, such as gradient descent (GD) or stochastic gradient descent (SGD).
We demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of both GD and SGD.
arXiv Detail & Related papers (2020-11-04T20:10:31Z) - A One-bit, Comparison-Based Gradient Estimator [29.600975900977343]
We show how one can use tools from one-bit compressed sensing to construct a robust and reliable estimator of the normalized gradient.
We propose an algorithm, coined SCOBO, that uses this estimator within a gradient descent scheme.
arXiv Detail & Related papers (2020-10-06T05:01:38Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $\mathcal{O}\big(1/\sqrt{n}\big)$ bound, which is at the order of the excess error of a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - Gradient Free Minimax Optimization: Variance Reduction and Faster
Convergence [120.9336529957224]
In this paper, we focus on the gradient-free setting and consider the nonconvex-strongly-concave minimax optimization problem.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z) - Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions [84.49087114959872]
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions.
In particular, we study Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions.
arXiv Detail & Related papers (2020-02-10T23:23:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences of its use.