Over-Parameterization Exponentially Slows Down Gradient Descent for
Learning a Single Neuron
- URL: http://arxiv.org/abs/2302.10034v2
- Date: Tue, 10 Oct 2023 05:55:10 GMT
- Title: Over-Parameterization Exponentially Slows Down Gradient Descent for
Learning a Single Neuron
- Authors: Weihang Xu, Simon S. Du
- Abstract summary: We prove the global convergence of randomly initialized gradient descent with an $O\left(T^{-3}\right)$ rate.
These two bounds jointly give an exact characterization of the convergence rate.
We show this potential function converges slowly, which implies the slow convergence rate of the loss function.
- Score: 49.45105570960104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We revisit the problem of learning a single neuron with ReLU activation under
Gaussian input with square loss. We particularly focus on the
over-parameterization setting where the student network has $n\ge 2$ neurons.
We prove the global convergence of randomly initialized gradient descent with an
$O\left(T^{-3}\right)$ rate. This is the first global convergence result for
this problem beyond the exact-parameterization setting ($n=1$), in which
gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we
further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly
initialized gradient flow in the over-parameterization setting. These two
bounds jointly give an exact characterization of the convergence rate and
imply, for the first time, that over-parameterization can exponentially slow
down the convergence rate. To prove the global convergence, we need to tackle
the interactions among student neurons in the gradient descent dynamics, which
are not present in the exact-parameterization case. We use a three-phase
structure to analyze GD's dynamics. Along the way, we prove gradient descent
automatically balances student neurons, and use this property to deal with the
non-smoothness of the objective function. To prove the convergence rate lower
bound, we construct a novel potential function that characterizes the pairwise
distances between the student neurons (which cannot be done in the
exact-parameterization case). We show this potential function converges slowly,
which implies the slow convergence rate of the loss function.
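To make the setting concrete, here is a minimal simulation sketch of the teacher-student problem described above: a single ReLU teacher neuron $v$, a student with $n$ ReLU neurons, and the square loss under Gaussian input, trained by randomly initialized gradient descent. The population loss is approximated with large fresh Gaussian batches; the dimension, step size, batch size, initialization scale, and iteration count are illustrative assumptions, not values taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
d = 10                                   # input dimension (assumed)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)                   # teacher neuron on the unit sphere

def relu(z):
    return np.maximum(z, 0.0)

def run_gd(n, steps=20000, lr=0.05, batch=4096):
    """Large-batch SGD on fresh samples as a proxy for population gradient descent."""
    W = 0.1 * rng.standard_normal((n, d))              # random small initialization (assumed scale)
    losses = []
    for _ in range(steps):
        X = rng.standard_normal((batch, d))            # fresh Gaussian inputs
        resid = relu(X @ W.T).sum(axis=1) - relu(X @ v)  # student output minus teacher output
        losses.append(0.5 * np.mean(resid ** 2))
        act = (X @ W.T > 0).astype(float)              # ReLU derivative for each student neuron
        grad = (act * resid[:, None]).T @ X / batch    # gradient of the batch square loss
        W -= lr * grad
    return np.array(losses)

for n in (1, 2):
    loss = run_gd(n)
    print(f"n={n}: loss at T=1000: {loss[999]:.3e}, loss at T=20000: {loss[-1]:.3e}")

Under the abstract's characterization, the exact-parameterized run ($n=1$) should decay roughly like $\exp(-\Omega(T))$, while the over-parameterized run ($n=2$) should decay only polynomially, on the order of $T^{-3}$.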
Related papers
- Convergence Analysis of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks [3.680127959836384]
First-order methods, such as gradient descent (GD) and stochastic gradient descent (SGD), have been proven effective in training neural networks.
However, the learning rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix.
In this paper, we show that for $L^2$ regression problems the learning rate can be improved, which implies that GD actually enjoys a faster convergence rate.
arXiv Detail & Related papers (2024-08-01T14:06:34Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of over-parameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Analysis of the expected $L_2$ error of an over-parametrized deep neural
network estimate learned by gradient descent without regularization [7.977229957867868]
Recent results show that estimates defined by over-parametrized deep neural networks learned by applying gradient descent to a regularized empirical $L_2$ risk are universally consistent.
In this paper, we show that the regularization term is not necessary to obtain similar results.
arXiv Detail & Related papers (2023-11-24T17:04:21Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We analyze a training algorithm for distributed computation workers with varying communication frequencies.
In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1}\right)$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence for both the Adam and Adagrad algorithms, with an $O(d\ln(N)/\sqrt{N})$ rate.
Adam converges at the same $O(d\ln(N)/\sqrt{N})$ rate when used with its default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z) - Learning Unitaries by Gradient Descent [12.354076490479516]
We learn unitary transformations in $U(d)$ via gradient descent on time parameters of alternating sequences.
With fewer than $d^2$ parameters, gradient descent converges to a sub-optimal solution, whereas with more than $d^2$ parameters, gradient descent converges to an optimal solution.
arXiv Detail & Related papers (2020-01-31T15:20:55Z)