How Two-Layer Neural Networks Learn, One (Giant) Step at a Time
- URL: http://arxiv.org/abs/2305.18270v3
- Date: Fri, 15 Dec 2023 22:10:05 GMT
- Title: How Two-Layer Neural Networks Learn, One (Giant) Step at a Time
- Authors: Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic
Stephan
- Abstract summary: We investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function.
We compare the influence of batch size and that of multiple (but finitely many) steps.
We show that a batch-size of $n = \mathcal{O}(d)$ is indeed enough to learn multiple target directions satisfying a staircase property.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate theoretically how the features of a two-layer neural network
adapt to the structure of the target function through a few large batch
gradient descent steps, leading to improvement in the approximation capacity
with respect to the initialization. We compare the influence of batch size and
that of multiple (but finitely many) steps. For a single gradient step, a batch
of size $n = \mathcal{O}(d)$ is both necessary and sufficient to align with the
target function, although only a single direction can be learned. In contrast,
$n = \mathcal{O}(d^2)$ is essential for neurons to specialize to multiple
relevant directions of the target with a single gradient step. Even in this
case, we show there might exist ``hard'' directions requiring $n =
\mathcal{O}(d^\ell)$ samples to be learned, where $\ell$ is known as the leap
index of the target. The picture drastically improves over multiple gradient
steps: we show that a batch-size of $n = \mathcal{O}(d)$ is indeed enough to
learn multiple target directions satisfying a staircase property, where more
and more directions can be learned over time. Finally, we discuss how these
directions allow one to drastically improve the approximation capacity and
generalization error over the initialization, illustrating a separation of
scales between the random features/lazy regime and the feature learning regime.
Our technical analysis leverages a combination of techniques related to
concentration, projection-based conditioning, and Gaussian equivalence which we
believe are of independent interest. By pinning down the conditions necessary
for specialization and learning, our results highlight the interaction between
batch size and number of iterations, and lead to a hierarchical depiction where
learning performance exhibits a stairway to accuracy over time and batch size,
shedding new light on how neural networks adapt to features of the data.
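The single-step regime described in the abstract can be illustrated with a small simulation. The sketch below is a toy, not the paper's exact setting: the width, the learning-rate scaling, and the linear single-index target are all assumptions made for illustration. It trains only the first layer of a two-layer network with one large-batch gradient step of size $n = 10d$ and measures how the neuron weights align with the hidden target direction.

```python
import numpy as np

rng = np.random.default_rng(0)

d, p, n = 200, 50, 2000   # input dimension, hidden width, batch size n = 10*d
eta = 100.0               # a "giant" learning rate, scaled up with the width

# Hidden single-index target: y = <w_star, x> (a leap-index-1 toy target).
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

# Two-layer network f(x) = sum_j a_j * tanh(<w_j, x>); only W is trained.
W = rng.standard_normal((p, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)
a = rng.choice([-1.0, 1.0], size=p) / p

def alignment(W):
    """Mean |cosine similarity| between neuron weights and the target direction."""
    cos = (W @ w_star) / np.linalg.norm(W, axis=1)
    return np.abs(cos).mean()

# One large batch, one gradient step on the squared loss 0.5*mean((f - y)^2).
X = rng.standard_normal((n, d))
y = X @ w_star
pre = X @ W.T                          # (n, p) pre-activations
resid = np.tanh(pre) @ a - y           # network output minus target
grad = ((resid[:, None] * (1.0 - np.tanh(pre) ** 2) * a).T @ X) / n
W_after = W - eta * grad

align_before, align_after = alignment(W), alignment(W_after)
print(f"mean alignment: {align_before:.3f} -> {align_after:.3f}")
```

At random initialization the alignment is of order $1/\sqrt{d}$; after the single giant step the rows of $W$ pick up a macroscopic component along the target direction, consistent with the $n = \mathcal{O}(d)$ picture for an easy direction.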
Related papers
- Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement
Machine unlearning (MU) has emerged to enhance the privacy and trustworthiness of deep neural networks.
Approximate MU is a practical method for large-scale models.
We propose a fast-slow parameter update strategy to implicitly approximate the up-to-date salient unlearning direction.
arXiv Detail & Related papers (2024-09-29T15:17:33Z)
- Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs
We study the impact of the batch size on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD).
We show that performing gradient updates with large batches minimizes the training time without changing the total sample complexity.
We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs).
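The ODE-tracking viewpoint can be caricatured with a stylized one-dimensional dynamics for the overlap $m(t)$ between a neuron and the target direction. The equation below is an assumption made for illustration, not the paper's actual system: the exponent $k$ stands in for the information exponent, and the dynamics are integrated with plain Euler steps.

```python
# Toy order-parameter ODE for the overlap m(t) = <w(t), w_star>:
#   dm/dt = (1 - m^2) * m^(k-1)
# where k caricatures the information exponent: k = 1 escapes
# mediocrity immediately, while k = 2 must first grow from a
# small initial overlap, so it reaches alignment later.
def integrate_overlap(k, m0=0.05, dt=0.01, steps=200):
    m = m0
    for _ in range(steps):
        m += dt * (1.0 - m * m) * m ** (k - 1)
        m = min(m, 1.0)  # the overlap is a cosine, capped at 1
    return m

m_easy = integrate_overlap(k=1)
m_hard = integrate_overlap(k=2)
print(f"overlap at t=2: k=1 -> {m_easy:.3f}, k=2 -> {m_hard:.3f}")
```

Over the same time horizon the $k=1$ dynamics essentially reach full alignment while the $k=2$ dynamics are still escaping the small-overlap region, mirroring the time/complexity tradeoffs discussed in the paper.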
arXiv Detail & Related papers (2024-06-04T09:44:49Z)
- Fundamental computational limits of weak learnability in high-dimensional multi-index models
This paper focuses on the minimum sample complexity required for weakly recovering their low-dimensional structure with first-order iterative algorithms.
Our findings unfold in three parts: (i) we identify under which conditions a trivial subspace can be learned with a single step of a first-order algorithm for any $\alpha > 0$; (ii) if the trivial subspace is empty, we provide necessary and sufficient conditions for the existence of an easy subspace.
In a limited but interesting set of really hard directions -- akin to the parity problem -- $\alpha_c$ is found
arXiv Detail & Related papers (2024-05-24T11:59:02Z)
- Step-size Optimization for Continual Learning
In continual learning, a learner must keep learning from data throughout its whole lifetime.
In a neural network, this can be implemented by using a step-size vector to scale how much each sample changes the network weights.
Common algorithms, like RMSProp and Adam, use gradient statistics, specifically normalization, to adapt this step-size vector.
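A minimal sketch of the step-size-vector mechanism mentioned above, using the standard RMSProp recursion; this is a generic illustration (the quadratic objective and all constants are assumptions), not the paper's proposed method.

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=1e-2, beta=0.9, eps=1e-8):
    """One RMSProp update: the running average v of squared gradients
    normalizes each coordinate, giving every weight its own effective
    step size lr / sqrt(v)."""
    v = beta * v + (1.0 - beta) * grad ** 2
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w, v

# Minimize f(w) = 0.5 * w^T diag(1, 100) w, a badly conditioned quadratic:
# the two coordinates see curvatures differing by a factor of 100.
w = np.array([1.0, 1.0])
v = np.zeros_like(w)
for _ in range(500):
    grad = np.array([1.0, 100.0]) * w
    w, v = rmsprop_step(w, grad, v)
print(w)  # both coordinates shrink despite the 100x curvature gap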
arXiv Detail & Related papers (2024-01-30T19:35:43Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
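The forward-gradient idea can be sketched as follows. This toy uses weight-space perturbations with a central-difference stand-in for the true Jacobian-vector product (the function `f` and all constants are hypothetical); the paper's actual contribution, perturbing activations instead of weights to cut the estimator's variance, is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(w):
    return np.sum((w - 1.0) ** 2)  # toy loss with known gradient 2*(w - 1)

def forward_gradient(f, w, v, h=1e-5):
    """Weight-perturbation forward gradient: the directional derivative
    of f along v (here approximated by central differences, standing in
    for a forward-mode JVP) times v.  Since E_v[(g . v) v] = g when v
    has identity covariance, this is an unbiased gradient estimate
    obtained without backpropagation."""
    jvp = (f(w + h * v) - f(w - h * v)) / (2.0 * h)
    return jvp * v

w = np.zeros(10)
true_grad = 2.0 * (w - 1.0)

# Averaging many single-probe estimates recovers the true gradient.
est = np.mean(
    [forward_gradient(f, w, rng.standard_normal(10)) for _ in range(20000)],
    axis=0,
)
print(np.max(np.abs(est - true_grad)))  # shrinks as the number of probes grows
```

The variance of a single probe grows with the number of perturbed dimensions, which is exactly why perturbing (lower-dimensional) activations rather than weights, as the paper proposes, helps at scale.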
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Neural Networks can Learn Representations with Gradient Descent [68.95262816363288]
In specific regimes, neural networks trained by gradient descent behave like kernel methods.
In practice, it is known that neural networks strongly outperform their associated kernels.
arXiv Detail & Related papers (2022-06-30T09:24:02Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Learning to Accelerate by the Methods of Step-size Planning [11.65690857661528]
Gradient descent is slow to converge for ill-conditioned problems and non-dimensional problems.
Step-size adaptation is an important technique for acceleration.
We show that our methods can surpass the convergence rate of Nesterov's accelerated gradient method.
arXiv Detail & Related papers (2022-04-01T19:59:40Z) - Simple Stochastic and Online Gradient DescentAlgorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - Multi-task Supervised Learning via Cross-learning [102.64082402388192]
We consider a problem known as multi-task learning, consisting of fitting a set of regression functions intended for solving different tasks.
In our novel formulation, we couple the parameters of these functions, so that they learn in their task specific domains while staying close to each other.
This facilitates cross-fertilization, in which data collected across different domains help improve the learning performance on the other tasks.
arXiv Detail & Related papers (2020-10-24T21:35:57Z) - Backward Feature Correction: How Deep Learning Performs Deep
(Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)