Perspective: A Phase Diagram for Deep Learning unifying Jamming, Feature
Learning and Lazy Training
- URL: http://arxiv.org/abs/2012.15110v1
- Date: Wed, 30 Dec 2020 11:00:36 GMT
- Title: Perspective: A Phase Diagram for Deep Learning unifying Jamming, Feature
Learning and Lazy Training
- Authors: Mario Geiger, Leonardo Petrini and Matthieu Wyart
- Abstract summary: Deep learning algorithms are responsible for a technological revolution in a variety of tasks including image recognition or Go playing.
Yet, why they work is not understood. Ultimately, they manage to classify data lying in high dimension -- a feat generically impossible.
We argue that different learning regimes can be organized into a phase diagram.
- Score: 4.318555434063275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning algorithms are responsible for a technological revolution in a
variety of tasks including image recognition or Go playing. Yet, why they work
is not understood. Ultimately, they manage to classify data lying in high
dimension -- a feat generically impossible due to the geometry of high
dimensional space and the associated curse of dimensionality. Understanding
what kind of structure, symmetry or invariance makes data such as images
learnable is a fundamental challenge. Other puzzles include that (i) learning
corresponds to minimizing a loss in high dimension, which is in general not
convex and could well get stuck in bad minima. (ii) The predictive power of deep
learning increases with the number of fitting parameters, even in a regime where data
are perfectly fitted. In this manuscript, we review recent results elucidating
(i,ii) and the perspective they offer on the (still unexplained) curse of
dimensionality paradox. We base our theoretical discussion on the $(h,\alpha)$
plane where $h$ is the network width and $\alpha$ the scale of the output of
the network at initialization, and provide new systematic measures of
performance in that plane for MNIST and CIFAR-10. We argue that different
learning regimes can be organized into a phase diagram. A line of critical
points sharply delimits an under-parametrized phase from an over-parametrized
one. In over-parametrized nets, learning can operate in two regimes separated
by a smooth cross-over. At large initialization, it corresponds to a kernel
method, whereas for small initializations features can be learnt, together with
invariants in the data. We review the properties of these different phases, of
the transition separating them and some open questions. Our treatment
emphasizes analogies with physical systems, scaling arguments and the
development of numerical observables to quantitatively test these results
empirically.
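The role of the output scale $\alpha$ in selecting the lazy versus feature-learning regime can be made concrete with a small numerical experiment. The sketch below is not the authors' code; it assumes the commonly used centered parametrization $f_\alpha(x) = \alpha\,(f(w,x) - f(w_0,x))$ together with a $1/\alpha^2$ learning-rate rescaling, and the width $h$, the toy data, the hinge loss and the class name AlphaScaledNet are illustrative choices rather than details taken from the abstract.

```python
# A minimal sketch (not the authors' code) of the alpha-scaled output
# parametrization often used to probe the lazy vs feature-learning crossover:
# the trained function is f_alpha(x) = alpha * (f(w, x) - f(w0, x)), so a
# large alpha keeps the weights close to initialization (lazy/kernel regime)
# while a small alpha forces large weight changes (feature-learning regime).
# The width h, the alpha values, the hinge loss, the 1/alpha**2 learning-rate
# rescaling and the toy data are illustrative assumptions.

import copy
import torch
import torch.nn as nn


class AlphaScaledNet(nn.Module):
    def __init__(self, d_in: int, h: int, alpha: float):
        super().__init__()
        self.alpha = alpha
        self.net = nn.Sequential(
            nn.Linear(d_in, h), nn.ReLU(),
            nn.Linear(h, 1),
        )
        # Frozen copy of the network at initialization, used to center the output.
        self.net0 = copy.deepcopy(self.net)
        for p in self.net0.parameters():
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.alpha * (self.net(x) - self.net0(x))


def relative_weight_change(alpha: float, h: int = 128, steps: int = 500) -> float:
    torch.manual_seed(0)                      # same data and init for every alpha
    x = torch.randn(256, 10)                  # toy inputs, d = 10
    y = torch.sign(x[:, 0:1])                 # toy binary labels
    model = AlphaScaledNet(d_in=10, h=h, alpha=alpha)
    opt = torch.optim.SGD(model.parameters(), lr=0.05 / alpha**2)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.relu(1.0 - y * model(x)).mean()   # hinge loss
        loss.backward()
        opt.step()
    # ||w - w0|| / ||w0||: small in the lazy regime, large in the feature regime.
    with torch.no_grad():
        dw = sum((p - q).norm() ** 2 for p, q in zip(model.net.parameters(),
                                                     model.net0.parameters()))
        w0 = sum(q.norm() ** 2 for q in model.net0.parameters())
        return (dw / w0).sqrt().item()


if __name__ == "__main__":
    for alpha in (0.1, 1.0, 10.0):
        print(f"alpha={alpha:>5}: relative weight change = "
              f"{relative_weight_change(alpha):.3e}")
```

Sweeping $\alpha$ at fixed width $h$ in this sketch should show the relative weight change shrinking as $\alpha$ grows, the qualitative signature of the lazy (kernel) regime, and growing at small $\alpha$, where features can be learnt.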
Related papers
- Disentangled Representation Learning with the Gromov-Monge Gap [65.73194652234848]
Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning.
We introduce a novel approach to disentangled representation learning based on quadratic optimal transport.
We demonstrate the effectiveness of our approach for quantifying disentanglement across four standard benchmarks.
arXiv Detail & Related papers (2024-07-10T16:51:32Z) - Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions [20.036783417617652]
We investigate the training dynamics of two-layer shallow neural networks trained with gradient-based algorithms.
We show that a simple modification of the idealized single-pass gradient descent training scenario drastically improves its computational efficiency.
Our results highlight the ability of networks to learn relevant structures from data alone without any pre-processing.
arXiv Detail & Related papers (2024-05-24T11:34:31Z) - Why do Learning Rates Transfer? Reconciling Optimization and Scaling
Limits for Deep Learning [77.82908213345864]
We find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian is largely independent of the width and depth of the network.
We show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer.
arXiv Detail & Related papers (2024-02-27T12:28:01Z) - Synergy and Symmetry in Deep Learning: Interactions between the Data,
Model, and Inference Algorithm [33.59320315666675]
We study the triplet (D, M, I) as an integrated system and identify important synergies that help mitigate the curse of dimensionality.
We find that learning is most efficient when these symmetries are compatible with those of the data distribution.
arXiv Detail & Related papers (2022-07-11T04:08:21Z) - Learning sparse features can lead to overfitting in neural networks [9.2104922520782]
We show that feature learning can perform worse than lazy training.
Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth.
arXiv Detail & Related papers (2022-06-24T14:26:33Z) - Simple Stochastic and Online Gradient Descent Algorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - High-dimensional separability for one- and few-shot learning [58.8599521537]
This work is driven by a practical question: the correction of Artificial Intelligence (AI) errors.
Special external devices, correctors, are developed. They should provide a quick, non-iterative fix without modifying the legacy AI system.
New multi-correctors of AI systems are presented and illustrated with examples of predicting errors and learning new classes of objects by a deep convolutional neural network.
arXiv Detail & Related papers (2021-06-28T14:58:14Z) - Unsupervised mapping of phase diagrams of 2D systems from infinite
projected entangled-pair states via deep anomaly detection [0.0]
We demonstrate how to map out the phase diagram of a two-dimensional quantum many-body system with no prior physical knowledge.
As a benchmark, the phase diagram of the 2D frustrated bilayer Heisenberg model is analyzed.
We show that in order to get a good qualitative picture of the transition lines, it suffices to use data from the cost-efficient simple update optimization.
arXiv Detail & Related papers (2021-05-19T12:19:20Z) - A neural anisotropic view of underspecification in deep learning [60.119023683371736]
We show that the way neural networks handle the underspecification of problems is highly dependent on the data representation.
Our results highlight that understanding the architectural inductive bias in deep learning is fundamental to address the fairness, robustness, and generalization of these systems.
arXiv Detail & Related papers (2021-04-29T14:31:09Z) - Recurrent Multi-view Alignment Network for Unsupervised Surface
Registration [79.72086524370819]
Learning non-rigid registration in an end-to-end manner is challenging due to the inherent high degrees of freedom and the lack of labeled training data.
We propose to represent the non-rigid transformation with a point-wise combination of several rigid transformations.
We also introduce a differentiable loss function that measures the 3D shape similarity on the projected multi-view 2D depth images.
arXiv Detail & Related papers (2020-11-24T14:22:42Z)