Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and
Sparsity
- URL: http://arxiv.org/abs/2205.15809v1
- Date: Tue, 31 May 2022 14:10:15 GMT
- Title: Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and
Sparsity
- Authors: Arthur Jacot, Eugene Golikov, Clément Hongler, Franck Gabriel
- Abstract summary: We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_{\ell}$ of the training set.
This reformulation reveals the dynamics behind feature learning.
- Score: 9.077741848403791
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the loss surface of DNNs with $L_{2}$ regularization. We show that
the loss in terms of the parameters can be reformulated into a loss in terms of
the layerwise activations $Z_{\ell}$ of the training set. This reformulation
reveals the dynamics behind feature learning: each hidden representation
$Z_{\ell}$ is optimal w.r.t. an attraction/repulsion problem and
interpolates between the input and output representations, keeping only as much
information from the input as is necessary to construct the activations of the next
layer. For positively homogeneous non-linearities, the loss can be further
reformulated in terms of the covariances of the hidden representations, which
takes the form of a partially convex optimization over a convex cone.
This second reformulation allows us to prove a sparsity result for
homogeneous DNNs: any local minimum of the $L_{2}$-regularized loss can be
achieved with at most $N(N+1)$ neurons in each hidden layer (where $N$ is the
size of the training set). We show that this bound is tight by giving an
example of a local minimum that requires $N^{2}/4$ hidden neurons. We also
observe numerically that in more typical settings far fewer than $N^{2}$
neurons are required to reach the minima.
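As a rough numerical illustration of the sparsity claim in the abstract above (a hedged sketch, not the paper's construction: the architecture, data, optimizer, and pruning threshold are all arbitrary choices), one can train a deliberately over-wide, bias-free ReLU network with weight decay and count how many hidden neurons keep a non-negligible input-output path:

```python
# Hypothetical illustration of the sparsity effect of L2 regularization in a
# homogeneous (bias-free, ReLU) network; all hyperparameters are arbitrary.
import torch

torch.manual_seed(0)
N, d_in, width = 20, 5, 500          # N training points, deliberately over-wide layer
X = torch.randn(N, d_in)
y = torch.randn(N, 1)

# Bias-free => the network is positively homogeneous, as the paper requires.
model = torch.nn.Sequential(
    torch.nn.Linear(d_in, width, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1, bias=False),
)
# weight_decay implements the L2 penalty on all parameters.
opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-3)

for step in range(5000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()

# Count hidden neurons whose in- and out-going weights are both non-negligible.
w_in = model[0].weight                               # (width, d_in)
w_out = model[2].weight                              # (1, width)
scale = w_in.norm(dim=1) * w_out.abs().squeeze(0)    # per-neuron path scale
active = int((scale > 1e-3 * scale.max()).sum())
print(f"active neurons: {active}  (bound N(N+1) = {N*(N+1)})")
```

Because a bias-free ReLU network is positively homogeneous, the $L_{2}$ penalty acts on per-neuron path scales and drives redundant neurons toward zero, so the printed count typically lands far below the $N(N+1)$ bound, consistent with the numerical observation reported in the abstract.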
Related papers
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z)
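A minimal sketch of the setting in the paper above, under assumed hyperparameters (network width, step size, data, and iteration count are all illustrative): full-batch gradient descent with a fixed step size on a univariate ReLU network fit to noisy 1-D data.

```python
# Hypothetical sketch (hyperparameters assumed, not from the paper):
# full-batch gradient descent with a fixed step size on a univariate
# ReLU network fit to noisy 1-D regression data.
import torch

torch.manual_seed(0)
n = 40
x = torch.linspace(-1, 1, n).unsqueeze(1)
y = torch.sin(3 * x) + 0.3 * torch.randn(n, 1)      # noisy targets

model = torch.nn.Sequential(
    torch.nn.Linear(1, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1))
eta = 0.01                                          # fixed learning rate

for _ in range(20000):                              # full-batch GD
    loss = torch.nn.functional.mse_loss(model(x), y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= eta * p.grad

# The paper's claim: any minimum that is stable at this fixed step size
# represents a smooth function rather than a spiky interpolant of the noise.
grid = torch.linspace(-1, 1, 200).unsqueeze(1)
pred = model(grid).detach().squeeze()
print("prediction range:", float(pred.min()), float(pred.max()))
```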
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions learned by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^{3}K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Geometric structure of Deep Learning networks and construction of global ${\mathcal L}^2$ minimizers [1.189367612437469]
We explicitly determine local and global minimizers of the ${\mathcal L}^{2}$ cost function in underparametrized Deep Learning (DL) networks.
arXiv Detail & Related papers (2023-09-19T14:20:55Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove that with $m=\Omega(\log^{4}(n))$ neurons and $T\approx n$ training iterations, the training loss is minimized and the test loss is bounded by $\tilde{O}(1/\dots)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Spatially heterogeneous learning by a deep student machine [0.0]
Deep neural networks (DNNs) with a huge number of adjustable parameters remain largely black boxes.
We study supervised learning by a DNN of width $N$ and depth $L$, consisting of $NL$ perceptrons with $c$ inputs each, using a statistical-mechanics approach called the teacher-student setting.
We show that the problem becomes exactly solvable in what we call the 'dense limit': $N \gg c \gg 1$ and $M \gg 1$ with fixed $\alpha = M/c$, using the replica method developed in (H. Yoshino, …
arXiv Detail & Related papers (2023-02-15T01:09:03Z)
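A toy version of the teacher-student setting described above, with assumed sizes and a shallow architecture standing in for the paper's width-$N$, depth-$L$ network (the replica analysis itself is analytic and not reproduced here):

```python
# Toy teacher-student sketch (sizes and architecture assumed; the paper's
# analysis covers NL perceptrons via the replica method, which is analytic).
import torch

torch.manual_seed(0)
c, M = 50, 2000                        # inputs per perceptron, training set size
teacher = torch.nn.Sequential(
    torch.nn.Linear(c, c), torch.nn.Tanh(), torch.nn.Linear(c, 1))
for p in teacher.parameters():
    p.requires_grad_(False)            # the teacher is fixed

X = torch.randn(M, c)
y = teacher(X)                         # labels produced by the teacher

student = torch.nn.Sequential(
    torch.nn.Linear(c, c), torch.nn.Tanh(), torch.nn.Linear(c, 1))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(student(X), y)
    loss.backward()
    opt.step()

print(f"student training loss: {loss.item():.4f}  (alpha = M/c = {M / c:.1f})")
```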
- The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P^{*}\sim\sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - When Expressivity Meets Trainability: Fewer than $n$ Neurons Can Work [59.29606307518154]
We show that as long as the width $m \geq 2n/d$ (where $d$ is the input dimension), its expressivity is strong, i.e., there exists at least one global minimizer with zero training loss.
We also consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer.
arXiv Detail & Related papers (2022-10-21T14:41:26Z)
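A hedged numerical check of the width condition from the paper above (all training hyperparameters are assumptions): a two-layer ReLU network of width $m=\lceil 2n/d\rceil$ is trained on random data. The theorem only asserts that a zero-training-loss global minimizer exists at this width; whether plain gradient descent finds it is a separate question.

```python
# Hedged numerical check of the width threshold m >= 2n/d; the theorem
# guarantees existence of a zero-loss global minimizer, not that this
# particular optimizer reaches it. Hyperparameters are assumptions.
import math
import torch

torch.manual_seed(0)
n, d = 64, 16
m = math.ceil(2 * n / d)               # width from the condition m >= 2n/d
X = torch.randn(n, d)
y = torch.randn(n, 1)

model = torch.nn.Sequential(
    torch.nn.Linear(d, m), torch.nn.ReLU(), torch.nn.Linear(m, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(10000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()

print(f"width m = {m}, final training loss = {loss.item():.2e}")
```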
- Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Large-time asymptotics in deep learning [0.0]
We consider the impact of the final time $T$ (which may indicate the depth of a corresponding ResNet) in training.
For the classical $L^{2}$-regularized empirical risk minimization problem, we show that the training error is at most of the order $\mathcal{O}\left(\frac{1}{T}\right)$.
In the setting of $\ell^{p}$-distance losses, we prove that both the training error and the optimal parameters are at most of the order $\mathcal{O}\left(e^{-\mu\dots}\right)$
arXiv Detail & Related papers (2020-08-06T07:33:17Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
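A minimal sketch of the single-neuron setup above, assuming a noisy ReLU data model and plain gradient descent on the empirical square loss (the data distribution, initialization, and step size are illustrative, not from the paper):

```python
# Hypothetical sketch (data model, initialization, and step size assumed):
# learning a single ReLU neuron by gradient descent on the empirical
# square loss.
import torch

torch.manual_seed(0)
n, d = 1000, 10
X = torch.randn(n, d)
w_star = torch.randn(d)
y = torch.relu(X @ w_star) + 0.1 * torch.randn(n)    # noisy ReLU labels

w = (0.1 * torch.randn(d)).requires_grad_()          # small random init
opt = torch.optim.SGD([w], lr=0.05)                  # full-batch GD via SGD
for _ in range(2000):
    opt.zero_grad()
    loss = ((torch.relu(X @ w) - y) ** 2).mean()     # empirical square loss
    loss.backward()
    opt.step()

# OPT denotes the best achievable square loss (here the noise floor is
# about 0.01); the paper's guarantee has the form O(OPT^{1/2}) + epsilon.
print(f"final empirical square loss: {loss.item():.4f}")
```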