Feature selection with gradient descent on two-layer networks in
low-rotation regimes
- URL: http://arxiv.org/abs/2208.02789v1
- Date: Thu, 4 Aug 2022 17:43:36 GMT
- Title: Feature selection with gradient descent on two-layer networks in
low-rotation regimes
- Authors: Matus Telgarsky
- Abstract summary: This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks.
It makes use of margins as the core analytic technique.
- Score: 20.41989568533313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work establishes low test error of gradient flow (GF) and stochastic
gradient descent (SGD) on two-layer ReLU networks with standard initialization,
in three regimes where key sets of weights rotate little (either naturally due
to GF and SGD, or due to an artificial constraint), and making use of margins
as the core analytic technique. The first regime is near initialization,
specifically until the weights have moved by $\mathcal{O}(\sqrt m)$, where $m$
denotes the network width, which is in sharp contrast to the $\mathcal{O}(1)$
weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that
GF and SGD only need a network width and number of samples inversely
proportional to the NTK margin, and moreover that GF attains at least the NTK
margin itself, which suffices to establish escape from bad KKT points of the
margin objective, whereas prior work could only establish nondecreasing but
arbitrarily small margins. The second regime is the Neural Collapse (NC)
setting, where data lies in extremely-well-separated groups, and the sample
complexity scales with the number of groups; here the contribution over prior
work is an analysis of the entire GF trajectory from initialization. Lastly, if
the inner layer weights are constrained to change in norm only and can not
rotate, then GF with large widths achieves globally maximal margins, and its
sample complexity scales with their inverse; this is in contrast to prior work,
which required infinite width and a tricky dual convergence assumption. As
purely technical contributions, this work develops a variety of potential
functions and other tools which will hopefully aid future work.
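As a rough illustration of the setup analyzed above, the sketch below trains a width-$m$ two-layer ReLU network with standard Gaussian initialization by single-sample SGD on the logistic loss and tracks a normalized margin along the trajectory. The data, width, step size, and the exact margin normalization are illustrative assumptions, not the paper's construction.

```python
# Minimal sketch (illustrative, not the paper's code): a two-layer ReLU
# network f(x) = a^T relu(W x) with Gaussian initialization, trained by
# single-sample SGD on the logistic loss, tracking the normalized margin
#   min_i y_i f(x_i) / (||W||_F^2 + ||a||^2)
# (one common normalization for 2-homogeneous networks).
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 10, 512                       # samples, input dim, width (assumed values)
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])         # a simple separable labeling (assumption)

W = rng.standard_normal((m, d)) / np.sqrt(d)      # "standard" Gaussian initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

lr = 0.05
for step in range(2001):
    i = rng.integers(n)                      # single-sample SGD
    h = np.maximum(W @ X[i], 0.0)            # hidden ReLU activations
    f = a @ h                                # network output
    g = -y[i] / (1.0 + np.exp(np.clip(y[i] * f, -50, 50)))  # d(logistic loss)/df
    grad_a = g * h
    grad_W = g * np.outer(a * (h > 0), X[i])
    a -= lr * grad_a
    W -= lr * grad_W
    if step % 500 == 0:
        out = np.maximum(X @ W.T, 0.0) @ a   # full-batch outputs
        margin = np.min(y * out) / (np.sum(W ** 2) + np.sum(a ** 2))
        print(f"step {step:4d}  normalized margin {margin:+.5f}")
```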
Related papers
- Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods [43.32546195968771]
We study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation.
Our results address the shortcomings of the well-established Rademacher complexity-based bounds.
We show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution.
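For concreteness, one common form of the XOR distribution mentioned above labels Gaussian inputs by the product of the signs of the first two coordinates, which no linear classifier can fit better than chance; the exact setup varies across papers, so this is an assumed version.

```python
# Sketch of an XOR-style distribution (assumed version; exact setups vary):
# label = sign(x_1) * sign(x_2), unlearnable by any fixed linear predictor.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 20
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0]) * np.sign(X[:, 1])

w = rng.standard_normal(d)                   # any fixed linear predictor
print("linear accuracy:", np.mean(np.sign(X @ w) == y))   # ~0.5
```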
arXiv Detail & Related papers (2024-10-13T21:49:29Z)
- Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression [8.130817534654089]
In this paper, we consider nonparametric regression by a two-layer neural network trained by gradient descent (GD) or its variants.
We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has the spectral bias widely studied in the deep learning literature, the trained network attains a particularly sharp generalization bound with a minimax optimal rate of $\mathcal{O}(1/n^{4\alpha/(4\alpha+1)})$.
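The sketch below shows a generic preconditioned gradient descent update with early stopping on a toy least-squares problem; the paper's actual preconditioner and network training are not reproduced, and the damped inverse curvature matrix used here is only a placeholder.

```python
# Generic preconditioned gradient descent with early stopping (sketch only;
# P below is an illustrative damped inverse curvature matrix, not the
# paper's preconditioner).
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

H = A.T @ A / n                              # curvature of the least-squares loss
P = np.linalg.inv(H + 0.1 * np.eye(d))       # placeholder preconditioner
theta, lr, T_stop = np.zeros(d), 0.5, 50     # early stopping after T_stop iterations

for _ in range(T_stop):
    grad = A.T @ (A @ theta - b) / n
    theta -= lr * P @ grad                   # preconditioned update

print("train loss:", 0.5 * np.mean((A @ theta - b) ** 2))
```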
arXiv Detail & Related papers (2024-07-16T03:38:34Z)
- Implicit Bias and Fast Convergence Rates for Self-attention [30.08303212679308]
Self-attention, the core mechanism of transformers, distinguishes them from traditional neural networks and drives their outstanding performance.
We investigate the implicit bias of gradient descent (GD) in training a self-attention layer with a fixed linear decoder in binary classification.
We provide the first finite-time convergence rate for $W_t$ to $W_{mm}$, along with the rate of sparsification in the attention map.
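A minimal sketch of a single self-attention layer with a fixed linear decoder, in the spirit of the model described above; the parameterization (a single trainable matrix W, first token used as query) is a simplifying assumption and may differ from the paper's.

```python
# Sketch of a single self-attention layer with a fixed linear decoder u
# (simplified parameterization, an assumption):
#   f(X) = u^T X^T softmax(X W x_q), with the first token x_q as the query.
import numpy as np

rng = np.random.default_rng(3)
T, d = 6, 8                                  # tokens, embedding dim (assumed)
X = rng.standard_normal((T, d))              # one input sequence
x_q = X[0]                                   # query token (assumption)
u = rng.standard_normal(d)                   # fixed decoder, not trained
W = 0.01 * rng.standard_normal((d, d))       # trainable attention weights W_t

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

attn = softmax(X @ W @ x_q)                  # attention map over tokens
f = u @ (X.T @ attn)                         # scalar score; its sign is the binary label
print("attention weights:", np.round(attn, 3), "score:", round(float(f), 3))
```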
arXiv Detail & Related papers (2024-02-08T15:15:09Z)
- Approximation Results for Gradient Descent trained Neural Networks [0.0]
The networks are fully connected, of constant depth and increasing width.
A bound in the continuous kernel error norm implies an approximation result under the natural smoothness assumption required for smooth functions.
arXiv Detail & Related papers (2023-09-09T18:47:55Z)
- Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z)
- $\texttt{FedBC}$: Calibrating Global and Local Models via Federated Learning Beyond Consensus [66.62731854746856]
In federated learning (FL), the objective of collaboratively learning a global model through aggregation of model updates across devices tends to oppose the goal of personalization via local information.
In this work, we calibrate this tradeoff in a quantitative manner through a multi-criterion-based optimization.
We demonstrate that $\texttt{FedBC}$ balances the global and local model test accuracy metrics across a suite of datasets.
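For background only, the sketch below shows plain FedAvg-style aggregation, not $\texttt{FedBC}$ itself; it illustrates the tension between local fitting and global averaging that the calibration above targets.

```python
# Background sketch: plain FedAvg aggregation (NOT FedBC), showing the
# global/local tension: local steps fit each client's data, while server-side
# averaging pulls the client models back toward a single global model.
import numpy as np

rng = np.random.default_rng(5)
clients = [(rng.standard_normal((30, 4)), rng.standard_normal(30)) for _ in range(5)]
w_global = np.zeros(4)

for _ in range(20):                          # communication rounds
    local = []
    for A, y in clients:
        w = w_global.copy()
        for _ in range(10):                  # local gradient steps on this client's data
            w -= 0.05 * A.T @ (A @ w - y) / len(y)
        local.append(w)
    w_global = np.mean(local, axis=0)        # server-side averaging

print("global model:", np.round(w_global, 3))
```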
arXiv Detail & Related papers (2022-06-22T02:42:04Z)
- On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
- The Sample Complexity of One-Hidden-Layer Neural Networks [57.6421258363243]
We study a class of scalar-valued one-hidden-layer networks with inputs bounded in Euclidean norm.
We prove that controlling the spectral norm of the hidden layer weight matrix is insufficient to get uniform convergence guarantees.
We analyze two important settings where a mere spectral norm control turns out to be sufficient.
arXiv Detail & Related papers (2022-02-13T07:12:02Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
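A small sketch of the kind of diagnostic described above: estimating the spectral norm of the loss Hessian by power iteration on Hessian-vector products, here computed by finite differences of gradients on a toy least-squares model rather than the paper's networks.

```python
# Sketch: estimate the Hessian spectral norm by power iteration on
# Hessian-vector products (finite differences of gradients; toy model).
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 10
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad(theta):
    return A.T @ (A @ theta - y) / n         # gradient of 0.5 * mean squared error

def hvp(theta, v, eps=1e-4):                 # finite-difference Hessian-vector product
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)

theta = rng.standard_normal(d)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
for _ in range(50):                          # power iteration
    Hv = hvp(theta, v)
    v = Hv / np.linalg.norm(Hv)

print("estimated Hessian spectral norm:", float(v @ hvp(theta, v)))
print("exact top eigenvalue:", np.linalg.eigvalsh(A.T @ A / n)[-1])
```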
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.