When Expressivity Meets Trainability: Fewer than $n$ Neurons Can Work
- URL: http://arxiv.org/abs/2210.12001v1
- Date: Fri, 21 Oct 2022 14:41:26 GMT
- Title: When Expressivity Meets Trainability: Fewer than $n$ Neurons Can Work
- Authors: Jiawei Zhang, Yushun Zhang, Mingyi Hong, Ruoyu Sun, Zhi-Quan Luo
- Abstract summary: We show that as long as the width $m \geq 2n/d$ (where $d$ is the input dimension), the network's expressivity is strong, i.e., there exists at least one global minimizer with zero training loss.
We also consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer.
- Score: 59.29606307518154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern neural networks are often quite wide, causing large memory and
computation costs. It is thus of great interest to train a narrower network.
However, training narrow neural nets remains a challenging task. We ask two
theoretical questions: Can narrow networks have as strong expressivity as wide
ones? If so, does the loss function exhibit a benign optimization landscape? In
this work, we provide partially affirmative answers to both questions for
1-hidden-layer networks with fewer than $n$ (sample size) neurons when the
activation is smooth. First, we prove that as long as the width $m \geq 2n/d$
(where $d$ is the input dimension), its expressivity is strong, i.e., there
exists at least one global minimizer with zero training loss. Second, we
identify a nice local region with no local-min or saddle points. Nevertheless,
it is not clear whether gradient descent can stay in this nice region. Third,
we consider a constrained optimization formulation where the feasible region is
the nice local region, and prove that every KKT point is a nearly global
minimizer. It is expected that projected gradient methods converge to KKT
points under mild technical conditions, but we leave the rigorous convergence
analysis to future work. Thorough numerical results show that projected
gradient methods on this constrained formulation significantly outperform SGD
for training narrow neural nets.
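The abstract does not spell out the paper's constraint set or projection, so the following is only a minimal sketch of the projected-gradient idea for a narrow 1-hidden-layer network with a smooth activation. The tanh activation, the random Gaussian data, the step size, and the projection onto a Euclidean ball of radius R around the initialization (standing in for the "nice local region") are illustrative assumptions, not the authors' actual formulation.

```python
# Minimal sketch (not the paper's exact formulation): projected gradient descent
# for a narrow 1-hidden-layer network f(x) = sum_k a_k * tanh(w_k . x) with
# width of order 2n/d, where after each gradient step the hidden-layer weights
# are projected back onto a Euclidean ball around their initialization.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50                 # sample size and input dimension
m = max(2 * n // d, 1)         # width of order 2n/d, matching the expressivity bound
R = 5.0                        # assumed radius of the feasible region (illustrative)
lr, steps = 1e-2, 2000

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

W0 = rng.standard_normal((m, d)) / np.sqrt(d)   # initialization (center of the region)
W = W0.copy()
a = rng.standard_normal(m) / np.sqrt(m)

def loss_and_grads(W, a):
    H = np.tanh(X @ W.T)                  # (n, m) hidden activations
    r = H @ a - y                         # residuals
    loss = 0.5 * np.mean(r ** 2)
    gZ = np.outer(r, a) * (1.0 - H ** 2)  # chain rule through tanh; 1/n applied below
    gW = gZ.T @ X / n
    ga = H.T @ r / n
    return loss, gW, ga

for t in range(steps):
    loss, gW, ga = loss_and_grads(W, a)
    W -= lr * gW
    a -= lr * ga
    # projection step: pull W back into the ball ||W - W0||_F <= R
    dist = np.linalg.norm(W - W0)
    if dist > R:
        W = W0 + (W - W0) * (R / dist)

loss, _, _ = loss_and_grads(W, a)
print(f"final training loss: {loss:.4f}")
```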
Related papers
- Memorization Capacity for Additive Fine-Tuning with Small ReLU Networks [16.320374162259117]
Fine-Tuning Capacity (FTC) is defined as the maximum number of samples a neural network can fit in the fine-tuning stage.
We show that $N$ samples can be fine-tuned with $m=\Theta(N)$ neurons for 2-layer networks, and with $m=\Theta(\sqrt{N})$ neurons for 3-layer networks, no matter how large $K$ is.
arXiv Detail & Related papers (2024-08-01T07:58:51Z)
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find the solutions reachable by our training procedure, including its optimizer and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- Benign Overfitting for Two-layer ReLU Convolutional Neural Networks [60.19739010031304]
We establish algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise.
We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk.
arXiv Detail & Related papers (2023-03-07T18:59:38Z)
- The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets, on the order of $P^* \sim \sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z)
- Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with a quadratic loss function, a fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
These results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the NTK regime.
arXiv Detail & Related papers (2022-12-05T14:47:52Z)
- Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and Sparsity [9.077741848403791]
We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_\ell$ of the training set.
This reformulation reveals the dynamics behind feature learning.
arXiv Detail & Related papers (2022-05-31T14:10:15Z)
- Overparameterization of deep ResNet: zero loss and mean-field analysis [19.45069138853531]
Finding parameters in a deep neural network (NN) that fit data is a nonconvex optimization problem.
We show that a basic first-order optimization method (gradient descent) finds a global solution with perfect fit in many practical situations.
We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability.
arXiv Detail & Related papers (2021-05-30T02:46:09Z)
- A Geometric Analysis of Neural Collapse with Unconstrained Features [40.66585948844492]
We provide the first global optimization landscape analysis of Neural Collapse.
This phenomenon arises in the last-layer classifiers and features of neural networks during the terminal phase of training.
arXiv Detail & Related papers (2021-05-06T00:00:50Z)
- A Revision of Neural Tangent Kernel-based Approaches for Neural Networks [34.75076385561115]
We use the neural tangent kernel (NTK) to show that networks can fit any finite training sample perfectly (a toy illustration of this interpolation argument is sketched after this list).
A simple and analytic kernel function is derived and shown to be equivalent to a fully-trained network.
Our tighter analysis resolves the scaling problem and enables the validation of the original NTK-based results.
arXiv Detail & Related papers (2020-07-02T05:07:55Z)
- Towards Understanding Hierarchical Learning: Benefits of Neural Representations [160.33479656108926]
In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks.
We show that neural representations can achieve improved sample complexity compared with learning on the raw input.
Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.
arXiv Detail & Related papers (2020-06-24T02:44:54Z)
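As referenced in the "A Revision of Neural Tangent Kernel-based Approaches for Neural Networks" entry above, here is a minimal, self-contained sketch of the standard NTK-style interpolation argument, not that paper's specific construction: for a toy 2-layer ReLU network at random initialization, the empirical NTK Gram matrix built from the parameter Jacobian is positive definite, which is what lets the linearized model fit any finite set of labels. The network size, data, and initialization below are illustrative assumptions.

```python
# Sketch: empirical NTK Gram matrix of a toy 2-layer ReLU network.
# A positive-definite Gram matrix K = J J^T means the linearized model
# can interpolate any finite set of training labels.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 30, 10, 512          # toy sample size, input dim, width (illustrative)

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # distinct unit-norm inputs

W = rng.standard_normal((m, d))                 # hidden weights
a = rng.choice([-1.0, 1.0], size=m)             # output weights

Z = X @ W.T                                     # (n, m) pre-activations
H = np.maximum(Z, 0.0)                          # ReLU features
dH = (Z > 0).astype(float)                      # ReLU derivative

# Jacobian of f(x_i) = (1/sqrt(m)) * sum_k a_k * relu(w_k . x_i)
# with respect to (W, a), flattened per sample.
JW = (a * dH)[:, :, None] * X[:, None, :] / np.sqrt(m)   # (n, m, d)
Ja = H / np.sqrt(m)                                       # (n, m)
J = np.concatenate([JW.reshape(n, -1), Ja], axis=1)       # (n, m*d + m)

K = J @ J.T                                               # empirical NTK Gram matrix
lam_min = np.linalg.eigvalsh(K).min()
print(f"smallest eigenvalue of the empirical NTK: {lam_min:.4f}")
# lam_min > 0 means K is invertible, so the parameter direction J^T K^{-1} y
# drives the linearized network to zero training error for any labels y.
```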