Just How Flexible are Neural Networks in Practice?
- URL: http://arxiv.org/abs/2406.11463v1
- Date: Mon, 17 Jun 2024 12:24:45 GMT
- Title: Just How Flexible are Neural Networks in Practice?
- Authors: Ravid Shwartz-Ziv, Micah Goldblum, Arpit Bansal, C. Bayan Bruss, Yann LeCun, Andrew Gordon Wilson,
- Abstract summary: It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions via our training procedure, including the gradient and regularizers, limiting flexibility.
- Score: 89.80474583606242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) while stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples can be predictive of generalization; (5) ReLU activation functions result in finding minima that fit more data despite being designed to avoid vanishing and exploding gradients in deep architectures.
Related papers
- Evolutionary Retrofitting [42.21143557577615]
AfterLearnER consists in applying non-differentiable optimization, including evolutionary methods, to fully-trained machine learning models.
The efficiency of AfterLearnER is demonstrated by tackling non-differentiable signals such as threshold-based criteria in depth sensing, the word error rate in speech re-synthesis, image quality in 3D generative adversarial networks (GANs)
The advantages of AfterLearnER are its versatility (no gradient is needed), the possibility to use non-differentiable feedback including human evaluations, the limited overfitting, supported by a theoretical study and its anytime behavior.
arXiv Detail & Related papers (2024-10-15T06:59:32Z) - Pruning Convolutional Filters via Reinforcement Learning with Entropy
Minimization [0.0]
We introduce a novel information-theoretic reward function which minimizes the spatial entropy of convolutional activations.
Our method shows that there is another possibility to preserve accuracy without the need to directly optimize it in the agent's reward function.
arXiv Detail & Related papers (2023-12-08T09:34:57Z) - More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory [12.689249854199982]
We show that the test risk of RF regression decreases monotonically with both the number of features and the number of samples.
We then demonstrate that, for a large class of tasks characterized by powerlaw eigenstructure, training to near-zero training loss is obligatory.
arXiv Detail & Related papers (2023-11-24T18:27:41Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning makes significant progress in pre-training large models, but struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - Adaptive Self-supervision Algorithms for Physics-informed Neural
Networks [59.822151945132525]
Physics-informed neural networks (PINNs) incorporate physical knowledge from the problem domain as a soft constraint on the loss function.
We study the impact of the location of the collocation points on the trainability of these models.
We propose a novel adaptive collocation scheme which progressively allocates more collocation points to areas where the model is making higher errors.
arXiv Detail & Related papers (2022-07-08T18:17:06Z) - GOALS: Gradient-Only Approximations for Line Searches Towards Robust and
Consistent Training of Deep Neural Networks [0.0]
Mini-batch sub-sampling (MBSS) is favored in deep neural network training to reduce the computational cost.
We propose a gradient-only approximation line search (GOALS) with strong convergence characteristics with defined optimality criterion.
arXiv Detail & Related papers (2021-05-23T11:21:01Z) - Adaptive Low-Rank Factorization to regularize shallow and deep neural
networks [9.607123078804959]
We use Low-Rank matrix Factorization (LRF) to drop out some parameters of the learning model along the training process.
The best results of AdaptiveLRF on SVHN and CIFAR-10 datasets are 98%, 94.1% F-measure, and 97.9%, 94% accuracy.
arXiv Detail & Related papers (2020-05-05T08:13:30Z) - Parameter-Efficient Transfer from Sequential Behaviors for User Modeling
and Recommendation [111.44445634272235]
In this paper, we develop a parameter efficient transfer learning architecture, termed as PeterRec.
PeterRec allows the pre-trained parameters to remain unaltered during fine-tuning by injecting a series of re-learned neural networks.
We perform extensive experimental ablation to show the effectiveness of the learned user representation in five downstream tasks.
arXiv Detail & Related papers (2020-01-13T14:09:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.