The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks
- URL: http://arxiv.org/abs/2210.03820v1
- Date: Fri, 7 Oct 2022 21:14:09 GMT
- Title: The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks
- Authors: Daniel Kunin, Atsushi Yamamura, Chao Ma, Surya Ganguli
- Abstract summary: We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations.
We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model where all parameters are treated equally.
- Score: 26.58848653965855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we explore the maximum-margin bias of quasi-homogeneous neural
networks trained with gradient flow on an exponential loss and past a point of
separability. We introduce the class of quasi-homogeneous models, which is
expressive enough to describe nearly all neural networks with homogeneous
activations, even those with biases, residual connections, and normalization
layers, while structured enough to enable geometric analysis of its gradient
dynamics. Using this analysis, we generalize the existing results of
maximum-margin bias for homogeneous networks to this richer class of models. We
find that gradient flow implicitly favors a subset of the parameters, unlike in
the case of a homogeneous model where all parameters are treated equally. We
demonstrate through simple examples how this strong favoritism toward
minimizing an asymmetric norm can degrade the robustness of quasi-homogeneous
models. On the other hand, we conjecture that this norm-minimization discards,
when possible, unnecessary higher-order parameters, reducing the model to a
sparser parameterization. Lastly, by applying our theorem to sufficiently
expressive neural networks with normalization layers, we reveal a universal
mechanism behind the empirical phenomenon of Neural Collapse.
Related papers
- Generalization for Least Squares Regression With Simple Spiked Covariances [3.9134031118910264]
The generalization properties of even two-layer neural networks trained by gradient descent remain poorly understood.
Recent work has made progress by describing the spectrum of the feature matrix at the hidden layer.
Yet, the generalization error for linear models with spiked covariances has not been previously determined.
arXiv Detail & Related papers (2024-10-17T19:46:51Z) - Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures [93.17009514112702]
Pruning, setting a significant subset of the parameters of a neural network to zero, is one of the most popular methods of model compression.
Despite existing evidence for this phenomenon, the relationship between neural network pruning and induced bias is not well-understood.
arXiv Detail & Related papers (2023-04-25T07:42:06Z) - Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector
Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the emphStochastic Scaled-Gradient Descent (SSD) algorithm.
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Nonasymptotic theory for two-layer neural networks: Beyond the
bias-variance trade-off [10.182922771556742]
We present a nonasymptotic generalization theory for two-layer neural networks with ReLU activation function.
We show that overparametrized random feature models suffer from the curse of dimensionality and thus are suboptimal.
arXiv Detail & Related papers (2021-06-09T03:52:18Z) - Understanding Double Descent Requires a Fine-Grained Bias-Variance
Decomposition [34.235007566913396]
We describe an interpretable, symmetric decomposition of the variance into terms associated with the labels.
We find that the bias decreases monotonically with the network width, but the variance terms exhibit non-monotonic behavior.
We also analyze the strikingly rich phenomenology that arises.
arXiv Detail & Related papers (2020-11-04T21:04:02Z) - The Neural Tangent Kernel in High Dimensions: Triple Descent and a
Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a emphdouble descent curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z) - Provably Efficient Neural Estimation of Structural Equation Model: An
Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized Structural equation models (SEMs)
We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using a gradient descent.
For the first time we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
arXiv Detail & Related papers (2020-07-02T17:55:47Z) - Hyperbolic Neural Networks++ [66.16106727715061]
We generalize the fundamental components of neural networks in a single hyperbolic geometry model, namely, the Poincar'e ball model.
Experiments show the superior parameter efficiency of our methods compared to conventional hyperbolic components, and stability and outperformance over their Euclidean counterparts.
arXiv Detail & Related papers (2020-06-15T08:23:20Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z) - Learning CHARME models with neural networks [1.5362025549031046]
We consider a model called CHARME (Conditional Heteroscedastic Autoregressive Mixture of Experts)
As an application, we develop a learning theory for the NN-based autoregressive functions of the model.
arXiv Detail & Related papers (2020-02-08T21:51:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.