On the Principle of Least Symmetry Breaking in Shallow ReLU Models
- URL: http://arxiv.org/abs/1912.11939v3
- Date: Thu, 28 Dec 2023 14:44:06 GMT
- Title: On the Principle of Least Symmetry Breaking in Shallow ReLU Models
- Authors: Yossi Arjevani, Michael Field
- Abstract summary: We show that the spurious local minima detected by SGD exhibit the least loss of symmetry with respect to the target weights; a closer look at the analysis indicates that this principle of least symmetry breaking may apply to a broader range of settings.
Motivated by this, we conduct a series of experiments which corroborate this hypothesis for different classes of non-isotropic, non-product distributions, smooth activation functions, and networks with a few layers.
- Score: 13.760721677322072
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the optimization problem associated with fitting two-layer ReLU
networks with respect to the squared loss, where labels are assumed to be
generated by a target network. Focusing first on standard Gaussian inputs, we
show that the structure of spurious local minima detected by stochastic
gradient descent (SGD) is, in a well-defined sense, the \emph{least loss of
symmetry} with respect to the target weights. A closer look at the analysis
indicates that this principle of least symmetry breaking may apply to a broader
range of settings. Motivated by this, we conduct a series of experiments which
corroborate this hypothesis for different classes of non-isotropic non-product
distributions, smooth activation functions and networks with a few layers.
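For concreteness, the setting described in the abstract can be sketched as a teacher-student experiment: a fixed two-layer ReLU target network generates labels from standard Gaussian inputs, and a student of the same architecture is fit to the squared loss with SGD. The sketch below (PyTorch) is only an illustration under assumed widths, learning rate, sample size, and parameterization (second-layer weights fixed to one); it is not the authors' code.

```python
# Minimal teacher-student sketch of the setup in the abstract
# (illustrative hyperparameters; not the authors' code).
import torch

torch.manual_seed(0)
d, k = 20, 6  # input dimension and number of hidden ReLU units (assumed)

def two_layer_relu(W, x):
    # f_W(x) = sum_i [w_i . x]_+ ; second-layer weights fixed to 1
    # (an assumption about the exact parameterization).
    return torch.relu(x @ W.t()).sum(dim=1)

W_target = torch.randn(k, d)               # fixed target (teacher) weights
W = torch.randn(k, d, requires_grad=True)  # student weights, random init
opt = torch.optim.SGD([W], lr=1e-3)

for step in range(20_000):
    x = torch.randn(256, d)                # standard Gaussian inputs
    y = two_layer_relu(W_target, x)        # labels generated by the target net
    loss = ((two_layer_relu(W, x) - y) ** 2).mean()  # squared loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# Runs that plateau at a nonzero loss have reached a spurious minimum; the
# paper studies how the symmetry of such minima relates to that of W_target.
```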
Related papers
- Lie Point Symmetry and Physics Informed Networks [59.56218517113066]
We propose a loss function that informs the network about Lie point symmetries in the same way that PINN models try to enforce the underlying PDE through a loss function.
Our symmetry loss ensures that the infinitesimal generators of the Lie group conserve the PDE solutions.
Empirical evaluations indicate that the inductive bias introduced by the Lie point symmetries of the PDEs greatly boosts the sample efficiency of PINNs.
arXiv Detail & Related papers (2023-11-07T19:07:16Z) - Towards Training Without Depth Limits: Batch Normalization Without
Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch normalization that provably has bounded gradients at arbitrary depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z) - Symmetries, flat minima, and the conserved quantities of gradient flow [20.12938444246729]
We present a framework for finding continuous symmetries in the parameter space, which carves out low-loss valleys.
To generalize this framework to nonlinear neural networks, we introduce a novel set of nonlinear, data-dependent symmetries.
arXiv Detail & Related papers (2022-10-31T10:55:30Z) - Annihilation of Spurious Minima in Two-Layer ReLU Networks [9.695960412426672]
We study the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss.
We show that adding neurons can turn symmetric spurious minima into saddles.
We also prove the existence of descent directions in certain subspaces arising from the symmetry structure of the loss function.
arXiv Detail & Related papers (2022-10-12T11:04:21Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Sharp asymptotics on the compression of two-layer neural networks [19.683271092724937]
We study the compression of a target two-layer neural network with N nodes into a compressed network with M < N nodes.
We conjecture that the optimum of the simplified optimization problem is achieved by taking weights on the Equiangular Tight Frame (ETF).
arXiv Detail & Related papers (2022-05-17T09:45:23Z) - Deep Networks on Toroids: Removing Symmetries Reveals the Structure of
Flat Regions in the Landscape Geometry [3.712728573432119]
We develop a standardized parameterization in which all symmetries are removed, resulting in a toroidal topology.
We derive a meaningful notion of the flatness of minimizers and of the geodesic paths connecting them.
We also find that minimizers found by variants of gradient descent can be connected by zero-error paths with a single bend.
arXiv Detail & Related papers (2022-02-07T09:57:54Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that neural network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)