Symmetries, flat minima, and the conserved quantities of gradient flow
- URL: http://arxiv.org/abs/2210.17216v2
- Date: Thu, 23 Mar 2023 15:10:19 GMT
- Title: Symmetries, flat minima, and the conserved quantities of gradient flow
- Authors: Bo Zhao, Iordan Ganev, Robin Walters, Rose Yu, Nima Dehmamy
- Abstract summary: We present a framework for finding continuous symmetries in the parameter space, which carve out low-loss valleys.
To generalize this framework to nonlinear neural networks, we introduce a novel set of nonlinear, data-dependent symmetries.
- Score: 20.12938444246729
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Empirical studies of the loss landscape of deep networks have revealed that
many local minima are connected through low-loss valleys. Yet, little is known
about the theoretical origin of such valleys. We present a general framework
for finding continuous symmetries in the parameter space, which carve out
low-loss valleys. Our framework uses equivariances of the activation functions
and can be applied to different layer architectures. To generalize this
framework to nonlinear neural networks, we introduce a novel set of nonlinear,
data-dependent symmetries. These symmetries can transform a trained model such
that it performs similarly on new samples, which allows ensemble building that
improves robustness under certain adversarial attacks. We then show that
conserved quantities associated with linear symmetries can be used to define
coordinates along low-loss valleys. The conserved quantities help reveal that
using common initialization methods, gradient flow only explores a small part
of the global minimum. By relating conserved quantities to convergence rate and
sharpness of the minimum, we provide insights on how initialization impacts
convergence and generalizability.
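As a concrete instance of the linear symmetries and conserved quantities described above, consider the standard per-neuron rescaling symmetry of a two-layer ReLU network f(x) = a^T ReLU(Wx): multiplying row i of W by lambda > 0 and dividing a_i by lambda leaves f unchanged, and along gradient flow the quantity Q_i = ||W_i||^2 - a_i^2 is conserved. The NumPy sketch below is not the authors' code; the toy data, layer sizes, and step size are illustrative assumptions. It checks numerically that each Q_i drifts only slightly under small-step gradient descent, which approximates gradient flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a two-layer ReLU network f(x) = a^T relu(W x).
# Sizes, data, and learning rate are arbitrary illustrative choices.
n, d, h = 64, 5, 8
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
W = 0.5 * rng.normal(size=(h, d))   # first-layer weights
a = 0.5 * rng.normal(size=h)        # second-layer weights

def loss_and_grads(W, a):
    pre = X @ W.T                   # (n, h) pre-activations
    act = np.maximum(pre, 0.0)      # ReLU
    pred = act @ a                  # (n,) network outputs
    err = pred - y
    loss = 0.5 * np.mean(err ** 2)
    # Backpropagation by hand for the mean squared loss.
    g_pred = err / n                # dL/dpred
    g_a = act.T @ g_pred            # dL/da
    g_act = np.outer(g_pred, a)     # dL/dact
    g_pre = g_act * (pre > 0)       # ReLU gate
    g_W = g_pre.T @ X               # dL/dW
    return loss, g_W, g_a

# Per-neuron quantity conserved by gradient flow under the ReLU
# rescaling symmetry: Q_i = ||W_i||^2 - a_i^2.
def Q(W, a):
    return (W ** 2).sum(axis=1) - a ** 2

q0 = Q(W, a)
lr = 1e-3
for _ in range(2000):               # small-step GD approximates gradient flow
    loss, g_W, g_a = loss_and_grads(W, a)
    W -= lr * g_W
    a -= lr * g_a

drift = np.abs(Q(W, a) - q0).max()
# The drift is small and shrinks further as the step size decreases.
print(f"final loss {loss:.4f}, max |Q_i - Q_i(0)| = {drift:.2e}")
```

Because each Q_i is fixed at initialization, different initialization scales land on different points of the same low-loss valley; this is one way to read the abstract's claim that conserved quantities serve as coordinates along the valley and link initialization to convergence rate and sharpness.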
Related papers
- Implicit Balancing and Regularization: Generalization and Convergence
Guarantees for Overparameterized Asymmetric Matrix Sensing [28.77440901439686]
A series of recent papers have begun to generalize this role to non-random Positive Semi-Definite (PSD) matrix sensing problems.
In this paper, we show that the trajectory of gradient descent from small random initialization moves towards solutions that are both globally optimal and generalize well.
arXiv Detail & Related papers (2023-03-24T19:05:52Z) - Learning Discretized Neural Networks under Ricci Flow [51.36292559262042]
We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations.
DNNs suffer from either infinite or zero gradients due to the non-differentiable discrete function during training.
arXiv Detail & Related papers (2023-02-07T10:51:53Z) - Oracle-Preserving Latent Flows [58.720142291102135]
We develop a methodology for the simultaneous discovery of multiple nontrivial continuous symmetries across an entire labelled dataset.
The symmetry transformations and the corresponding generators are modeled with fully connected neural networks trained with a specially constructed loss function.
The two new elements in this work are the use of a reduced-dimensionality latent space and the generalization to transformations invariant with respect to high-dimensional oracles.
arXiv Detail & Related papers (2023-02-02T00:13:32Z) - Annihilation of Spurious Minima in Two-Layer ReLU Networks [9.695960412426672]
We study the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss.
We show that adding neurons can turn symmetric spurious minima into saddles.
We also prove the existence of descent directions in certain subspaces arising from the symmetry structure of the loss function.
arXiv Detail & Related papers (2022-10-12T11:04:21Z) - The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks [26.58848653965855]
We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations.
We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model where all parameters are treated equally.
arXiv Detail & Related papers (2022-10-07T21:14:09Z) - Deep Networks on Toroids: Removing Symmetries Reveals the Structure of
Flat Regions in the Landscape Geometry [3.712728573432119]
We develop a standardized parameterization in which all symmetries are removed, resulting in a toroidal topology.
We derive a meaningful notion of the flatness of minimizers and of the geodesic paths connecting them.
We also find that minimizers found by variants of gradient descent can be connected by zero-error paths with a single bend.
arXiv Detail & Related papers (2022-02-07T09:57:54Z) - GELATO: Geometrically Enriched Latent Model for Offline Reinforcement
Learning [54.291331971813364]
Offline reinforcement learning approaches can be divided into proximal and uncertainty-aware methods.
In this work, we demonstrate the benefit of combining the two in a latent variational model.
Our proposed metrics measure both the quality of out-of-distribution samples and the discrepancy of examples in the data.
arXiv Detail & Related papers (2021-02-22T19:42:40Z) - Optimizing Mode Connectivity via Neuron Alignment [84.26606622400423]
Empirically, the local minima of loss functions can be connected by a learned curve in model space along which the loss remains nearly constant.
We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected; a minimal sketch of this permutation symmetry appears after this list.
arXiv Detail & Related papers (2020-09-05T02:25:23Z) - Log-Likelihood Ratio Minimizing Flows: Towards Robust and Quantifiable
Neural Distribution Alignment [52.02794488304448]
We propose a new distribution alignment method based on a log-likelihood ratio statistic and normalizing flows.
We experimentally verify that minimizing the resulting objective results in domain alignment that preserves the local structure of input domains.
arXiv Detail & Related papers (2020-03-26T22:10:04Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z) - On the Principle of Least Symmetry Breaking in Shallow ReLU Models [13.760721677322072]
We show that the least loss of symmetry with respect to the target weights may apply to a broader range of settings.
Motivated by this, we conduct a series of experiments which corroborate this hypothesis for different classes of non-isotropic non-product distributions, smooth activation functions and networks with a few layers.
arXiv Detail & Related papers (2019-12-26T22:04:41Z)
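As a small illustration of the weight-permutation symmetry referenced in the Optimizing Mode Connectivity via Neuron Alignment entry above, the NumPy sketch below (a hypothetical toy network, not code from any of the papers listed) shows that permuting hidden units leaves a two-layer ReLU network's output unchanged, while naively averaging unaligned copies of the weights does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-layer ReLU network f(x) = a^T relu(W x); sizes are arbitrary.
d, h = 6, 10
W = rng.normal(size=(h, d))
a = rng.normal(size=h)
x = rng.normal(size=d)

def f(W, a, x):
    return a @ np.maximum(W @ x, 0.0)

# Permuting hidden units (rows of W together with entries of a) is a
# discrete symmetry of the parameter space: the function is unchanged.
perm = rng.permutation(h)
W_p, a_p = W[perm], a[perm]
print(np.allclose(f(W, a, x), f(W_p, a_p, x)))      # True

# Averaging the permuted copy with the original without first aligning
# neurons generally changes the function, which is why alignment matters
# when searching for low-loss paths between minima.
W_mid, a_mid = 0.5 * (W + W_p), 0.5 * (a + a_p)
print(np.isclose(f(W, a, x), f(W_mid, a_mid, x)))   # typically False
```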
This list is automatically generated from the titles and abstracts of the papers on this site.