Random initialisations performing above chance and how to find them
- URL: http://arxiv.org/abs/2209.07509v1
- Date: Thu, 15 Sep 2022 17:52:54 GMT
- Title: Random initialisations performing above chance and how to find them
- Authors: Frederik Benzing, Simon Schug, Robert Meier, Johannes von Oswald,
Yassir Akram, Nicolas Zucchet, Laurence Aitchison, Angelika Steger
- Abstract summary: Entezari et al. recently conjectured that despite different initialisations, the solutions found by SGD lie in the same loss valley after taking into account the permutation invariance of neural networks.
Here, we use a simple but powerful algorithm to find such permutations that allows us to obtain direct empirical evidence that the hypothesis is true in fully connected networks.
We find that two networks already live in the same loss valley at the time of initialisation and averaging their random, but suitably permuted initialisation performs significantly above chance.
- Score: 22.812660025650253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural networks trained with stochastic gradient descent (SGD) starting from
different random initialisations typically find functionally very similar
solutions, raising the question of whether there are meaningful differences
between different SGD solutions. Entezari et al. recently conjectured that
despite different initialisations, the solutions found by SGD lie in the same
loss valley after taking into account the permutation invariance of neural
networks. Concretely, they hypothesise that any two solutions found by SGD can
be permuted such that the linear interpolation between their parameters forms a
path without significant increases in loss. Here, we use a simple but powerful
algorithm to find such permutations that allows us to obtain direct empirical
evidence that the hypothesis is true in fully connected networks. Strikingly,
we find that two networks already live in the same loss valley at the time of
initialisation and averaging their random, but suitably permuted initialisation
performs significantly above chance. In contrast, for convolutional
architectures, our evidence suggests that the hypothesis does not hold.
Especially in a large learning rate regime, SGD seems to discover diverse
modes.
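For concreteness, the sketch below (plain NumPy/SciPy, not the authors' implementation) illustrates the ingredients the abstract refers to: permuting the hidden units of a fully connected network leaves its function unchanged, a permutation aligning two networks can be searched for with a simple weight-matching heuristic (the Hungarian algorithm is shown here as one plausible choice, not necessarily the paper's exact algorithm), and the loss barrier is probed by evaluating the loss along the linear interpolation path between the aligned parameters.

```python
# Minimal sketch (not the authors' code): permutation invariance, weight
# matching via the Hungarian algorithm, and the loss along the linear
# interpolation path between two aligned parameter sets.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n = 8, 32, 1, 256


def init():
    # One-hidden-layer ReLU network with a simple scaled-Gaussian init.
    return (rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in),
            np.zeros(d_hidden),
            rng.normal(size=(d_out, d_hidden)) / np.sqrt(d_hidden),
            np.zeros(d_out))


def forward(params, X):
    W1, b1, W2, b2 = params
    return np.maximum(X @ W1.T + b1, 0.0) @ W2.T + b2


def mse(params, X, y):
    return float(np.mean((forward(params, X) - y) ** 2))


def permute_hidden(params, perm):
    # Reordering hidden units (rows of W1/b1, columns of W2) is a symmetry.
    W1, b1, W2, b2 = params
    return (W1[perm], b1[perm], W2[:, perm], b2)


def match_hidden_units(params_a, params_b):
    # Weight matching: permute B's hidden units to maximise the similarity of
    # their incoming and outgoing weights with A's units (Hungarian algorithm).
    W1a, _, W2a, _ = params_a
    W1b, _, W2b, _ = params_b
    similarity = W1a @ W1b.T + W2a.T @ W2b
    rows, cols = linear_sum_assignment(-similarity)  # maximise total similarity
    return cols[np.argsort(rows)]


def interpolate(pa, pb, alpha):
    return tuple((1 - alpha) * a + alpha * b for a, b in zip(pa, pb))


# Toy regression data and two independent random initialisations.
X = rng.normal(size=(n, d_in))
y = np.sin(X[:, :1])
params_a, params_b = init(), init()

# Permutation invariance: the loss is unchanged (up to float round-off).
perm = rng.permutation(d_hidden)
assert np.isclose(mse(params_a, X, y), mse(permute_hidden(params_a, perm), X, y))

# Align B to A, then scan the loss along the linear interpolation path.
params_b_aligned = permute_hidden(params_b, match_hidden_units(params_a, params_b))
for alpha in np.linspace(0.0, 1.0, 5):
    mid = interpolate(params_a, params_b_aligned, alpha)
    print(f"alpha={alpha:.2f}  mse={mse(mid, X, y):.4f}")
```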
Related papers
- On permutation symmetries in Bayesian neural network posteriors: a
variational perspective [8.310462710943971]
We show that there is essentially no loss barrier between the local solutions of gradient descent.
This raises questions for approximate inference in Bayesian neural networks.
We propose a matching algorithm to search for linearly connected solutions.
arXiv Detail & Related papers (2023-10-16T08:26:50Z) - The Implicit Bias of Minima Stability in Multivariate Shallow ReLU
Networks [53.95175206863992]
We study the type of solutions to which gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss.
We prove that although shallow ReLU networks are universal approximators, stable shallow networks are not.
arXiv Detail & Related papers (2023-06-30T09:17:39Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots may occur at locations distinct from the data points.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Learning through atypical "phase transitions" in overparameterized
neural networks [0.43496401697112685]
Current deep neural networks are highly overparameterized (up to billions of connection weights) and nonlinear.
Yet they can fit data almost perfectly through gradient-based descent algorithms and achieve unexpectedly accurate predictions.
These achievements pose formidable challenges for our understanding of generalization.
arXiv Detail & Related papers (2021-10-01T23:28:07Z) - Multivariate Deep Evidential Regression [77.34726150561087]
A new approach with uncertainty-aware neural networks shows promise over traditional deterministic methods.
We discuss three issues with this approach and propose a solution for extracting aleatoric and epistemic uncertainties from regression-based neural networks.
arXiv Detail & Related papers (2021-04-13T12:20:18Z) - Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the
Hessian [48.61341260604871]
Stochastic Gradient Descent (SGD) is a key component of the success of deep neural networks (DNNs).
In this paper, we present a different approach: following the eigenvectors of the Hessian, which we call "ridges".
We show both theoretically and experimentally that our method, called Ridge Rider (RR), offers a promising direction for a variety of challenging problems.
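As a rough illustration of the idea of following Hessian eigenvectors (a toy sketch, not the Ridge Rider implementation), the dominant eigenvector of a small loss surface can be estimated with power iteration on Hessian-vector products:

```python
# Toy sketch (not the Ridge Rider implementation): estimate the dominant
# Hessian eigenvector of a small loss with power iteration on Hessian-vector
# products, computed via finite differences of the gradient.
import numpy as np


def grad(w):
    # Gradient of the toy loss f(w) = 1.5*w0^2 + w0*w1 - 0.5*w1^2,
    # whose Hessian is [[3, 1], [1, -1]].
    return np.array([3.0 * w[0] + w[1], w[0] - w[1]])


def hvp(w, v, eps=1e-4):
    # Hessian-vector product H(w) v via a central difference of gradients.
    return (grad(w + eps * v) - grad(w - eps * v)) / (2.0 * eps)


w = np.array([0.3, -0.2])                  # point at which to inspect curvature
v = np.random.default_rng(1).normal(size=2)
v /= np.linalg.norm(v)
for _ in range(100):                       # power iteration
    v = hvp(w, v)
    v /= np.linalg.norm(v)

print("dominant eigenvector (approx):", v)
print("Rayleigh quotient (eigenvalue):", float(v @ hvp(w, v)))
```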
arXiv Detail & Related papers (2020-11-12T17:15:09Z) - Convergence of Online Adaptive and Recurrent Optimization Algorithms [0.0]
We prove local convergence of several notable descent algorithms used in machine learning.
We adopt an "ergodic" rather than probabilistic viewpoint, working with empirical time averages instead of probability distributions.
arXiv Detail & Related papers (2020-05-12T09:48:52Z) - Online stochastic gradient descent on non-convex losses from
high-dimensional inference [2.2344764434954256]
Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks.
In this paper we study when SGD produces an estimator with non-trivial correlation with the underlying signal.
We illustrate our approach by applying it to a set of tasks such as phase retrieval and parameter estimation for generalized linear models.
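As a toy illustration of this online setting (a hedged sketch of a phase-retrieval task, not the paper's analysis), one can run single-sample SGD on fresh Gaussian measurements and track the estimator's correlation with the ground-truth signal:

```python
# Hedged sketch: online (single-sample) SGD for a toy phase-retrieval problem,
# tracking the correlation of the running estimate with the ground-truth
# signal. Illustration of the setting only, not the paper's analysis.
import numpy as np

rng = np.random.default_rng(2)
d = 100
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)             # ground-truth signal on the sphere

w = rng.normal(size=d)
w /= np.linalg.norm(w)                       # random initialisation
lr = 0.01 / d                                # small step size

for t in range(1, 200_001):
    a = rng.normal(size=d)                   # fresh measurement vector (online)
    y = (a @ w_star) ** 2                    # phaseless observation
    r = (a @ w) ** 2 - y                     # residual
    w -= lr * r * 2.0 * (a @ w) * a          # gradient step on 0.5 * r**2
    if t % 50_000 == 0:
        print(f"step {t}: |correlation| = {abs(w @ w_star) / np.linalg.norm(w):.3f}")
```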
arXiv Detail & Related papers (2020-03-23T17:34:06Z) - MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of stochastic gradient descent combined with non-convex losses renders learning susceptible to initialization.
We propose fusing neighboring layers of deeper networks that are trained with random initializations.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
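As a hedged sketch of the basic algebra such fusion schemes build on (the linear case only, not the paper's method for deep ReLU networks), two stacked affine layers collapse exactly into a single layer:

```python
# Hedged sketch of the algebra behind layer fusion: two stacked affine layers
# (no nonlinearity in between) collapse exactly into one affine layer.
import numpy as np

rng = np.random.default_rng(4)
d_in, d_mid, d_out = 10, 20, 5
W1, b1 = rng.normal(size=(d_mid, d_in)), rng.normal(size=d_mid)
W2, b2 = rng.normal(size=(d_out, d_mid)), rng.normal(size=d_out)

# Fused layer: W2 @ (W1 x + b1) + b2 == (W2 W1) x + (W2 b1 + b2).
W_fused = W2 @ W1
b_fused = W2 @ b1 + b2

x = rng.normal(size=d_in)
assert np.allclose(W2 @ (W1 @ x + b1) + b2, W_fused @ x + b_fused)
```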
This list is automatically generated from the titles and abstracts of the papers on this site.