Optimizing Mode Connectivity via Neuron Alignment
- URL: http://arxiv.org/abs/2009.02439v2
- Date: Mon, 2 Nov 2020 23:56:40 GMT
- Title: Optimizing Mode Connectivity via Neuron Alignment
- Authors: N. Joseph Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, Prasanna
Sattigeri, Rongjie Lai
- Abstract summary: Empirically, the local minima of loss functions can be connected by a learned curve in model space along which the loss remains nearly constant.
We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected.
- Score: 84.26606622400423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The loss landscapes of deep neural networks are not well understood due to
their high nonconvexity. Empirically, the local minima of these loss functions
can be connected by a learned curve in model space, along which the loss
remains nearly constant; a feature known as mode connectivity. Yet, current
curve finding algorithms do not consider the influence of symmetry in the loss
surface created by model weight permutations. We propose a more general
framework to investigate the effect of symmetry on landscape connectivity by
accounting for the weight permutations of the networks being connected. To
approximate the optimal permutation, we introduce an inexpensive heuristic
referred to as neuron alignment. Neuron alignment promotes similarity between
the distribution of intermediate activations of models along the curve. We
provide theoretical analysis establishing the benefit of alignment to mode
connectivity based on this simple heuristic. We empirically verify that the
permutation given by alignment is locally optimal via a proximal alternating
minimization scheme. Empirically, optimizing the weight permutation is critical
for efficiently learning a simple, planar, low-loss curve between networks that
successfully generalizes. Our alignment method can significantly alleviate the
recently identified robust loss barrier on the path connecting two adversarially
robust models and find more robust and accurate models on the path.
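For intuition, the alignment heuristic is simple enough to sketch. The snippet below is a minimal illustration for a two-layer ReLU MLP, not the authors' released implementation; the function names (align_two_layer_mlp, bezier_point) and the dict-of-arrays weight format are assumptions made here for brevity. Hidden units of the two endpoint networks are matched by maximizing the correlation of their post-activation responses on a shared batch, the matching is solved with the Hungarian algorithm, and the resulting permutation is applied consistently to the second network's weights. A curve between the aligned endpoints, e.g. a quadratic Bezier with a trainable control point, can then be optimized so the loss stays low along the path.

```python
# Minimal sketch of neuron alignment for a two-layer ReLU MLP
# (illustrative only; not the authors' released code).
import numpy as np
from scipy.optimize import linear_sum_assignment


def relu(x):
    return np.maximum(x, 0.0)


def align_two_layer_mlp(theta_a, theta_b, X):
    """Permute the hidden units of model B so that their post-activation
    responses on the batch X correlate maximally with those of model A.

    theta_* : dict with 'W1' (d_in, h), 'b1' (h,), 'W2' (h, d_out).
    X       : (n, d_in) batch used to measure intermediate activations.
    """
    # Hidden activations of both endpoint models on the same inputs.
    H_a = relu(X @ theta_a["W1"] + theta_a["b1"])          # (n, h)
    H_b = relu(X @ theta_b["W1"] + theta_b["b1"])          # (n, h)

    # Cross-correlation matrix between hidden units of A and of B.
    Za = (H_a - H_a.mean(0)) / (H_a.std(0) + 1e-8)
    Zb = (H_b - H_b.mean(0)) / (H_b.std(0) + 1e-8)
    corr = Za.T @ Zb / X.shape[0]                          # (h, h)

    # Assignment maximizing total correlation (Hungarian algorithm).
    _, perm = linear_sum_assignment(-corr)

    # Apply the permutation consistently to incoming and outgoing weights,
    # so the aligned copy of B computes exactly the same function.
    theta_b_aligned = {
        "W1": theta_b["W1"][:, perm],
        "b1": theta_b["b1"][perm],
        "W2": theta_b["W2"][perm, :],
    }
    return theta_b_aligned, perm


def bezier_point(theta_a, theta_b, theta_c, t):
    """Weights at position t in [0, 1] on a quadratic Bezier curve between
    the aligned endpoints; the control point theta_c is what curve finding
    trains to keep the loss low along the path."""
    return {
        k: (1 - t) ** 2 * theta_a[k]
        + 2 * t * (1 - t) * theta_c[k]
        + t**2 * theta_b[k]
        for k in theta_a
    }
```

Because the permutation is applied to both the incoming and outgoing weights of the hidden layer, the aligned copy of the second network realizes exactly the same function as before; only its coordinates in weight space change, which is what makes a low-loss connecting curve easier to find.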
Related papers
- Improving Generalization of Deep Neural Networks by Optimum Shifting [33.092571599896814]
We propose a novel method called optimum shifting, which moves the parameters of a neural network from a sharp minimum to a flatter one.
Our method is based on the observation that when the input and output of a neural network are fixed, the matrix multiplications within the network can be treated as systems of under-determined linear equations.
arXiv Detail & Related papers (2024-05-23T02:31:55Z)
- Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy [75.15685966213832]
We analyze the rich directional structure of optimization trajectories represented by their pointwise parameters.
We show that training only the scalar batchnorm parameters, starting partway through training, matches the performance of training the entire network.
arXiv Detail & Related papers (2024-03-12T07:32:47Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks [21.69222364939501]
In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent initialized at zero.
We show that the path interpolates continuously between the so-called kernel and rich regimes.
arXiv Detail & Related papers (2023-03-31T05:32:11Z)
- Semantic Strengthening of Neuro-Symbolic Learning [85.6195120593625]
Neuro-symbolic approaches typically resort to fuzzy approximations of a probabilistic objective.
We show how to compute this efficiently for tractable circuits.
We test our approach on three tasks: predicting a minimum-cost path in Warcraft, predicting a minimum-cost perfect matching, and solving Sudoku puzzles.
arXiv Detail & Related papers (2023-02-28T00:04:22Z)
- Phenomenology of Double Descent in Finite-Width Neural Networks [29.119232922018732]
Double descent delineates the behaviour of models depending on the regime they belong to.
We use influence functions to derive suitable expressions of the population loss and its lower bound.
Building on our analysis, we investigate how the loss function affects double descent.
arXiv Detail & Related papers (2022-03-14T17:39:49Z)
- Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off [10.182922771556742]
We present a nonasymptotic generalization theory for two-layer neural networks with ReLU activation function.
We show that overparametrized random feature models suffer from the curse of dimensionality and thus are suboptimal.
arXiv Detail & Related papers (2021-06-09T03:52:18Z)
- Rethinking Skip Connection with Layer Normalization in Transformers and ResNets [49.87919454950763]
Skip connection is a widely-used technique to improve the performance of deep neural networks.
In this work, we investigate how the scale factors affect the effectiveness of the skip connection.
arXiv Detail & Related papers (2021-05-15T11:44:49Z)
- On the Stability Properties and the Optimization Landscape of Training Problems with Squared Loss for Neural Networks and General Nonlinear Conic Approximation Schemes [0.0]
We study the optimization landscape and the stability properties of training problems with squared loss for neural networks and general nonlinear conic approximation schemes.
We prove that the same effects that are responsible for these instability properties are also the reason for the emergence of saddle points and spurious local minima.
arXiv Detail & Related papers (2020-11-06T11:34:59Z)
- Neural Control Variates [71.42768823631918]
We show that a set of neural networks can face the challenge of finding a good approximation of the integrand.
We derive a theoretically optimal, variance-minimizing loss function, and propose an alternative, composite loss for stable online training in practice.
Specifically, we show that the learned light-field approximation is of sufficient quality for high-order bounces, allowing us to omit the error correction and thereby dramatically reduce the noise at the cost of negligible visible bias.
arXiv Detail & Related papers (2020-06-02T11:17:55Z)