Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes
- URL: http://arxiv.org/abs/2104.11044v2
- Date: Fri, 23 Apr 2021 17:24:48 GMT
- Title: Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes
- Authors: James Lucas, Juhan Bae, Michael R. Zhang, Stanislav Fort, Richard Zemel, Roger Grosse
- Abstract summary: We provide sufficient conditions for the MLI property under mean squared error.
While the MLI property holds under various settings, we show that networks violating the MLI property can be produced systematically in practice.
- Score: 17.222244907679997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear interpolation between initial neural network parameters and converged
parameters after training with stochastic gradient descent (SGD) typically
leads to a monotonic decrease in the training objective. This Monotonic Linear
Interpolation (MLI) property, first observed by Goodfellow et al. (2014),
persists in spite of the non-convex objectives and highly non-linear training
dynamics of neural networks. Extending this work, we evaluate several
hypotheses for this property that, to our knowledge, have not yet been
explored. Using tools from differential geometry, we draw connections between
the interpolated paths in function space and the monotonicity of the network -
providing sufficient conditions for the MLI property under mean squared error.
While the MLI property holds under various settings (e.g. network architectures
and learning problems), we show in practice that networks violating the MLI
property can be produced systematically, by encouraging the weights to move far
from initialization. The MLI property raises important questions about the loss
landscape geometry of neural networks and highlights the need to further study
their global properties.
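In concrete terms, checking the property described above amounts to evaluating the training loss along the straight line between the initial parameters theta_0 and the converged parameters theta_T. Below is a minimal sketch of such a check, assuming a PyTorch setup; the names model_fn, theta_init, theta_final, train_loader, and loss_fn are placeholders for this illustration, not details from the paper.

import torch

def loss_on_path(model_fn, theta_init, theta_final, train_loader, loss_fn, steps=20):
    """Evaluate the training loss along (1 - alpha) * theta_0 + alpha * theta_T."""
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        model = model_fn()  # fresh instance of the architecture (assumed helper)
        # Interpolate floating-point parameters; copy integer buffers unchanged.
        interp = {k: ((1 - alpha) * v0 + alpha * theta_final[k])
                     if v0.is_floating_point() else v0
                  for k, v0 in theta_init.items()}
        model.load_state_dict(interp)
        model.eval()
        total, n = 0.0, 0
        with torch.no_grad():
            for x, y in train_loader:
                total += loss_fn(model(x), y).item() * x.size(0)
                n += x.size(0)
        losses.append(total / n)
    # The MLI property holds if the loss never increases along the path.
    monotone = all(b <= a + 1e-8 for a, b in zip(losses, losses[1:]))
    return losses, monotone

Evaluating on the full training set at each alpha keeps the check faithful to the training objective; a fixed subset can be substituted when that is too expensive.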
Related papers
- Globally Gated Deep Linear Networks [3.04585143845864]
We introduce Globally Gated Deep Linear Networks (GGDLNs) where gating units are shared among all processing units in each layer.
We derive exact equations for the generalization properties in these networks in the finite-width thermodynamic limit.
Our work is the first exact theoretical solution of learning in a family of nonlinear networks with finite width.
arXiv Detail & Related papers (2022-10-31T16:21:56Z) - Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss
Landscape for Deep Networks [18.71055320062469]
Monotonic linear interpolation (MLI) is a phenomenon that is commonly observed in the training of neural networks.
We show that the MLI property is not necessarily related to the hardness of optimization problems.
In particular, we show that linearly interpolating the weights and the biases leads to very different influences on the final output; a short sketch contrasting the two appears after this list.
arXiv Detail & Related papers (2022-10-03T15:33:29Z) - Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs).
Due to the complex non-linear characteristics of samples, the objective of those activation functions is to project samples from their original feature space to a linearly separable feature space.
This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs.
arXiv Detail & Related papers (2022-03-22T13:09:17Z) - Global Convergence Analysis of Deep Linear Networks with A One-neuron
Layer [18.06634056613645]
We consider optimizing deep linear networks which have a layer with one neuron under quadratic loss.
We describe the convergent point of trajectories with arbitrary starting point under gradient flow.
We show specific convergence rates of trajectories that converge to the global minimizer by stages.
arXiv Detail & Related papers (2022-01-08T04:44:59Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z) - Statistical Mechanics of Deep Linear Neural Networks: The
Back-Propagating Renormalization Group [4.56877715768796]
We study the statistical mechanics of learning in Deep Linear Neural Networks (DLNNs) in which the input-output function of an individual unit is linear.
We solve exactly the network properties following supervised learning using an equilibrium Gibbs distribution in the weight space.
Our numerical simulations reveal that despite the nonlinearity, the predictions of our theory are largely shared by ReLU networks with modest depth.
arXiv Detail & Related papers (2020-12-07T20:08:31Z) - Over-parametrized neural networks as under-determined linear systems [31.69089186688224]
We show that it is unsurprising that simple neural networks can achieve zero training loss.
We show that kernels typically associated with the ReLU activation function have fundamental flaws.
We propose new activation functions that avoid the pitfalls of ReLU in that they admit zero training loss solutions for any set of distinct data points.
arXiv Detail & Related papers (2020-10-29T21:43:00Z) - Modeling from Features: a Mean-field Framework for Over-parameterized
Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z) - Provably Efficient Neural Estimation of Structural Equation Model: An
Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized structural equation models (SEMs).
We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using gradient descent.
For the first time we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
arXiv Detail & Related papers (2020-07-02T17:55:47Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
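As a companion to the interpolation check given after the abstract above, the sketch below separates weight-only from bias-only interpolation, in the spirit of the "Plateau in Monotonic Linear Interpolation" summary in this list. The function name and the reliance on PyTorch's ".weight"/".bias" parameter-naming convention are assumptions for illustration, not details taken from that paper.

def partial_interpolation(theta_init, theta_final, alpha, suffix="weight"):
    """Interpolate only parameters whose names end with `suffix` (for example
    "weight" or "bias"), keeping all remaining parameters at their initial
    values. Illustrative sketch only."""
    return {k: ((1 - alpha) * v0 + alpha * theta_final[k])
               if k.endswith(suffix) and v0.is_floating_point() else v0
            for k, v0 in theta_init.items()}

Sweeping alpha from 0 to 1 and evaluating the training loss of each partially interpolated state dict, as in the inner loop of the earlier sketch, yields separate weight-only and bias-only loss curves that can be compared directly.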
This list is automatically generated from the titles and abstracts of the papers on this site.