Related papers: Flat Channels to Infinity in Neural Loss Landscapes

URL: http://arxiv.org/abs/2506.14951v1
Date: Tue, 17 Jun 2025 20:04:15 GMT
Title: Flat Channels to Infinity in Neural Loss Landscapes
Authors: Flavio Martinelli, Alexander Van Meegen, Berfin Şimşek, Wulfram Gerstner, Johanni Brea
Abstract summary: Loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
Score: 46.76940650038536
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w}_i$ and $\mathbf{w}_j$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w}_i \cdot \mathbf{x}) + a_j\sigma(\mathbf{w}_j \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
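
The limit stated in the abstract can be checked numerically with a small sketch. The parameterization below is an illustrative assumption, not taken from the paper: the two input weight vectors are set to $\mathbf{w} \pm \epsilon\mathbf{v}$ and the output weights to $a_{i,j} = \tfrac{1}{2} \pm \tfrac{1}{2\epsilon}$, so that the output weights diverge with opposite signs as $\epsilon \to 0$ while the input weights merge. The choice of tanh as $\sigma$, the helper names, and the sample vectors are likewise hypothetical.

```python
import math

def sigma(z):
    # Activation function; tanh is an assumption made for this sketch.
    return math.tanh(z)

def dsigma(z):
    # Derivative of tanh.
    return 1.0 - math.tanh(z) ** 2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def output_weights(eps):
    # Assumed parameterization: a_i -> +inf and a_j -> -inf as eps -> 0.
    return 0.5 + 0.5 / eps, 0.5 - 0.5 / eps

def two_neuron_output(x, w, v, eps):
    # a_i * sigma(w_i . x) + a_j * sigma(w_j . x) with w_i, w_j = w +/- eps * v.
    a_i, a_j = output_weights(eps)
    w_i = [wk + eps * vk for wk, vk in zip(w, v)]
    w_j = [wk - eps * vk for wk, vk in zip(w, v)]
    return a_i * sigma(dot(w_i, x)) + a_j * sigma(dot(w_j, x))

def gated_linear_unit(x, w, v):
    # Limit function: sigma(w . x) + (v . x) * sigma'(w . x).
    z = dot(w, x)
    return sigma(z) + dot(v, x) * dsigma(z)

def max_gap(eps, w, v, inputs):
    # Largest absolute difference between the two functions over the sample inputs.
    return max(
        abs(two_neuron_output(x, w, v, eps) - gated_linear_unit(x, w, v))
        for x in inputs
    )

# Hypothetical example values.
w = [0.7, -0.3]
v = [0.4, 1.1]
inputs = [[-1.0, 0.5], [0.2, 0.9], [1.5, -0.7]]
eps_values = [1e-1, 1e-2, 1e-3]
gaps = [max_gap(eps, w, v, inputs) for eps in eps_values]

for eps, gap in zip(eps_values, gaps):
    a_i, a_j = output_weights(eps)
    print(f"eps={eps:g}  a_i={a_i:.1f}  a_j={a_j:.1f}  max gap={gap:.3e}")
```

Under these assumptions, a Taylor expansion in $\epsilon$ gives a gap of order $\epsilon^2$, so the printed gap should shrink as the output weights grow. This only illustrates the functional limit; it says nothing about the gradient dynamics that lead into the channels.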

Related papers

On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature [1.6773271875801752]
Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Experiments across datasets, architectures, and loss functions validate these bounds, providing a unified characterization of the noise-curvature relationship in deep learning.
arXiv Detail & Related papers (2026-02-05T12:35:13Z)
A Gap Between Decision Trees and Neural Networks [2.4140387101794283]
We study when geometric simplicity of decision boundaries, used here as a notion of interpretability, can conflict with accurate approximation of axis-aligned decision trees by shallow neural networks. For classification, we construct a smooth barrier score $S_A$ with finite $\mathrm{R}\mathrm{TV}$ whose fixed threshold $=1$ exactly recovers the box. Experiments on synthetic unions of rectangles illustrate the resulting accuracy-complexity tradeoff.
arXiv Detail & Related papers (2026-01-07T13:40:30Z)
Displacement-Sparse Neural Optimal Transport [6.968698312185846]
Optimal transport (OT) aims to find a map $T$ that transports mass from one probability measure to another while minimizing a cost function. Neural OT solvers have gained popularity in high dimensional biological applications such as drug perturbation. We propose an intuitive and theoretically grounded approach to learning displacement-sparse maps within neural OT solvers.
arXiv Detail & Related papers (2025-02-03T23:44:17Z)
Two-Timescale Gradient Descent Ascent Algorithms for Nonconvex Minimax Optimization [77.3396841985172]
We provide a unified analysis of two-timescale gradient descent ascent (TTGDA) for solving structured nonconvex minimax optimization problems. Our contribution is to design TTGDA algorithms that are effective beyond the setting.
arXiv Detail & Related papers (2024-08-21T20:14:54Z)
Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. In this work, we initiate the study of provably learning a multi-head attention layer from random examples. We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
A Minimal Control Family of Dynamical Systems for Universal Approximation [5.217870815854702]
The universal approximation property (UAP) holds a fundamental position in deep learning. We show that it can approximate continuous functions on compact domains. Our results reveal an underlying connection between the approximation power of neural networks and control systems.
arXiv Detail & Related papers (2023-12-20T10:36:55Z)
Near Optimal Heteroscedastic Regression with Symbiotic Learning [29.16456701187538]
We consider the problem of heteroscedastic linear regression. We can estimate $\mathbf{w}^*$ in squared norm up to an error of $\tilde{O}\left(\|\mathbf{f}^*\|^2 \cdot \left(\frac{1}{n} + \left(\frac{d}{n}\right)^2\right)\right)$ and prove a matching lower bound.
arXiv Detail & Related papers (2023-06-25T16:32:00Z)
Neural Network Approximation of Continuous Functions in High Dimensions with Applications to Inverse Problems [6.84380898679299]
Current theory predicts that networks should scale exponentially in the dimension of the problem. We provide a general method for bounding the complexity required for a neural network to approximate a Hölder (or uniformly) continuous function.
arXiv Detail & Related papers (2022-08-28T22:44:07Z)
Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations. The goal of the learner is to output a hypothesis vector $\mathbf{w}$ that satisfies $F(\mathbf{w})=C, \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z)
Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances [9.390008801320024]
We show that adding one extra neuron to each layer is sufficient to connect all previously discrete minima into a single manifold. We show that the number of symmetry-induced critical subspaces dominates the number of affine subspaces forming the global minima manifold.
arXiv Detail & Related papers (2021-05-25T21:19:07Z)
Nonparametric Learning of Two-Layer ReLU Residual Units [22.870658194212744]
We describe an algorithm that learns two-layer residual units with rectified linear unit (ReLU) activation. We design layer-wise objectives as functionals whose analytic minimizers express the exact ground-truth network in terms of its parameters and nonlinearities. We prove the statistical strong consistency of our algorithm, and demonstrate the robustness and sample efficiency of our algorithm by experiments.
arXiv Detail & Related papers (2020-08-17T22:11:26Z)
Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss. For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems [86.92205445270427]
We consider solving nonconvex-concave minimax problems, $\min_{\mathbf{x}} \max_{\mathbf{y}} f(\mathbf{x}, \mathbf{y})$, efficiently.
arXiv Detail & Related papers (2019-06-02T03:03:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.