Charting the Topography of the Neural Network Landscape with
Thermal-Like Noise
- URL: http://arxiv.org/abs/2304.01335v2
- Date: Tue, 18 Apr 2023 06:25:31 GMT
- Title: Charting the Topography of the Neural Network Landscape with
Thermal-Like Noise
- Authors: Theo Jules, Gal Brener, Tal Kachman, Noam Levi, Yohai Bar-Sinai
- Abstract summary: Training neural networks is a complex, high-dimensional, non-convex and noisy optimization problem.
We use Langevin dynamics to study the loss landscape of an over-parameterized fully connected network performing a classification task on random data.
We find that the low-loss region is a low-dimensional manifold whose dimension can be readily obtained from the fluctuations.
We explain this behavior by a simplified loss model which is analytically tractable and reproduces the observed fluctuation statistics.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The training of neural networks is a complex, high-dimensional, non-convex
and noisy optimization problem whose theoretical understanding is interesting
both from an applicative perspective and for fundamental reasons. A core
challenge is to understand the geometry and topography of the landscape that
guides the optimization. In this work, we employ standard Statistical Mechanics
methods, namely, phase-space exploration using Langevin dynamics, to study this
landscape for an over-parameterized fully connected network performing a
classification task on random data. Analyzing the fluctuation statistics, in
analogy to thermal dynamics at a constant temperature, we infer a clear
geometric description of the low-loss region. We find that it is a
low-dimensional manifold whose dimension can be readily obtained from the
fluctuations. Furthermore, this dimension is controlled by the number of data
points that reside near the classification decision boundary. Importantly, we
find that a quadratic approximation of the loss near the minimum is
fundamentally inadequate due to the exponential nature of the decision boundary
and the flatness of the low-loss region. This causes the dynamics to sample
regions with higher curvature at higher temperatures, while producing
quadratic-like statistics at any given temperature. We explain this behavior by
a simplified loss model which is analytically tractable and reproduces the
observed fluctuation statistics.
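
As a concrete illustration of the method described above, here is a minimal sketch (not the authors' code) of phase-space exploration with overdamped Langevin dynamics on a toy two-layer tanh network with a logistic loss, followed by an equipartition-style readout of the effective dimension from the loss fluctuations. The network sizes, step size, temperatures, and step counts are all illustrative assumptions.

```python
# Minimal sketch, assuming a toy setup: overdamped Langevin exploration of the
# loss landscape of a small two-layer tanh network on random data, with an
# equipartition-style estimate of the effective dimension from the fluctuations.
import numpy as np

rng = np.random.default_rng(0)

# Random binary-classification data (tiny stand-in for the paper's setup).
n_samples, n_in, n_hidden = 20, 10, 64
X = rng.standard_normal((n_samples, n_in))
y = rng.choice([-1.0, 1.0], size=n_samples)

def unpack(theta):
    W1 = theta[:n_in * n_hidden].reshape(n_in, n_hidden)
    w2 = theta[n_in * n_hidden:]
    return W1, w2

def loss_and_grad(theta):
    """Mean logistic loss of a two-layer tanh network, with its gradient."""
    W1, w2 = unpack(theta)
    h = np.tanh(X @ W1)                       # hidden activations
    f = h @ w2                                # scalar output per sample
    m = y * f                                 # classification margins
    loss = np.logaddexp(0.0, -m).mean()
    sig = 0.5 * (1.0 - np.tanh(0.5 * m))      # 1 / (1 + e^m), overflow-safe
    df = -y * sig / n_samples                 # dL/df
    gw2 = h.T @ df
    dh = np.outer(df, w2) * (1.0 - h**2)      # backprop through tanh
    gW1 = X.T @ dh
    return loss, np.concatenate([gW1.ravel(), gw2])

def langevin(theta, T, eta=1e-2, burn=5000, steps=20000):
    """theta <- theta - eta * grad + sqrt(2 * eta * T) * xi,  xi ~ N(0, I)."""
    losses = []
    for step in range(burn + steps):
        loss, grad = loss_and_grad(theta)
        theta = (theta - eta * grad
                 + np.sqrt(2.0 * eta * T) * rng.standard_normal(theta.size))
        if step >= burn:
            losses.append(loss)
    return theta, np.array(losses)

# Relax into the low-loss region first (T = 0 is plain gradient descent) ...
theta0 = 0.5 * rng.standard_normal(n_in * n_hidden + n_hidden)
theta_min, _ = langevin(theta0, T=0.0, burn=0, steps=5000)

# ... then sample the loss fluctuations at a few small temperatures.
means = []
for T in (1e-5, 1e-4, 1e-3):
    _, losses = langevin(theta_min.copy(), T)
    means.append((T, losses.mean()))
    print(f"T = {T:.0e}   <L> = {losses.mean():.3e}")

# Equipartition for a locally quadratic landscape gives <L> ~ L_min + d_eff*T/2,
# so the slope of <L> versus T estimates the number of stiff directions sampled.
for (T1, L1), (T2, L2) in zip(means, means[1:]):
    print(f"d_eff between T={T1:.0e} and T={T2:.0e}: "
          f"{2.0 * (L2 - L1) / (T2 - T1):.1f}")
```

If the landscape were truly quadratic, the slope of the mean loss versus temperature would be constant; a temperature-dependent effective dimension is precisely the non-quadratic signature the abstract describes, with higher temperatures sampling regions of higher curvature.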
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Information-Theoretic Thresholds for Planted Dense Cycles [52.076657911275525]
We study a random graph model for small-world networks which are ubiquitous in social and biological sciences.
For both detection and recovery of the planted dense cycle, we characterize the information-theoretic thresholds in terms of $n$, $\tau$, and an edge-wise signal-to-noise ratio $\lambda$.
arXiv Detail & Related papers (2024-02-01T03:39:01Z) - On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural
Networks with Linear Activations [0.0]
We investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty on the gradient estimation.
We show that the general overparametrized formulation introduces a set of spurious equilibria which lie outside the set where the loss function is minimized.
arXiv Detail & Related papers (2023-05-17T02:26:34Z) - Dynamic Causal Explanation Based Diffusion-Variational Graph Neural
Network for Spatio-temporal Forecasting [60.03169701753824]
We propose a novel Dynamic Diffusion-Variational Graph Neural Network (DVGNN) for spatio-temporal forecasting.
The proposed DVGNN model outperforms state-of-the-art approaches and achieves outstanding Root Mean Squared Error results.
arXiv Detail & Related papers (2023-05-16T11:38:19Z) - A physics and data co-driven surrogate modeling approach for temperature
field prediction on irregular geometric domain [12.264200001067797]
We propose a novel physics and data co-driven surrogate modeling method for temperature field prediction.
Numerical results demonstrate that our method can significantly improve prediction accuracy on a smaller dataset.
arXiv Detail & Related papers (2022-03-15T08:43:24Z) - Physics-informed Convolutional Neural Networks for Temperature Field
Prediction of Heat Source Layout without Labeled Data [9.71214034180507]
This paper develops a physics-informed convolutional neural network (CNN) as a surrogate for thermal simulation.
The network can learn a mapping from heat source layout to the steady-state temperature field without labeled data, which amounts to solving an entire family of partial differential equations (PDEs); a sketch of this idea appears after this list.
arXiv Detail & Related papers (2021-09-26T03:24:23Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
Benign overfitting is the phenomenon whereby neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Learning the structure of wind: A data-driven nonlocal turbulence model
for the atmospheric boundary layer [0.0]
We develop a novel data-driven approach to modeling the atmospheric boundary layer.
This approach leads to a nonlocal, anisotropic synthetic turbulence model which we refer to as the deep rapid distortion (DRD) model.
arXiv Detail & Related papers (2021-07-23T06:41:33Z) - The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations,
and Anomalous Diffusion [29.489737359897312]
We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD).
We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
arXiv Detail & Related papers (2021-07-19T20:18:57Z) - Towards Deeper Graph Neural Networks [63.46470695525957]
Graph convolutions perform neighborhood aggregation and represent one of the most important graph operations.
Several recent studies attribute the performance deterioration of deeper models to the over-smoothing issue.
We propose Deep Adaptive Graph Neural Network (DAGNN) to adaptively incorporate information from large receptive fields.
arXiv Detail & Related papers (2020-07-18T01:11:14Z) - A Near-Optimal Gradient Flow for Learning Neural Energy-Based Models [93.24030378630175]
We propose a novel numerical scheme to optimize the gradient flows for learning energy-based models (EBMs).
We derive a second-order Wasserstein gradient flow of the global relative entropy from the Fokker-Planck equation (the underlying identity is sketched after this list).
Compared with existing schemes, the Wasserstein gradient flow is a smoother and near-optimal numerical scheme for approximating real data densities.
arXiv Detail & Related papers (2019-10-31T02:26:20Z)
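
For the physics-informed CNN entry above, here is a hedged sketch of the general idea, not that paper's architecture: a small CNN maps a heat-source layout to a temperature field and is trained only on the finite-difference residual of the steady heat equation plus a Dirichlet boundary term, with no labeled temperature data. The grid size, conductivity, source layouts, and network shape are illustrative assumptions.

```python
# Hedged sketch of a physics-informed surrogate (not the paper's code): train a
# CNN so that k * laplacian(T) + q = 0 holds on the grid interior and T = 0 on
# the boundary, using no labeled temperature data.
import torch
import torch.nn as nn
import torch.nn.functional as F

# 5-point finite-difference Laplacian as a fixed (non-trainable) conv kernel.
LAP = torch.tensor([[[[0., 1., 0.],
                      [1., -4., 1.],
                      [0., 1., 0.]]]])

class HeatSurrogate(nn.Module):
    """Tiny CNN mapping a source layout q(x, y) to a temperature field T(x, y)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, q):
        return self.net(q)

def physics_loss(T, q, h=1.0, k=1.0):
    """PDE residual on interior points plus Dirichlet T = 0 on the boundary."""
    lap = F.conv2d(T, LAP) / h**2              # interior Laplacian, (B,1,H-2,W-2)
    residual = k * lap + q[:, :, 1:-1, 1:-1]   # k * laplacian(T) + q should vanish
    boundary = (T[:, :, 0, :]**2).mean() + (T[:, :, -1, :]**2).mean() \
             + (T[:, :, :, 0]**2).mean() + (T[:, :, :, -1]**2).mean()
    return (residual**2).mean() + boundary

model = HeatSurrogate()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
q = torch.zeros(8, 1, 32, 32)
for i in range(8):
    r, c = torch.randint(4, 24, (2,))
    q[i, 0, r:r+6, c:c+6] = 1.0                # a random block heat source per layout
for step in range(500):
    loss = physics_loss(model(q), q)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the loss is the PDE residual itself, every source layout in the batch provides a training signal, which is the sense in which training the surrogate amounts to solving a family of PDEs at once.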
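For the energy-based-model entry, the identity it builds on is the classical JKO result, stated here from general knowledge rather than from that paper: Fokker-Planck dynamics is the Wasserstein-2 gradient flow of the relative entropy with respect to the Gibbs measure.

```latex
% Fokker--Planck as a Wasserstein-2 gradient flow (classical JKO identity).
% With F(\rho) = KL(\rho || \pi) and Gibbs measure \pi \propto e^{-V}, one has
% \delta F / \delta \rho = \log(\rho / \pi) + 1, hence:
\[
  \partial_t \rho_t
  = \nabla \cdot \!\left( \rho_t \, \nabla \frac{\delta F}{\delta \rho}[\rho_t] \right)
  = \nabla \cdot (\rho_t \nabla V) + \Delta \rho_t,
  \qquad
  F(\rho) = \int \rho \log \frac{\rho}{\pi} \, dx,
  \quad
  \pi \propto e^{-V}.
\]
```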
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.