Connecting Parameter Magnitudes and Hessian Eigenspaces at Scale using Sketched Methods
- URL: http://arxiv.org/abs/2504.14701v1
- Date: Sun, 20 Apr 2025 18:29:39 GMT
- Title: Connecting Parameter Magnitudes and Hessian Eigenspaces at Scale using Sketched Methods
- Authors: Andres Fernandez, Frank Schneider, Maren Mahsereci, Philipp Hennig
- Abstract summary: We develop a methodology to measure the similarity between arbitrary parameter masks and Hessian eigenspaces via Grassmannian metrics. Our experiments reveal an *overlap* between magnitude parameter masks and top Hessian eigenspaces consistently higher than chance-level. Our work provides a methodology to approximate and analyze deep learning Hessians at scale, as well as a novel insight on the structure of their eigenspace.
- Score: 22.835933033524718
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, it has been observed that when training a deep neural net with SGD, the majority of the loss landscape's curvature quickly concentrates in a tiny *top* eigenspace of the loss Hessian, which remains largely stable thereafter. Independently, it has been shown that successful magnitude pruning masks for deep neural nets emerge early in training and remain stable thereafter. In this work, we study these two phenomena jointly and show that they are connected: We develop a methodology to measure the similarity between arbitrary parameter masks and Hessian eigenspaces via Grassmannian metrics. We identify *overlap* as the most useful such metric due to its interpretability and stability. To compute *overlap*, we develop a matrix-free algorithm based on sketched SVDs that allows us to compute over 1000 Hessian eigenpairs for nets with over 10M parameters --an unprecedented scale by several orders of magnitude. Our experiments reveal an *overlap* between magnitude parameter masks and top Hessian eigenspaces consistently higher than chance-level, and that this effect gets accentuated for larger network sizes. This result indicates that *top Hessian eigenvectors tend to be concentrated around larger parameters*, or equivalently, that *larger parameters tend to align with directions of larger loss curvature*. Our work provides a methodology to approximate and analyze deep learning Hessians at scale, as well as a novel insight on the structure of their eigenspace.
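Below is a minimal, hedged sketch of the kind of computation the abstract describes, written in plain NumPy on a toy quadratic loss: it estimates a top-k Hessian eigenspace with a randomized ("sketched") range finder built only from matrix-free Hessian-vector products, then scores its Grassmannian *overlap* with the subspace spanned by the canonical directions of the k largest-magnitude parameters. The overlap formula overlap(U, V) = ||U^T V||_F^2 / k, the oversampling amount, and the toy Hessian are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch (not the authors' code): sketched top-k Hessian eigenspace vs. magnitude mask.
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 20                           # number of parameters, subspace dimension

# Toy stand-ins: a symmetric "Hessian" and a parameter vector of a hypothetical net.
A = rng.standard_normal((n, n))
H = (A + A.T) / 2.0
theta = rng.standard_normal(n)

def hvp(v):
    """Matrix-free Hessian-vector product; for a real net this would be an autodiff HVP."""
    return H @ v

# Randomized range finder: probe the Hessian with k + p random vectors,
# orthonormalize the sketch, and solve the small projected eigenproblem.
p = 10                                   # oversampling
Omega = rng.standard_normal((n, k + p))
Y = np.column_stack([hvp(Omega[:, j]) for j in range(k + p)])
Q, _ = np.linalg.qr(Y)                   # orthonormal basis for the sketched range
T = Q.T @ np.column_stack([hvp(Q[:, j]) for j in range(Q.shape[1])])
evals, S = np.linalg.eigh((T + T.T) / 2.0)
order = np.argsort(np.abs(evals))[::-1]  # sort by eigenvalue magnitude
U = Q @ S[:, order[:k]]                  # approximate top-k Hessian eigenvectors (n x k)

# Magnitude mask as a subspace: canonical basis vectors of the k largest |theta_i|.
idx = np.argsort(np.abs(theta))[-k:]
# The mask basis V has columns e_i, so U^T V simply selects rows of U.
overlap = np.sum(U[idx, :] ** 2) / k

# A random k-dimensional subspace would give an overlap of roughly k / n.
print(f"overlap = {overlap:.4f}, chance level ~ {k / n:.4f}")
```

On this random toy problem the overlap should sit near the chance level k/n; the paper's finding is that for trained networks the overlap between magnitude masks and top Hessian eigenspaces sits consistently above that level.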
Related papers
- A Bayesian Approach Toward Robust Multidimensional Ellipsoid-Specific Fitting [0.0]
This work presents a novel and effective method for fitting multidimensional ellipsoids to scattered data contaminated by noise and outliers.
We incorporate a uniform prior distribution to constrain the search for primitive parameters within an ellipsoidal domain.
We apply it to a wide range of practical applications such as microscopy cell counting, 3D reconstruction, geometric shape approximation, and magnetometer calibration tasks.
arXiv Detail & Related papers (2024-07-27T14:31:51Z)
- Super Consistency of Neural Network Landscapes and Learning Rate Transfer [72.54450821671624]
We study the landscape through the lens of the loss Hessian.
We find that certain spectral properties under $\mu$P are largely independent of the size of the network.
We show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales.
arXiv Detail & Related papers (2024-02-27T12:28:01Z)
- Asymptotics of Learning with Deep Structured (Random) Features [9.366617422860543]
For a large class of feature maps we provide a tight characterisation of the test error associated with learning the readout layer.
In some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
arXiv Detail & Related papers (2024-02-21T18:35:27Z)
- Hessian Eigenvectors and Principal Component Analysis of Neural Network Weight Matrices [0.0]
This study delves into the intricate dynamics of trained deep neural networks and their relationships with network parameters.
We unveil a correlation between Hessian eigenvectors and network weights.
This relationship, hinging on the magnitude of eigenvalues, allows us to discern parameter directions within the network.
arXiv Detail & Related papers (2023-11-01T11:38:31Z)
- Initialization Matters: Privacy-Utility Analysis of Overparameterized Neural Networks [72.51255282371805]
We prove a privacy bound for the KL divergence between model distributions on worst-case neighboring datasets.
We find that this KL privacy bound is largely determined by the expected squared gradient norm relative to model parameters during training.
arXiv Detail & Related papers (2023-10-31T16:13:22Z)
- Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z)
- What does a deep neural network confidently perceive? The effective dimension of high certainty class manifolds and their low confidence boundaries [53.45325448933401]
Deep neural network classifiers partition input space into high confidence regions for each class.
We exploit the notions of Gaussian width and Gordon's escape theorem to tractably estimate the effective dimension of CMs.
We show several connections between the dimension of CMs, generalization, and robustness.
arXiv Detail & Related papers (2022-10-11T15:42:06Z)
- Inferring Structural Parameters of Low-Surface-Brightness-Galaxies with Uncertainty Quantification using Bayesian Neural Networks [70.80563014913676]
We show that a Bayesian Neural Network (BNN) can be used for the inference, with uncertainty, of such parameters from simulated low-surface-brightness galaxy images.
Compared to traditional profile-fitting methods, we show that the uncertainties obtained using BNNs are comparable in magnitude and well-calibrated, and that the point estimates of the parameters are closer to the true values.
arXiv Detail & Related papers (2022-07-07T17:55:26Z)
- Deep learning, stochastic gradient descent and diffusion maps [0.0]
Stochastic gradient descent (SGD) is widely used in deep learning due to its computational efficiency.
It has been observed that most eigenvalues of the Hessian of the loss functions on the loss landscape of over-parametrized deep networks are close to zero.
Although the parameter space is very high-dimensional, these findings seem to indicate that the SGD dynamics may mainly live on a low-dimensional manifold.
arXiv Detail & Related papers (2022-04-04T10:19:39Z)
- Exploring the Common Principal Subspace of Deep Features in Neural Networks [50.37178960258464]
We find that different Deep Neural Networks (DNNs) trained with the same dataset share a common principal subspace in latent spaces.
Specifically, we design a new metric $\mathcal{P}$-vector to represent the principal subspace of deep features learned in a DNN.
Small angles (with cosine close to $1.0$) have been found in the comparisons between any two DNNs trained with different algorithms/architectures.
arXiv Detail & Related papers (2021-10-06T15:48:32Z)
- Sketchy Empirical Natural Gradient Methods for Deep Learning [20.517823521066234]
We develop an efficient sketchy empirical natural gradient method (SENG) for large-scale deep learning problems.
A distributed version of SENG is also developed for extremely large-scale applications.
On the task ResNet50 with ImageNet-1k, SENG achieves 75.9% Top-1 testing accuracy within 41 epochs.
arXiv Detail & Related papers (2020-06-10T16:17:09Z)