Hessian Spectral Analysis at Foundation Model Scale
- URL: http://arxiv.org/abs/2602.00816v1
- Date: Sat, 31 Jan 2026 16:57:06 GMT
- Title: Hessian Spectral Analysis at Foundation Model Scale
- Authors: Diego Granziol, Khurshid Juarev
- Abstract summary: We show that faithful spectral analysis of the true Hessian is tractable at frontier scale. We produce the first large-scale spectral density estimates beyond the sub-10B regime.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate Hessian spectra of foundation models have remained out of reach, leading most prior work to rely on small models or strong structural approximations. We show that faithful spectral analysis of the true Hessian is tractable at frontier scale. Using shard-local finite-difference Hessian vector products compatible with Fully Sharded Data Parallelism, we perform stochastic Lanczos quadrature on open-source language models with up to 100B parameters, producing the first large-scale spectral density estimates beyond the sub-10B regime. We characterize the numerical behavior of this pipeline, including finite-difference bias, floating-point noise amplification, and their effect on Krylov stability in fp32 and bf16, and derive practical operating regimes that are validated empirically. We further provide end-to-end runtime and memory scaling laws, showing that full-operator spectral probing incurs only a modest constant-factor overhead over first-order training. Crucially, direct access to the Hessian reveals that widely used block-diagonal curvature approximations can fail catastrophically, exhibiting order-one relative error and poor directional alignment even in mid-scale LLMs. Together, our results demonstrate that foundation-model Hessian spectra are both computable and qualitatively misrepresented by prevailing approximations, opening the door to principled curvature-based analysis at scale.
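The abstract's pipeline combines two ingredients that are easy to illustrate in miniature: a finite-difference Hessian-vector product built from two gradient calls, and stochastic Lanczos quadrature over those products. The sketch below is a toy NumPy version on a small problem, not the paper's shard-local FSDP implementation; all function names and parameters are illustrative assumptions.

```python
import numpy as np

def fd_hvp(grad_fn, x, v, eps=1e-4):
    # finite-difference Hessian-vector product:
    #   H v ~= (grad_f(x + eps*v) - grad_f(x - eps*v)) / (2*eps)
    # two gradient calls, no second-order autodiff needed
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2 * eps)

def lanczos_spectrum(matvec, dim, steps=30, seed=0):
    # stochastic Lanczos quadrature with a single random probe vector:
    # build the tridiagonal matrix T, then read spectral-density nodes
    # (Ritz values) and Gauss quadrature weights from its eigenpairs
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    v_prev, beta = np.zeros(dim), 0.0
    alphas, betas = [], []
    for _ in range(steps):
        w = matvec(v) - beta * v_prev
        alpha = v @ w
        w -= alpha * v            # full reorthogonalization omitted
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-12:          # Krylov subspace exhausted
            break
        v_prev, v = v, w / beta
    k = len(alphas)
    T = (np.diag(alphas)
         + np.diag(betas[:k - 1], 1)
         + np.diag(betas[:k - 1], -1))
    nodes, U = np.linalg.eigh(T)
    weights = U[0] ** 2           # squared top row of eigenvectors
    return nodes, weights
```

On a quadratic f(x) = 0.5 x^T A x the finite-difference HVP is exact up to rounding, so with `steps` equal to the dimension the nodes reproduce A's eigenvalues; in practice one uses steps far below the parameter count and averages several probe vectors, and the paper's fp32/bf16 analysis concerns exactly how `eps` and floating-point noise interact with this recursion.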
Related papers
- Sharp Convergence Rates for Masked Diffusion Models [53.117058231393834]
We develop a total-variation based analysis for the Euler method that overcomes limitations. Our results relax assumptions on score estimation, improve parameter dependencies, and establish convergence guarantees. Overall, our analysis introduces a direct TV-based error decomposition along the CTMC trajectory and a decoupling-based path-wise analysis for FHS.
arXiv Detail & Related papers (2026-02-26T00:47:51Z)
- Universality of General Spiked Tensor Models [9.454986540713655]
We study the rank-one spiked tensor model in the high-dimensional regime. We show that its high-dimensional spectral behavior and statistical limits are robust to non-Gaussian noise.
arXiv Detail & Related papers (2026-02-04T11:59:30Z)
- FlexCausal: Flexible Causal Disentanglement via Structural Flow Priors and Manifold-Aware Interventions [1.7114074082429929]
Causal Disentangled Representation Learning (CDRL) aims to learn and disentangle low-dimensional representations from observations. We propose FlexCausal, a novel CDRL framework based on a block-diagonal covariance VAE. Our framework ensures a precise structural correspondence between the learned latent subspaces and the ground-truth causal relations.
arXiv Detail & Related papers (2026-01-29T11:30:53Z)
- The Vekua Layer: Exact Physical Priors for Implicit Neural Representations via Generalized Analytic Functions [0.0]
Implicit Neural Representations (INRs) have emerged as a powerful paradigm for parameterizing physical fields. We introduce a differentiable spectral method grounded in the theory of Generalized Analytic Functions. We show that our method can effectively act as a physics-informed spectral filter.
arXiv Detail & Related papers (2025-12-11T21:57:21Z)
- Revisiting Zeroth-Order Optimization: Minimum-Variance Two-Point Estimators and Directionally Aligned Perturbations [57.179679246370114]
We identify the distribution of random perturbations that minimizes the estimator's variance as the perturbation stepsize tends to zero. Our findings reveal that such desired perturbations can align directionally with the true gradient, instead of maintaining a fixed length.
arXiv Detail & Related papers (2025-10-22T19:06:39Z)
- On the Optimal Construction of Unbiased Gradient Estimators for Zeroth-Order Optimization [57.179679246370114]
A potential limitation of existing methods is the bias inherent in most perturbation estimators unless a vanishing stepsize is used. We propose a novel family of unbiased gradient scaling estimators that eliminate this bias while retaining a favorable construction.
arXiv Detail & Related papers (2025-10-22T18:25:43Z)
- Spectral Thresholds in Correlated Spiked Models and Fundamental Limits of Partial Least Squares [15.163541835643635]
We show that Partial Least Squares (PLS) fails to recover any signal, despite detectability being possible in principle. These findings clarify the theoretical limits of PLS and provide guidance for the design of reliable multi-modal inference methods in high dimensions.
arXiv Detail & Related papers (2025-10-20T14:08:58Z)
- Theoretical Bounds for Stable In-Context Learning [0.0]
In-context learning (ICL) is flexible but its reliability is sensitive to prompt length. This paper establishes a non-asymptotic lower bound that links the minimal number of demonstrations to ICL stability. We propose a two-stage observable estimator with a one-shot calibration that produces practitioner-ready prompt-length estimates.
arXiv Detail & Related papers (2025-09-25T02:25:05Z)
- Revisit CP Tensor Decomposition: Statistical Optimality and Fast Convergence [6.724750970258851]
We revisit Canonical Polyadic (CP) tensor decomposition from a statistical perspective. We provide a comprehensive theoretical analysis of Alternating Least Squares (ALS) under a signal-plus-noise model.
arXiv Detail & Related papers (2025-05-29T03:42:03Z)
- Guided Diffusion Sampling on Function Spaces with Applications to PDEs [112.09025802445329]
We propose a general framework for conditional sampling in PDE-based inverse problems. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines.
arXiv Detail & Related papers (2025-05-22T17:58:12Z)
- On the Wasserstein Convergence and Straightness of Rectified Flow [54.580605276017096]
Rectified Flow (RF) is a generative model that aims to learn straight flow trajectories from noise to data. We provide a theoretical analysis of the Wasserstein distance between the sampling distribution of RF and the target distribution. We present general conditions guaranteeing uniqueness and straightness of 1-RF, which is in line with previous empirical findings.
arXiv Detail & Related papers (2024-10-19T02:36:11Z)
- Last-Iterate Convergence of Adaptive Riemannian Gradient Descent for Equilibrium Computation [52.73824786627612]
This paper establishes new convergence results for geodesic strongly monotone games. Our key result shows that RGD attains last-iterate linear convergence in a geometry-agnostic fashion. Overall, this paper presents the first geometry-agnostic last-iterate convergence analysis for games beyond the Euclidean setting.
arXiv Detail & Related papers (2023-06-29T01:20:44Z)
- Efficient CDF Approximations for Normalizing Flows [64.60846767084877]
We build upon the diffeomorphic properties of normalizing flows to estimate the cumulative distribution function (CDF) over a closed region.
Our experiments on popular flow architectures and UCI datasets show a marked improvement in sample efficiency as compared to traditional estimators.
arXiv Detail & Related papers (2022-02-23T06:11:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.