Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates, Oracle Complexity, and Information-Theoretic Limits
- URL: http://arxiv.org/abs/2603.02417v1
- Date: Mon, 02 Mar 2026 21:57:09 GMT
- Title: Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates, Oracle Complexity, and Information-Theoretic Limits
- Authors: Daniel Zantedeschi, Kumar Muthuraman
- Abstract summary: We develop a Fisher-geometric theory of stochastic gradient descent in which mini-batch noise is an intrinsic, loss-induced matrix. We prove oracle-complexity guarantees for epsilon-stationarity in the Fisher dual norm.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop a Fisher-geometric theory of stochastic gradient descent (SGD) in which mini-batch noise is an intrinsic, loss-induced matrix -- not an exogenous scalar variance. Under exchangeable sampling, the mini-batch gradient covariance is pinned down (to leading order) by the projected covariance of per-sample gradients: it equals projected Fisher information for well-specified likelihood losses and the projected Godambe (sandwich) matrix for general M-estimation losses. This identification forces a diffusion approximation with Fisher/Godambe-structured volatility (effective temperature tau = eta/b) and yields an Ornstein-Uhlenbeck linearization whose stationary covariance is given in closed form by a Fisher-Lyapunov equation. Building on this geometry, we prove matching minimax upper and lower bounds of order Theta(1/N) for Fisher/Godambe risk under a total oracle budget N; the lower bound holds under a martingale oracle condition (bounded predictable quadratic variation), strictly subsuming i.i.d. and exchangeable sampling. These results imply oracle-complexity guarantees for epsilon-stationarity in the Fisher dual norm that depend on an intrinsic effective dimension and a Fisher/Godambe condition number rather than ambient dimension or Euclidean conditioning. Experiments confirm the Lyapunov predictions and show that scalar temperature matching cannot reproduce directional noise structure.
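The closed-form stationary covariance claimed in the abstract can be illustrated numerically. The sketch below is not the authors' code: the symbols H (local Hessian), F (Fisher/Godambe noise matrix), and tau = eta/b (effective temperature) are assumptions read off the abstract. It solves the Fisher-Lyapunov equation H Sigma + Sigma H^T = tau F for the Ornstein-Uhlenbeck stationary covariance and checks the well-specified case F = H, where the solution collapses to Sigma = (tau/2) I.
```python
# Minimal sketch (assumed notation, not the authors' implementation):
# stationary covariance of the OU linearization of SGD near a minimum,
# obtained by solving the Fisher-Lyapunov equation
#     H @ Sigma + Sigma @ H.T = tau * F,
# with H the local Hessian, F the Fisher/Godambe gradient covariance,
# and tau = eta / b the effective temperature.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(0)
d = 5
eta, b = 0.1, 32
tau = eta / b  # effective temperature tau = eta / b

# Random symmetric positive-definite Hessian H and noise matrix F.
A = rng.standard_normal((d, d))
H = A @ A.T + d * np.eye(d)
B = rng.standard_normal((d, d))
F = B @ B.T + np.eye(d)

# Stationary covariance of  d(theta) = -H theta dt + sqrt(tau) F^{1/2} dW,
# i.e. the solution of  H Sigma + Sigma H^T = tau F.
Sigma = solve_continuous_lyapunov(H, tau * F)
print("Lyapunov residual:", np.linalg.norm(H @ Sigma + Sigma @ H.T - tau * F))

# Well-specified case F = H: the equation reduces to Sigma = (tau / 2) * I,
# recovering the familiar scalar-temperature picture.
Sigma_ws = solve_continuous_lyapunov(H, tau * H)
assert np.allclose(Sigma_ws, 0.5 * tau * np.eye(d), atol=1e-6)
```
When F and H do not commute, Sigma is anisotropic and cannot be reproduced by any scalar temperature, which is the directional-noise effect the abstract's experiments point to.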
Related papers
- Minimum Wasserstein distance estimator under covariate shift: closed-form, super-efficiency and irregularity [9.668478511115683]
We propose a minimum Wasserstein distance estimation framework that avoids explicit modeling of outcome regressions or importance weights. The resulting W-estimator admits a closed-form expression and is numerically equivalent to a classical 1-nearest neighbor estimator. Numerical simulations, along with an analysis of a rainfall dataset, underscore the exceptional performance of our W-estimator.
arXiv Detail & Related papers (2026-01-12T07:36:44Z) - Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting [44.23640219583819]
Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. We propose Entropy-Adaptive Fine-Tuning (EAFT) to solve this problem. EAFT consistently matches the downstream performance of standard SFT while significantly mitigating the degradation of general capabilities.
arXiv Detail & Related papers (2026-01-05T14:28:17Z) - Bregman geometry-aware split Gibbs sampling for Bayesian Poisson inverse problems [8.115032818930457]
We propose a novel framework for solving Bayesian Poisson inverse problems via a Monte Carlo sampling algorithm. We show that the method achieves competitive performance in terms of reconstruction quality.
arXiv Detail & Related papers (2025-11-15T15:27:31Z) - Revisiting Zeroth-Order Optimization: Minimum-Variance Two-Point Estimators and Directionally Aligned Perturbations [57.179679246370114]
We identify the distribution of random perturbations that minimizes the estimator's variance as the perturbation stepsize tends to zero. Our findings reveal that such desired perturbations can align directionally with the true gradient, instead of maintaining a fixed length.
arXiv Detail & Related papers (2025-10-22T19:06:39Z) - Spectral Thresholds for Identifiability and Stability:Finite-Sample Phase Transitions in High-Dimensional Learning [0.0]
In high-dimensional learning, models remain stable until they collapse abruptly once the sample size falls below a critical level. Our Fisher Threshold Theorem formalizes this by proving that stability requires the minimal Fisher eigenvalue to exceed an explicit $O(\sqrt{d/n})$ bound. Unlike prior or model-specific criteria, this threshold is finite-sample and necessary, marking a sharp phase transition between reliable concentration and inevitable failure.
arXiv Detail & Related papers (2025-10-04T13:33:48Z) - A Computable Measure of Suboptimality for Entropy-Regularised Variational Objectives [17.212481754312048]
Several emerging post-Bayesian methods target a probability distribution for which an entropy-regularised variational objective is minimised. This increased flexibility introduces a computational challenge, as one loses access to an explicit unnormalised density for the target. We introduce a novel measure of suboptimality called 'gradient discrepancy', and in particular a 'kernel gradient discrepancy' that can be explicitly computed.
arXiv Detail & Related papers (2025-09-12T16:38:41Z) - Rao-Blackwell Gradient Estimators for Equivariant Denoising Diffusion [55.95767828747407]
In domains such as molecular and protein generation, physical systems exhibit inherent symmetries that are critical to model. We present a framework that reduces training variance and provides a provably lower-variance gradient estimator. We also present a practical implementation of this estimator, incorporated into the loss and sampling procedure through a method we call Orbit Diffusion.
arXiv Detail & Related papers (2025-02-14T03:26:57Z) - A Geometric Unification of Distributionally Robust Covariance Estimators: Shrinking the Spectrum by Inflating the Ambiguity Set [20.166217494056916]
We propose a principled approach to construct covariance estimators without imposing restrictive assumptions. We show that our robust estimators are efficiently computable and consistent. Numerical experiments based on synthetic and real data show that our robust estimators are competitive with state-of-the-art estimators.
arXiv Detail & Related papers (2024-05-30T15:01:18Z) - Variation Due to Regularization Tractably Recovers Bayesian Deep Learning [44.16006844888796]
We propose an uncertainty quantification method for large networks based on variation due to regularization. We show that regularization variation (RegVar) provides rigorous uncertainty estimates that, in the infinitesimal limit, exactly recover the Laplace approximation in Bayesian deep learning. Our experiments across multiple datasets show that RegVar not only identifies uncertain predictions effectively but also provides insights into the stability of learned representations.
arXiv Detail & Related papers (2024-03-15T20:47:39Z) - Efficiently Escaping Saddle Points for Policy Optimization [43.636107996849375]
Policy gradient (PG) is widely used in reinforcement learning due to its scalability and good performance. We propose a variance-reduced second-order method that uses second-order information in the form of Hessian vector products (HVP) and converges to an approximate second-order stationary point (SOSP) with sample complexity of $\tilde{O}(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-11-15T12:36:45Z) - Adaptive Annealed Importance Sampling with Constant Rate Progress [68.8204255655161]
Annealed Importance Sampling (AIS) synthesizes weighted samples from an intractable distribution.
We propose the Constant Rate AIS algorithm and its efficient implementation for $\alpha$-divergences.
arXiv Detail & Related papers (2023-06-27T08:15:28Z) - Bayesian Renormalization [68.8204255655161]
We present a fully information theoretic approach to renormalization inspired by Bayesian statistical inference.
The main insight of Bayesian Renormalization is that the Fisher metric defines a correlation length that plays the role of an emergent RG scale.
We provide insight into how the Bayesian Renormalization scheme relates to existing methods for data compression and data generation.
arXiv Detail & Related papers (2023-05-17T18:00:28Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Robust Implicit Networks via Non-Euclidean Contractions [63.91638306025768]
Implicit neural networks show improved accuracy and significant reduction in memory consumption.
They can suffer from ill-posedness and convergence instability.
This paper provides a new framework to design well-posed and robust implicit neural networks.
arXiv Detail & Related papers (2021-06-06T18:05:02Z)