Chebyshev Moment Regularization (CMR): Condition-Number Control with Moment Shaping
- URL: http://arxiv.org/abs/2510.21772v1
- Date: Fri, 17 Oct 2025 06:54:41 GMT
- Title: Chebyshev Moment Regularization (CMR): Condition-Number Control with Moment Shaping
- Authors: Jinwoo Baek
- Abstract summary: We introduce Chebyshev Moment Regularization (CMR), a simple, architecture-agnostic loss that directly optimizes layer spectra. CMR jointly controls spectral edges via a log-condition proxy and shapes the interior via Chebyshev moments. These results support optimization-driven spectral preconditioning: directly steering models toward well-conditioned regimes for stable, accurate learning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce \textbf{Chebyshev Moment Regularization (CMR)}, a simple, architecture-agnostic loss that directly optimizes layer spectra. CMR jointly controls spectral edges via a log-condition proxy and shapes the interior via Chebyshev moments, with a decoupled, capped mixing rule that preserves task gradients. We prove strictly monotone descent for the condition proxy, bounded moment gradients, and orthogonal invariance. In an adversarial ``$\kappa$-stress'' setting (MNIST, 15-layer MLP), \emph{compared to vanilla training}, CMR reduces mean layer condition numbers by $\sim\!10^3$ (from $\approx3.9\!\times\!10^3$ to $\approx3.4$ in 5 epochs), increases average gradient magnitude, and restores test accuracy ( $\approx10\%\!\to\!\approx86\%$ ). These results support \textbf{optimization-driven spectral preconditioning}: directly steering models toward well-conditioned regimes for stable, accurate learning.
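A rough, hypothetical sketch of what such a penalty could look like in PyTorch is given below: it forms a log-condition proxy $\log(\sigma_{\max}/\sigma_{\min})$ and low-order Chebyshev moments of each layer's singular values, then mixes the result into the task loss with a capped coefficient. The rescaling of the spectrum to $[-1,1]$, the squared moment gap, and the specific capping rule are illustrative assumptions reconstructed from the abstract alone, not the authors' implementation.

```python
# Hypothetical CMR-style penalty (illustrative, not the authors' code).
import torch

def chebyshev_moments(s: torch.Tensor, num_moments: int = 4) -> torch.Tensor:
    """Average Chebyshev polynomials T_k over singular values rescaled to [-1, 1]."""
    x = 2.0 * (s - s.min()) / (s.max() - s.min() + 1e-12) - 1.0   # spectrum -> [-1, 1]
    t_prev, t_curr = torch.ones_like(x), x                        # T_0, T_1
    moments = [t_prev.mean(), t_curr.mean()]
    for _ in range(2, num_moments):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev        # T_{k+1} = 2x T_k - T_{k-1}
        moments.append(t_curr.mean())
    return torch.stack(moments)

def cmr_penalty(weight: torch.Tensor, target_moments: torch.Tensor) -> torch.Tensor:
    s = torch.linalg.svdvals(weight)                              # layer spectrum
    log_cond = torch.log(s.max() / (s.min() + 1e-12))             # log-condition proxy
    moment_gap = (chebyshev_moments(s, len(target_moments)) - target_moments).pow(2).sum()
    return log_cond + moment_gap

def total_loss(task_loss, model, target_moments, lam: float = 0.1, cap: float = 0.5):
    # Assumes the model has at least one 2-D weight matrix.
    reg = sum(cmr_penalty(p, target_moments) for p in model.parameters() if p.dim() == 2)
    # "Capped mixing": never let the spectral term dominate the task gradient.
    mix = min(lam, cap * task_loss.detach().item() / (reg.detach().item() + 1e-12))
    return task_loss + mix * reg
```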
Related papers
- Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training [0.0]
Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a \emph{rank-aware} concentration inequality: when the interaction matrix $M = W_Q W_K^\top$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d\,\cdots)$.
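To make the bilinear form concrete, here is a toy NumPy check (dimensions and random weights are made up) that the interaction matrix $M = W_Q W_K^\top$ has rank at most $d_h$, and that $\max_{i,j}|S_{ij}|$ is the quantity whose tails the paper bounds.

```python
import numpy as np

d, d_h, n = 64, 16, 32                        # model dim, head dim, sequence length
rng = np.random.default_rng(0)
W_Q = rng.standard_normal((d, d_h))
W_K = rng.standard_normal((d, d_h))
X = rng.standard_normal((n, d))               # token representations x_i

M = W_Q @ W_K.T                               # interaction matrix, rank <= d_h << d
S = (X @ M @ X.T) / np.sqrt(d_h)              # S_ij = x_i^T M x_j / sqrt(d_h)

print("rank(M) =", np.linalg.matrix_rank(M))  # at most d_h = 16
print("max |S_ij| =", np.abs(S).max())        # governs overflow risk in low precision
```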
arXiv Detail & Related papers (2026-02-21T14:29:22Z) - Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning [61.07540493350384]
Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions. We show that, for any prediction risk, the optimally mixed student improves upon the ridge teacher for every regularization level. We propose a consistent one-shot tuning method to estimate the optimal mixing weight without grid search, sample splitting, or refitting.
arXiv Detail & Related papers (2026-02-19T17:21:15Z) - Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise [17.899443444882888]
We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD). We demonstrate that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known, and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when problem parameters are unknown. In contrast, we prove that clipping may fail to converge in the worst case due to the statistical dependence between the preconditioner and the gradient estimates.
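For contrast, a generic sketch of the two preconditioning rules being compared, normalization versus clipping, applied to plain SGD updates; this is standard normalized/clipped SGD, not necessarily the paper's exact SPSGD construction.

```python
import torch

def normalized_sgd_step(params, lr):
    with torch.no_grad():
        g_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params)) + 1e-12
        for p in params:
            p -= lr * p.grad / g_norm           # step length is always lr, regardless of noise

def clipped_sgd_step(params, lr, clip=1.0):
    with torch.no_grad():
        g_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params)) + 1e-12
        scale = min(1.0, clip / g_norm.item())  # rescale only when the norm exceeds the clip level
        for p in params:
            p -= lr * scale * p.grad
```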
arXiv Detail & Related papers (2026-02-13T19:29:17Z) - Information Hidden in Gradients of Regression with Target Noise [2.8911861322232686]
We show that the gradients alone can reveal the Hessian. We provide non-asymptotic operator-norm guarantees under sub-Gaussian inputs.
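The mechanism is easy to see in the linear case: for a ridge objective the gradient is affine in the weights, so differences of gradients at known points recover the Hessian exactly. The toy NumPy check below (arbitrary sizes) illustrates this; the paper's non-asymptotic analysis with target noise is of course more general.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 200, 5, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)    # noisy targets

def grad(w):
    """Gradient of the ridge objective (1/2n)||Xw - y||^2 + (lam/2)||w||^2."""
    return X.T @ (X @ w - y) / n + lam * w

H = X.T @ X / n + lam * np.eye(d)                                # true Hessian
E = np.eye(d)
H_from_grads = np.stack([grad(E[j]) - grad(np.zeros(d)) for j in range(d)], axis=1)
print(np.allclose(H, H_from_grads))                              # True: gradients reveal H
```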
arXiv Detail & Related papers (2026-01-26T14:50:16Z) - INC: An Indirect Neural Corrector for Auto-Regressive Hybrid PDE Solvers [61.84396402100827]
We propose the Indirect Neural Corrector ($\mathrm{INC}$), which integrates learned corrections into the governing equations. $\mathrm{INC}$ reduces the error amplification on the order of $t^{-1} + L$, where $t$ is the timestep and $L$ the Lipschitz constant. We test $\mathrm{INC}$ in extensive benchmarks, covering numerous differentiable solvers, neural backbones, and test cases ranging from a 1D chaotic system to 3D turbulence.
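A minimal way to picture the "indirect" idea, under the assumption that it amounts to injecting the learned term into the right-hand side of the governing equation before the solver integrates it; `f_physics`, `correction_net`, and the explicit Euler step below are placeholders, not the paper's API.

```python
import torch

def indirect_corrector_step(u, dt, f_physics, correction_net):
    rhs = f_physics(u) + correction_net(u)    # learned term enters the governing equation
    return u + dt * rhs                       # the (differentiable) solver integrates both

def direct_corrector_step(u, dt, f_physics, correction_net):
    u_next = u + dt * f_physics(u)            # plain solver step
    return u_next + correction_net(u_next)    # correction applied outside the dynamics
```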
arXiv Detail & Related papers (2025-11-16T20:14:28Z) - Robust Layerwise Scaling Rules by Proper Weight Decay Tuning [50.11170157029911]
In modern scale-invariant architectures, training quickly enters a decay-governed steady state. We introduce a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Our results extend $\mu$P beyond the near-init regime by explicitly controlling the steady-state scales set by the optimizer's parameters.
arXiv Detail & Related papers (2025-10-17T02:58:35Z) - Numerical Fragility in Transformers: A Layer-wise Theory for Explaining, Forecasting, and Mitigating Instability [0.0]
We give a first-order, module-wise theory that predicts when and where errors grow. For self-attention we derive a per-layer bound that factorizes into three interpretable diagnostics. We also introduce a precision- and width-aware LayerNorm indicator $\rho_{\rm LN}$ with a matching first-order bound.
arXiv Detail & Related papers (2025-10-17T01:03:02Z) - OrthoGrad Improves Neural Calibration [0.0]
$\perp$Grad is a geometry-aware modification to optimization that constrains descent directions to address overconfidence.
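One common way to constrain descent directions in this spirit is to project out the gradient component parallel to the current weights before the optimizer step, so descent cannot simply inflate the weight norm. The sketch below shows that generic projection; it may differ from the paper's exact $\perp$Grad rule.

```python
import torch

def project_grads_orthogonal(model):
    """Replace each gradient g by g - proj_w(g), i.e. its component orthogonal to the weights."""
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            w, g = p.view(-1), p.grad.view(-1)
            coeff = torch.dot(g, w) / (torch.dot(w, w) + 1e-12)
            p.grad -= (coeff * w).view_as(p.grad)
```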
arXiv Detail & Related papers (2025-06-04T22:12:46Z) - On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [54.28350823319057]
This paper considers RMSProp and its momentum extension and establishes a convergence rate of $O(\frac{\sqrt{d}}{T^{1/4}})$ for $\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\|\nabla f(x_k)\|_1$. Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$. Our convergence rate can be considered to be analogous to the $\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\|\nabla f(x_k)\|_2 \le O(\frac{1}{T^{1/4}})$ rate of SGD measured by the $\ell_2$ norm.
arXiv Detail & Related papers (2024-02-01T07:21:32Z) - Improved techniques for deterministic l2 robustness [63.34032156196848]
Training convolutional neural networks (CNNs) with a strict 1-Lipschitz constraint under the $l_2$ norm is useful for adversarial robustness, interpretable gradients and stable training.
We introduce a procedure to certify robustness of 1-Lipschitz CNNs by replacing the last linear layer with a 1-hidden-layer MLP.
We significantly advance the state-of-the-art for standard and provable robust accuracies on CIFAR-10 and CIFAR-100.
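For context, the standard certificate such 1-Lipschitz (in $l_2$) classifiers enable: an input perturbation of $l_2$ norm $\epsilon$ can change any pairwise logit gap by at most $\sqrt{2}\,\epsilon$, so the prediction cannot flip as long as the top-two margin exceeds $\sqrt{2}\,\epsilon$. The sketch below uses toy logits, not the paper's architecture.

```python
import math
import torch

def certified_radius(logits: torch.Tensor) -> torch.Tensor:
    """Per-example certified l2 radius for a 1-Lipschitz classifier: margin / sqrt(2)."""
    top2 = logits.topk(2, dim=-1).values
    margin = top2[..., 0] - top2[..., 1]
    return margin / math.sqrt(2.0)

logits = torch.tensor([[4.2, 1.1, -0.3], [2.0, 1.9, 0.5]])
print(certified_radius(logits))   # first example is certified to a much larger radius
```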
arXiv Detail & Related papers (2022-11-15T19:10:12Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation workers with varying communication frequency. In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \tau_{avg}\,\cdots\right)$. We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Faster Perturbed Stochastic Gradient Methods for Finding Local Minima [92.99933928528797]
We propose Pullback, a faster perturbed stochastic gradient framework for finding local minima.
We show that Pullback with stochastic gradient estimators such as SARAH/SPIDER and STORM can find $(\epsilon, \epsilon_H)$-approximate local minima within $\tilde{O}(\epsilon^{-3} + \epsilon_H^{-6})$.
The core idea of our framework is a step-size "pullback" scheme to control the average movement of the gradient evaluations.
arXiv Detail & Related papers (2021-10-25T07:20:05Z) - A New Framework for Variance-Reduced Hamiltonian Monte Carlo [88.84622104944503]
We propose a new framework of variance-reduced Hamiltonian Monte Carlo (HMC) methods for sampling from an $L$-smooth and $m$-strongly log-concave distribution.
We show that HMC methods based on unbiased gradient estimators, including SAGA and SVRG, achieve the highest gradient efficiency with small batch sizes.
Experimental results on both synthetic and real-world benchmark data show that our new framework significantly outperforms both full-gradient and stochastic-gradient HMC approaches.
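As a reference point for the estimators mentioned above, a hedged sketch of an SVRG-style gradient estimator of the kind such variance-reduced HMC methods plug into their dynamics; `grad_i`, the snapshot bookkeeping, and the batch indices are placeholders, not the paper's framework.

```python
import numpy as np

def svrg_gradient(x, snapshot, full_grad_at_snapshot, grad_i, batch_idx):
    """Unbiased, variance-reduced estimate of the full gradient at x.

    grad_i(w, i) returns the gradient of the i-th sample's potential at w;
    full_grad_at_snapshot is the exact gradient precomputed at the snapshot point.
    """
    g_batch = np.mean([grad_i(x, i) for i in batch_idx], axis=0)
    g_batch_snap = np.mean([grad_i(snapshot, i) for i in batch_idx], axis=0)
    return full_grad_at_snapshot + (g_batch - g_batch_snap)   # control-variate correction
```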
arXiv Detail & Related papers (2021-02-09T02:44:24Z) - Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parameterized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z) - Logsmooth Gradient Concentration and Tighter Runtimes for Metropolized Hamiltonian Monte Carlo [23.781520510778716]
This is the first high-accuracy mixing time result for log-concave distributions using only first-order function information.
We give evidence that the dependence on $\kappa$ is likely to be necessary for standard Metropolized first-order methods.
arXiv Detail & Related papers (2020-02-10T22:44:50Z)