High-Dimensional Learning Dynamics of Quantized Models with Straight-Through Estimator
- URL: http://arxiv.org/abs/2510.10693v1
- Date: Sun, 12 Oct 2025 16:43:46 GMT
- Title: High-Dimensional Learning Dynamics of Quantized Models with Straight-Through Estimator
- Authors: Yuma Ichikawa, Shuhei Kashiwamura, Ayaka Sakata
- Abstract summary: Quantized neural network training optimizes a discrete, non-differentiable objective. The straight-through estimator (STE) enables backpropagation through surrogate gradients. We theoretically show that in the high-dimensional limit, STE dynamics converge to a deterministic ordinary differential equation.
- Score: 7.837881800517111
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantized neural network training optimizes a discrete, non-differentiable objective. The straight-through estimator (STE) enables backpropagation through surrogate gradients and is widely used. While previous studies have primarily focused on the properties of surrogate gradients and their convergence, the influence of quantization hyperparameters, such as bit width and quantization range, on learning dynamics remains largely unexplored. We theoretically show that in the high-dimensional limit, STE dynamics converge to a deterministic ordinary differential equation. This reveals that STE training exhibits a plateau followed by a sharp drop in generalization error, with plateau length depending on the quantization range. A fixed-point analysis quantifies the asymptotic deviation from the unquantized linear model. We also extend analytical techniques for stochastic gradient descent to nonlinear transformations of weights and inputs.
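The STE mechanism the abstract describes, quantized values in the forward pass with a surrogate identity gradient in the backward pass, can be sketched in a few lines. A minimal PyTorch illustration, assuming a uniform quantizer; the bit width k and quantization range [-r, r] are illustrative choices, not the paper's settings:

```python
import torch

def quantize(w, k=4, r=1.0):
    """Uniform k-bit quantization of w onto [-r, r] (non-differentiable)."""
    levels = 2 ** k - 1
    w_c = torch.clamp(w, -r, r)
    return torch.round((w_c + r) / (2 * r) * levels) / levels * (2 * r) - r

def ste_quantize(w, k=4, r=1.0):
    """STE: quantized value in the forward pass, identity (surrogate)
    gradient to the latent full-precision weight in the backward pass."""
    return w + (quantize(w, k, r) - w).detach()

w = torch.randn(8, requires_grad=True)
loss = (ste_quantize(w) ** 2).sum()
loss.backward()
print(w.grad)  # equals d(loss)/d(w_q), passed straight through to w
```

The detach trick makes the forward value exactly the quantized weight while the backward pass treats the quantizer as the identity, which is the surrogate-gradient scheme whose high-dimensional dynamics the paper analyzes.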
Related papers
- Learning under Quantization for High-Dimensional Linear Regression [34.214978824165236]
Low-bit quantization has emerged as an indispensable technique for enabling the efficient training of large-scale models. Despite its widespread empirical success, a rigorous theoretical understanding of its impact on learning performance remains notably absent. We present the first systematic theoretical study of this fundamental question, analyzing finite-step stochastic gradient descent (SGD) for high-dimensional linear regression.
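As a toy illustration of the setting studied there, the following sketch runs online SGD for linear regression with quantized weights in the forward pass and a straight-through gradient to the latent weights; the dimension, bit width, and step size are arbitrary illustrative values, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_steps, lr = 100, 2000, 0.05  # illustrative dimension, steps, step size

def quantize(w, k=4, r=1.0):
    """Uniform k-bit quantization of w onto the range [-r, r]."""
    levels = 2 ** k - 1
    w_c = np.clip(w, -r, r)
    return np.round((w_c + r) / (2 * r) * levels) / levels * (2 * r) - r

w_star = rng.standard_normal(d) / np.sqrt(d)  # ground-truth regressor
w = np.zeros(d)                               # latent full-precision weights

for _ in range(n_steps):                      # online SGD, fresh sample each step
    x = rng.standard_normal(d)
    y = x @ w_star
    y_hat = x @ quantize(w)                   # forward pass uses quantized weights
    w -= lr * (y_hat - y) * x                 # STE: gradient applied to latent w

print("squared parameter error:", np.sum((quantize(w) - w_star) ** 2))
```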
arXiv Detail & Related papers (2025-10-21T03:30:11Z)
- Beyond Discreteness: Finite-Sample Analysis of Straight-Through Estimator for Quantization [2.8948274245812327]
This work presents the first finite-sample analysis of the straight-through estimator (STE) in the context of neural network quantization. Specifically, by analyzing the quantization-aware training of a two-layer neural network with binary weights and activations, we derive a sample complexity bound in terms of the data dimensionality. In the presence of label noise, we uncover an intriguing recurrence property of the STE-gradient method, where the iterates repeatedly escape from and return to the optimal binary weights.
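A hedged sketch of the kind of model analyzed there, a two-layer network with binary weights and activations trained via an STE, assuming the common clipped-identity surrogate; layer sizes and data are placeholders:

```python
import torch

def binarize(x):
    """Forward: sign(x). Backward: clipped-identity surrogate
    (gradient 1 on |x| <= 1, 0 outside), the usual STE choice."""
    xc = torch.clamp(x, -1.0, 1.0)
    return xc + (torch.sign(x) - xc).detach()

# Placeholder two-layer network with binary weights and activations.
d, h, n = 16, 8, 32
W1 = torch.randn(h, d, requires_grad=True)  # latent full-precision weights
W2 = torch.randn(1, h, requires_grad=True)

x = torch.randn(n, d)
y = torch.randn(n, 1)

hidden = binarize(x @ binarize(W1).T)       # binary weights and activations
out = hidden @ binarize(W2).T
loss = ((out - y) ** 2).mean()
loss.backward()                             # STE gradients reach W1 and W2
```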
arXiv Detail & Related papers (2025-05-23T17:11:22Z)
- An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models [18.99511760351873]
Recent experiments have shown that the training trajectories of multiple deep neural networks evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold. Inspired by the similarities between the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent.
arXiv Detail & Related papers (2025-05-13T19:20:19Z)
- Training Dynamics of Nonlinear Contrastive Learning Model in the High Dimensional Limit [1.7597525104451157]
The empirical distribution of the model weights converges to a deterministic measure governed by a McKean-Vlasov nonlinear partial differential equation (PDE). Under L2 regularization, this PDE reduces to a closed set of low-dimensional ordinary differential equations (ODEs). We analyze the locations and stability of the fixed points of these ODEs, unveiling several interesting findings.
arXiv Detail & Related papers (2024-06-11T03:07:41Z)
- Scaling and renormalization in high-dimensional regression [72.59731158970894]
We present a unifying perspective on recent results on ridge regression. We use the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning. Our results extend and provide a unifying perspective on earlier models of scaling laws.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
- Robust Implicit Regularization via Weight Normalization [5.37610807422229]
We show that weight normalization enables a robust implicit bias that persists even if the weights are initialized at a practically large scale.
Experiments suggest that both convergence speed and the robustness of the implicit bias improve dramatically with weight normalization.
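For reference, weight normalization reparameterizes a weight vector as w = g · v/‖v‖, decoupling scale from direction. A minimal hand-rolled sketch (PyTorch also provides torch.nn.utils.weight_norm for modules); dimensions and data are placeholders:

```python
import torch

d, n = 10, 32
v = torch.randn(d, requires_grad=True)  # direction parameter
g = torch.ones(1, requires_grad=True)   # scale parameter

def weight(v, g):
    """Weight normalization: w = g * v / ||v||."""
    return g * v / v.norm()

x = torch.randn(n, d)
y = torch.randn(n)
loss = ((x @ weight(v, g) - y) ** 2).mean()
loss.backward()
# The gradient w.r.t. v is orthogonal to v itself: only the direction of v
# is updated, while the scale is carried entirely by g.
print(torch.dot(v.grad, v))  # ~0 up to floating-point error
```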
arXiv Detail & Related papers (2023-05-09T13:38:55Z)
- High-dimensional limit theorems for SGD: Effective dynamics and critical scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD).
We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss.
Around the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
arXiv Detail & Related papers (2022-06-08T17:42:18Z)
- Global Convergence of Over-parameterized Deep Equilibrium Models [52.65330015267245]
A deep equilibrium model (DEQ) is implicitly defined through an equilibrium point of an infinite-depth weight-tied model with an input-injection.
Instead of unrolling infinitely many layers, it computes the equilibrium point directly with root-finding and obtains gradients with implicit differentiation.
We propose a novel probabilistic framework to overcome the technical difficulty in the non-asymptotic analysis of infinite-depth weight-tied models.
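A minimal sketch of the DEQ recipe described above: fixed-point iteration for the forward pass, and, instead of full implicit differentiation, a single differentiable application of f at the equilibrium (a common cheap approximation, not the paper's construction; sizes and the map f are illustrative):

```python
import torch

d = 8
W = (torch.randn(d, d) * (0.5 / d ** 0.5)).requires_grad_(True)  # roughly contractive
U = torch.randn(d, d).requires_grad_(True)

def f(z, x):
    """One layer of the weight-tied model with input injection."""
    return torch.tanh(z @ W.T + x @ U.T)

def deq_forward(x, iters=50):
    z = torch.zeros_like(x)
    with torch.no_grad():          # root-finding: no graph needed
        for _ in range(iters):
            z = f(z, x)
    # One differentiable step at the (approximate) equilibrium, a cheap
    # stand-in for full implicit differentiation through z* = f(z*, x).
    return f(z, x)

x = torch.randn(4, d)
loss = deq_forward(x).pow(2).mean()
loss.backward()                    # gradients reach W and U
```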
arXiv Detail & Related papers (2022-05-27T08:00:13Z)
- A Dynamical Central Limit Theorem for Shallow Neural Networks [48.66103132697071]
We prove that the fluctuations around the mean limit remain bounded in mean square throughout training.
If the mean-field dynamics converges to a measure that interpolates the training data, we prove that the deviation eventually vanishes in the CLT scaling.
arXiv Detail & Related papers (2020-08-21T18:00:50Z)
- The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a double descent curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z)
- Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks [85.94999581306827]
Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and difficulty of optimization over discrete weights.
Many successful experimental results have been achieved with empirical straight-through (ST) approaches.
At the same time, ST methods can be truly derived as estimators in the stochastic binary network (SBN) model with Bernoulli weights.
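One way to make the SBN view concrete: weights are sampled as b ∈ {−1, +1} with P(b = +1) = σ(θ), and a mean-propagation ST estimator backpropagates through the sample as if it were the mean E[b] = tanh(θ/2). A hedged sketch of one simple ST variant, not necessarily the estimator derived in the paper:

```python
import torch

theta = torch.randn(16, requires_grad=True)  # logits of the weight distribution

def st_sample(theta):
    """Forward: Bernoulli sample b in {-1, +1} with P(b=+1) = sigmoid(theta).
    Backward: gradient of the mean E[b] = tanh(theta / 2)."""
    p = torch.sigmoid(theta)
    b = 2.0 * torch.bernoulli(p) - 1.0
    mean = torch.tanh(theta / 2.0)
    return mean + (b - mean).detach()

x = torch.randn(16)
loss = (x @ st_sample(theta) - 1.0) ** 2  # toy linear readout of sampled weights
loss.backward()
print(theta.grad)  # gradients flow to the distribution parameters
```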
arXiv Detail & Related papers (2020-06-11T23:58:18Z)
- Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of stochasticity in its success is still unclear.
We show that multiplicative noise, as commonly arises due to variance in minibatch gradients, results in heavy-tailed stationary behavior in the parameters.
A detailed analysis describes how key factors, including step size, batch size, and the data, influence this behavior, with similar results observed on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
- Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)