Suspicious Alignment of SGD: A Fine-Grained Step Size Condition Analysis
- URL: http://arxiv.org/abs/2601.11789v1
- Date: Fri, 16 Jan 2026 21:32:48 GMT
- Title: Suspicious Alignment of SGD: A Fine-Grained Step Size Condition Analysis
- Authors: Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, Minhak Song, Yaoqing Yang,
- Abstract summary: This paper explores the suspicious alignment phenomenon in stochastic gradient descent (SGD) under ill-conditioned optimization. Specifically, during the initial phase of SGD updates, the alignment between the gradient and the dominant subspace tends to decrease. We show that under sufficient ill-conditioning, a step size interval exists where projecting the SGD updates onto the bulk subspace decreases the loss while projecting them onto the dominant subspace increases the loss.
- Score: 30.6120085647449
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores the suspicious alignment phenomenon in stochastic gradient descent (SGD) under ill-conditioned optimization, where the Hessian spectrum splits into dominant and bulk subspaces. This phenomenon describes the behavior of gradient alignment in SGD updates. Specifically, during the initial phase of SGD updates, the alignment between the gradient and the dominant subspace tends to decrease. Subsequently, it enters a rising phase and eventually stabilizes in a high-alignment phase. The alignment is considered "suspicious" because, paradoxically, the projected gradient update along this highly aligned dominant subspace proves ineffective at reducing the loss. The focus of this work is to give a fine-grained analysis, in a high-dimensional quadratic setup, of how step size selection produces this phenomenon. Our main contributions can be summarized as follows: We propose a step-size condition revealing that in low-alignment regimes, an adaptive critical step size $\eta_t^*$ separates alignment-decreasing ($\eta_t < \eta_t^*$) from alignment-increasing ($\eta_t > \eta_t^*$) regimes, whereas in high-alignment regimes, the alignment is self-correcting and decreases regardless of the step size. We further show that under sufficient ill-conditioning, a step size interval exists where projecting the SGD updates onto the bulk subspace decreases the loss while projecting them onto the dominant subspace increases the loss, which explains a recent empirical observation that projecting gradient updates onto the dominant subspace is ineffective. Finally, based on this adaptive step-size theory, we prove that for a constant step size and large initialization, SGD exhibits a distinct two-phase behavior: an initial alignment-decreasing phase, followed by stabilization at high alignment.
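As a reading aid, the sketch below runs SGD on a toy ill-conditioned quadratic whose Hessian splits into a small dominant block and a large bulk block, and tracks the fraction of the stochastic gradient lying in the dominant subspace. This is not the authors' code: the dimensions, eigenvalues, isotropic noise model, and step size are illustrative assumptions, and whether the decreasing, rising, and high-alignment phases all appear depends on these choices.

```python
# Illustrative sketch (not the authors' code): SGD on a toy ill-conditioned
# quadratic f(x) = 0.5 * x^T H x, tracking how much of the stochastic gradient
# lies in the dominant (sharp) subspace. All parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

d, k = 200, 5                                          # ambient dimension, dominant-subspace size
eigs = np.concatenate([np.full(k, 50.0),               # dominant (sharp) eigenvalues
                       np.linspace(0.1, 1.0, d - k)])  # bulk (flat) eigenvalues
H = np.diag(eigs)                                      # Hessian; dominant subspace = first k coordinates

x = 10.0 * rng.standard_normal(d)                      # "large initialization"
eta, sigma = 0.03, 0.5                                 # constant step size and noise scale (assumed)

def dominant_alignment(g):
    """Fraction of the squared gradient norm lying in the dominant subspace."""
    return np.sum(g[:k] ** 2) / np.sum(g ** 2)

for t in range(301):
    g = H @ x + sigma * rng.standard_normal(d)         # stochastic gradient (isotropic noise, assumed)
    if t % 50 == 0:
        loss = 0.5 * x @ H @ x
        print(f"step {t:3d}   loss {loss:12.4f}   dominant alignment {dominant_alignment(g):.3f}")
    x = x - eta * g                                     # plain SGD step
```

Sweeping `eta` and the gap between the dominant and bulk eigenvalues in this sketch is one way to probe the critical step size $\eta_t^*$ described above, though the toy noise model here need not match the paper's assumptions.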
Related papers
- SPINAL -- Scaling-law and Preference Integration in Neural Alignment Layers [16.976750197698063]
We introduce SPINAL, a diagnostic that measures how alignment reshapes representations across depth. Across model families, DPO produces a layerwise calibration effect concentrated in the final decoder blocks. Aligned checkpoints show a late-layer ramp-up in contraction and a smooth reduction in transport, consistent with tightened and stabilized policy mass.
arXiv Detail & Related papers (2026-01-08T17:47:12Z) - Revisiting Zeroth-Order Optimization: Minimum-Variance Two-Point Estimators and Directionally Aligned Perturbations [57.179679246370114]
We identify the distribution of random perturbations that minimizes the estimator's variance as the perturbation stepsize tends to zero. Our findings reveal that such desired perturbations can align directionally with the true gradient, instead of maintaining a fixed length.
arXiv Detail & Related papers (2025-10-22T19:06:39Z) - Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares [33.60489399178793]
Gradient descent on neural networks is frequently performed in a large step size regime called the "edge of stability". We provide convergence rates for gradient descent with large learning rates in an overparametrised least squares setting.
arXiv Detail & Related papers (2025-10-20T13:02:41Z) - SGD Convergence under Stepsize Shrinkage in Low-Precision Training [0.0]
Quantizing gradients in low-precision training introduces magnitude shrinkage, which can change how stochastic gradient descent converges. We show that this shrinkage replaces the usual stepsize $\mu_k$ with an effective stepsize $\mu_k q_k$. We prove that low-precision SGD still converges, but at a slower pace set by $q_{\min}$ and with a higher steady error level due to quantization effects.
arXiv Detail & Related papers (2025-08-10T02:25:48Z) - Accelerating Neural Network Training Along Sharp and Flat Directions [6.576051895863941]
We study Bulk-SGD, a variant of SGD that restricts updates to the complement of the dominant subspace. We show that updates along the bulk subspace, corresponding to flatter directions in the loss landscape, can accelerate convergence but may compromise stability. Our findings suggest a principled approach to designing curvature-aware optimizers. A minimal sketch of this kind of projected update appears after this list.
arXiv Detail & Related papers (2025-05-17T12:13:05Z) - On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used. We provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity of the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS)
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z) - Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, and propose variations of the method that yield favorable convergence.
We prove that, when augmented with iteration averaging, SEG converges to the Nash equilibrium, and that this rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
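The Bulk-SGD entry above restricts SGD updates to the orthogonal complement of the dominant Hessian subspace. The sketch below shows one way such a bulk-projected update can be written; the quadratic model, the choice of k, the step size, and the noise scale are illustrative assumptions, not details taken from that paper.

```python
# Illustrative sketch of a bulk-projected SGD update (assumed construction,
# not code from the Bulk-SGD paper): each update is restricted to the
# orthogonal complement of the top-k Hessian eigenvectors.
import numpy as np

rng = np.random.default_rng(1)
d, k, eta, sigma = 100, 5, 0.05, 0.1                 # all illustrative assumptions

eigs = np.concatenate([np.full(k, 20.0), np.linspace(0.1, 1.0, d - k)])
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))     # random orthonormal eigenbasis
H = Q @ np.diag(eigs) @ Q.T                          # Hessian of f(x) = 0.5 * x^T H x

V_dom = Q[:, :k]                                     # top-k eigenvectors span the dominant subspace
P_bulk = np.eye(d) - V_dom @ V_dom.T                 # projector onto the bulk subspace

x = rng.standard_normal(d)
for t in range(200):
    g = H @ x + sigma * rng.standard_normal(d)       # stochastic gradient
    x = x - eta * (P_bulk @ g)                       # bulk-projected step: the dominant component is dropped

print("final loss:", 0.5 * x @ H @ x)
```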