SGD with Dependent Data: Optimal Estimation, Regret, and Inference
- URL: http://arxiv.org/abs/2601.01371v1
- Date: Sun, 04 Jan 2026 04:52:11 GMT
- Title: SGD with Dependent Data: Optimal Estimation, Regret, and Inference
- Authors: Yinan Shen, Yichen Zhang, Wen-Xin Zhou
- Abstract summary: Stochastic gradient descent (SGD) is shown to accommodate both independent and dependent information under a broad class of stepsize schedules and exploration rate schemes. We show that SGD simultaneously achieves statistically optimal estimation error and regret, extending and improving existing results. For online sparse regression, we develop a new SGD-based algorithm that uses only $d$ units of storage and requires $O(d)$ flops per iteration.
- Score: 3.038061705362137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work investigates the performance of the final iterate produced by stochastic gradient descent (SGD) under temporally dependent data. We consider two complementary sources of dependence: $(i)$ martingale-type dependence in both the covariate and noise processes, which accommodates non-stationary and non-mixing time series data, and $(ii)$ dependence induced by sequential decision making. Our formulation runs in parallel with classical notions of (local) stationarity and strong mixing, while neither framework fully subsumes the other. Remarkably, SGD is shown to automatically accommodate both independent and dependent information under a broad class of stepsize schedules and exploration rate schemes. Non-asymptotically, we show that SGD simultaneously achieves statistically optimal estimation error and regret, extending and improving existing results. In particular, our tail bounds remain sharp even for a potentially infinite horizon $T=+\infty$. Asymptotically, the SGD iterates converge to a Gaussian distribution with only an $O_{\mathbb{P}}(1/\sqrt{t})$ remainder, demonstrating that the supposed estimation-regret trade-off claimed in prior work can in fact be avoided. We further propose a new "conic" approximation of the decision region that allows the covariates to have unbounded support. For online sparse regression, we develop a new SGD-based algorithm that uses only $d$ units of storage and requires $O(d)$ flops per iteration, achieving long-term statistical optimality. Intuitively, each incoming observation contributes to estimation accuracy, while aggregated summary statistics guide support recovery.
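A minimal sketch of the setting described in the abstract: last-iterate SGD for streaming linear regression with temporally dependent covariates, using only $d$ units of storage and $O(d)$ flops per step. The AR(1) covariate process, the stepsize constants and exponent, and the problem dimensions below are illustrative assumptions, not the paper's exact model or schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 20000
theta_star = rng.normal(size=d)   # unknown regression parameter

theta = np.zeros(d)               # SGD iterate: the only d units of statistical state
x = np.zeros(d)                   # previous covariate, carried over to induce dependence

for t in range(1, T + 1):
    # AR(1)-style covariates: a simple stand-in for the martingale-type
    # dependence studied in the paper (illustrative, not their exact model).
    x = 0.5 * x + rng.normal(size=d)
    y = x @ theta_star + rng.normal()      # noisy streaming response

    eta = 0.1 / t ** 0.6                   # polynomially decaying stepsize (assumed schedule)
    theta -= eta * (x @ theta - y) * x     # least-squares gradient update: O(d) flops

error = np.linalg.norm(theta - theta_star)
```

Even though consecutive covariates are serially correlated, the final iterate `theta` typically lands close to `theta_star` under this decaying-stepsize schedule, which is the qualitative behavior the non-asymptotic bounds formalize.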
Related papers
- Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization [62.48819955422706]
We study the long-term tail decay of SGD-based methods through the lens of large deviations theory. We uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.
arXiv Detail & Related papers (2026-02-05T13:41:13Z) - Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences [51.56484100374058]
We study stopping rules for stochastic gradient descent (SGD) for convex optimization. We develop an anytime-valid, data-dependent upper confidence sequence for the weighted average suboptimality of projected SGD. These are the first rigorous, time-uniform performance guarantees and finite-time $\varepsilon$-optimality certificates.
arXiv Detail & Related papers (2025-12-15T09:26:45Z) - Finite-Time Bounds for Distributionally Robust TD Learning with Linear Function Approximation [5.638124543342179]
We present the first finite-time bounds for distributionally robust temporal-difference learning with linear function approximation. Our results close a key gap between the empirical success of robust RL algorithms and the non-asymptotic guarantees enjoyed by their non-robust counterparts.
arXiv Detail & Related papers (2025-10-02T07:01:41Z) - From Continual Learning to SGD and Back: Better Rates for Continual Linear Models [50.11453013647086]
We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations. We develop novel last-iterate upper bounds in the realizable least squares setup. We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences.
arXiv Detail & Related papers (2025-04-06T18:39:45Z) - Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. This paper characterizes the run time and staleness of distributed SGD based on stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals. It is interestingly shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z) - Demystifying SGD with Doubly Stochastic Gradients [13.033133586372612]
We establish the convergence properties of doubly stochastic SGD with independent minibatching and random reshuffling under general conditions. We prove that random reshuffling improves the complexity dependence on the subsampling noise.
arXiv Detail & Related papers (2024-06-03T01:13:19Z) - High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise [59.25598762373543]
We establish high-probability convergence guarantees for nonlinear SGD on streaming data in the presence of heavy-tailed noise.
We also demonstrate, analytically and empirically, how to choose the preferred setting for a given problem.
arXiv Detail & Related papers (2023-10-28T18:53:41Z) - Demystifying the Myths and Legends of Nonconvex Convergence of SGD [17.445810977264067]
Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems.
Through our analyses, we address certain myths and legends related to the nonconvex convergence of SGD.
arXiv Detail & Related papers (2023-10-19T17:58:59Z) - Online covariance estimation for stochastic gradient descent under
Markovian sampling [20.02012768403544]
Convergence rates of order $O\big(\sqrt{d}\, n^{-1/8} (\log n)^{1/4}\big)$ are established under state-dependent and state-independent Markovian sampling.
Our method is applied to strategic classification with logistic regression, where adversaries adaptively modify features during training to affect the target classification.
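The entry above concerns estimating the covariance of SGD iterates online. As a rough illustration only, here is a classical batch-means heuristic for the asymptotic covariance of averaged SGD under i.i.d. sampling; it is not the paper's estimator (which handles Markovian sampling, and rigorous online versions typically need growing batch sizes under decaying stepsizes). The batch length `B`, stepsize, and dimensions are assumed choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, B = 3, 20000, 200          # B: batch length (an assumed tuning choice)
theta_star = np.ones(d)

theta = np.zeros(d)
theta_bar = np.zeros(d)          # running Polyak average of the iterates
batch_sum = np.zeros(d)
batch_means = []

for t in range(1, T + 1):
    x = rng.normal(size=d)                         # i.i.d. here for simplicity
    y = x @ theta_star + rng.normal()
    theta -= (0.5 / t ** 0.6) * (x @ theta - y) * x
    theta_bar += (theta - theta_bar) / t           # online average, O(d) memory
    batch_sum += theta
    if t % B == 0:                                 # close a batch of B iterates
        batch_means.append(batch_sum / B)
        batch_sum[:] = 0.0

# Batch-means plug-in: variability across batch means, rescaled by B,
# estimates the long-run covariance of the averaged iterate.
M = np.stack(batch_means)
Sigma_hat = B * np.cov(M, rowvar=False)
```

The design choice here is that batch means decorrelate the serially dependent iterates, so their empirical covariance (times `B`) serves as a plug-in long-run covariance estimate without storing the full iterate trajectory.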
arXiv Detail & Related papers (2023-08-03T00:21:30Z) - Retire: Robust Expectile Regression in High Dimensions [3.9391041278203978]
Penalized quantile and expectile regression methods offer useful tools to detect heteroscedasticity in high-dimensional data.
We propose and study (penalized) robust expectile regression (retire).
We show that the proposed procedure can be efficiently solved by a semismooth Newton coordinate descent algorithm.
arXiv Detail & Related papers (2022-12-11T18:03:12Z) - Scaling up Stochastic Gradient Descent for Non-convex Optimisation [5.908471365011942]
We propose a novel approach, DPSGD, to the problem of shared parallel computation.
By combining two strategies into a unified framework, DPSGD achieves a better trade-off between computation and convergence.
The potential gains of DPSGD are demonstrated on a Bayesian inference problem (latent Dirichlet allocation) and on a deep reinforcement learning (DRL) problem (advantage actor-critic, A2C).
arXiv Detail & Related papers (2022-10-06T13:06:08Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.