Related papers: Estimation of Toeplitz Covariance Matrices using Overparameterized Gradient Descent

Estimation of Toeplitz Covariance Matrices using Overparameterized Gradient Descent

URL: http://arxiv.org/abs/2511.01605v1
Date: Mon, 03 Nov 2025 14:07:53 GMT
Title: Estimation of Toeplitz Covariance Matrices using Overparameterized Gradient Descent
Authors: Daniel Busbib, Ami Wiesel,
Abstract summary: We revisit Toeplitz covariance estimation through the lens of simple descent (GD)<n>We show that when $K = P$, GD may converge to suboptimal solutions.<n>We propose an accelerated GD variant with separate learning rates for amplitudes and frequencies.
Score: 1.7188280334580195
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We consider covariance estimation under Toeplitz structure. Numerous sophisticated optimization methods have been developed to maximize the Gaussian log-likelihood under Toeplitz constraints. In contrast, recent advances in deep learning demonstrate the surprising power of simple gradient descent (GD) applied to overparameterized models. Motivated by this trend, we revisit Toeplitz covariance estimation through the lens of overparameterized GD. We model the $P\times P$ covariance as a sum of $K$ complex sinusoids with learnable parameters and optimize them via GD. We show that when $K = P$, GD may converge to suboptimal solutions. However, mild overparameterization ($K = 2P$ or $4P$) consistently enables global convergence from random initializations. We further propose an accelerated GD variant with separate learning rates for amplitudes and frequencies. When frequencies are fixed and only amplitudes are optimized, we prove that the optimization landscape is asymptotically benign and any stationary point recovers the true covariance. Finally, numerical experiments demonstrate that overparameterized GD can match or exceed the accuracy of state-of-the-art methods in challenging settings, while remaining simple and scalable.

Related papers

Information Hidden in Gradients of Regression with Target Noise [2.8911861322232686]
We show that the gradients alone can reveal the Hessian.<n>We provide non-asymptotic operator-norm guarantees under sub-Gaussian inputs.
arXiv Detail & Related papers (2026-01-26T14:50:16Z)
Closing the Approximation Gap of Partial AUC Optimization: A Tale of Two Formulations [121.39938773554523]
The Area Under the ROC Curve (AUC) is a pivotal evaluation metric in real-world scenarios with both class imbalance and decision constraints.<n>We present two simple instance-wise minimax reformulations to close the approximation gap of PAUC optimization.<n>The resulting algorithms enjoy a linear per-iteration computational complexity w.r.t. the sample size and a convergence rate of $O(-2/3)$ for typical one-way and two-way PAUCs.
arXiv Detail & Related papers (2025-12-01T02:52:33Z)
Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise [60.92029979853314]
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Gradient Descent (SGD) under heavy-tailed noise. Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise. We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
arXiv Detail & Related papers (2024-10-21T22:40:42Z)
Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient [40.22217106270146]
Variance reduction techniques are designed to decrease the sampling variance, thereby accelerating convergence rates of first-order (FO) and zeroth-order (ZO) optimization methods. In composite optimization problems, ZO methods encounter an additional variance called the coordinate-wise variance, which stems from the random estimation. This paper proposes the Zeroth-order Proximal Double Variance Reduction (ZPDVR) method, which utilizes the averaging trick to reduce both sampling and coordinate-wise variances.
arXiv Detail & Related papers (2024-05-28T02:27:53Z)
Stochastic Zeroth order Descent with Structured Directions [10.604744518360464]
We introduce and analyze Structured Zeroth order Descent (SSZD), a finite difference approach that approximates a gradient on a set $lleq d directions, where $d is the dimension of the ambient space. For convex convex we prove almost sure convergence of functions on $O( (d/l) k-c1/2$)$ for every $c1/2$, which is arbitrarily close to the one of the Gradient Descent (SGD) in terms of one number of iterations.
arXiv Detail & Related papers (2022-06-10T14:00:06Z)
Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make gradient descent (SGD) adaptive to (i) the noise. We prove that $T$ iterations of SGD with Nesterov iterations can be near optimal. Compared to other step-size schemes, we demonstrate the effectiveness of a novel novel exponential step-size scheme.
arXiv Detail & Related papers (2021-10-21T19:22:14Z)
Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement. For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts. We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z)
Reducing the Variance of Gaussian Process Hyperparameter Optimization with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication. We prove that preconditioning has an additional benefit that has been previously unexplored. It simultaneously can reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z)
Scalable Variational Gaussian Processes via Harmonic Kernel Decomposition [54.07797071198249]
We introduce a new scalable variational Gaussian process approximation which provides a high fidelity approximation while retaining general applicability. We demonstrate that, on a range of regression and classification problems, our approach can exploit input space symmetries such as translations and reflections. Notably, our approach achieves state-of-the-art results on CIFAR-10 among pure GP models.
arXiv Detail & Related papers (2021-06-10T18:17:57Z)
Laplace Matching for fast Approximate Inference in Generalized Linear Models [27.70274403550477]
We propose an approximate inference framework primarily designed to be computationally cheap while still achieving high approximation quality. The concept, which we call emphLaplace Matching, involves closed-form, approximate, bi-directional transformations between the parameter spaces of exponential families. This effectively turns inference in GLMs into conjugate inference (with small approximation errors)
arXiv Detail & Related papers (2021-05-07T08:25:17Z)
A Variance Controlled Stochastic Method with Biased Estimation for Faster Non-convex Optimization [0.0]
We propose a new technique, em variance controlled gradient (VCSG), to improve the performance of the reduced gradient (SVRG) $lambda$ is introduced in VCSG to avoid over-reducing a variance by SVRG. $mathcalO(min1/epsilon3/2,n1/4/epsilon)$ number of gradient evaluations, which improves the leading gradient complexity.
arXiv Detail & Related papers (2021-02-19T12:22:56Z)
Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error. Rather than fixing the minibatch the step-size at the outset, we propose to allow parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
Private Stochastic Non-Convex Optimization: Adaptive Algorithms and Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for bound non-dimensional optimization. We demonstrate two popular deep learning methods on the empirical advantages over standard gradient methods.
arXiv Detail & Related papers (2020-06-24T06:01:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.