Single-Timescale Multi-Sequence Stochastic Approximation Without Fixed Point Smoothness: Theories and Applications
- URL: http://arxiv.org/abs/2410.13743v1
- Date: Thu, 17 Oct 2024 16:39:53 GMT
- Title: Single-Timescale Multi-Sequence Stochastic Approximation Without Fixed Point Smoothness: Theories and Applications
- Authors: Yue Huang, Zhaoxian Wu, Shiqian Ma, Qing Ling
- Abstract summary: Stochastic approximation (SA) that involves multiple coupled sequences, known as multiple-sequence SA (MSSA), finds diverse applications in the fields of signal processing and machine learning.
This paper establishes tighter single-timescale analysis for MSSA, without assuming smoothness of the fixed points.
Applying these theoretical findings to bilevel optimization and communication-efficient distributed learning offers relaxed assumptions and/or simpler algorithms with performance guarantees.
- Score: 33.21958331056391
- License:
- Abstract: Stochastic approximation (SA) that involves multiple coupled sequences, known as multiple-sequence SA (MSSA), finds diverse applications in the fields of signal processing and machine learning. However, existing theoretical understandings of MSSA are limited: the multi-timescale analysis implies a slow convergence rate, whereas the single-timescale analysis relies on a stringent fixed point smoothness assumption. This paper establishes tighter single-timescale analysis for MSSA, without assuming smoothness of the fixed points. Our theoretical findings reveal that, when all involved operators are strongly monotone, MSSA converges at a rate of $\tilde{\mathcal{O}}(K^{-1})$, where $K$ denotes the total number of iterations. In addition, when all involved operators are strongly monotone except for the main one, MSSA converges at a rate of $\mathcal{O}(K^{-\frac{1}{2}})$. These theoretical findings align with those established for single-sequence SA. Applying these theoretical findings to bilevel optimization and communication-efficient distributed learning offers relaxed assumptions and/or simpler algorithms with performance guarantees, as validated by numerical experiments.
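To make the setting concrete, the sketch below (a minimal illustration with assumed toy operators, not the paper's algorithm) runs a single-timescale MSSA iteration with two coupled sequences: each sequence takes a noisy operator step, and both step sizes decay at the same rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy coupled operators (scalar, strongly monotone), purely illustrative:
#   main sequence      x: F(x, y) = x + y - 1
#   auxiliary sequence y: G(x, y) = y - 0.5 * x   (fixed point y*(x) = 0.5 * x)
# The root of the effective operator F(x, y*(x)) = 1.5 * x - 1 is x* = 2/3.
def F(x, y):
    return x + y - 1.0

def G(x, y):
    return y - 0.5 * x

x, y = 0.0, 0.0
K = 20_000
for k in range(1, K + 1):
    step = 1.0 / (k + 10)          # single timescale: both steps decay at the same O(1/k) rate
    noise_x, noise_y = rng.normal(scale=0.1, size=2)
    x_new = x - step * (F(x, y) + noise_x)   # main-sequence update with a noisy operator sample
    y_new = y - step * (G(x, y) + noise_y)   # auxiliary update toward its x-dependent fixed point
    x, y = x_new, y_new

print(f"x ~ {x:.4f} (target 2/3), y ~ {y:.4f} (target 1/3)")
```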
Related papers
- MGDA Converges under Generalized Smoothness, Provably [27.87166415148172]
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning.
Recent works provide some effective algorithms with theoretical analysis but they are limited by the standard $L$-smooth or bounded-gradient assumptions.
We study a more general and realistic class of generalized $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm.
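For context, MGDA computes a common descent direction as the minimum-norm point in the convex hull of the per-task gradients; the snippet below is an illustrative two-task version using the closed-form weight (a sketch, not taken from the paper).

```python
import numpy as np

def mgda_two_task_direction(g1: np.ndarray, g2: np.ndarray) -> np.ndarray:
    """Min-norm point in the convex hull of {g1, g2} (closed form for two tasks)."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom < 1e-12:                 # gradients (almost) identical
        gamma = 0.5
    else:
        gamma = float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    return gamma * g1 + (1.0 - gamma) * g2

# Descending along -d decreases both task losses to first order whenever d != 0.
g1 = np.array([1.0, 0.0])
g2 = np.array([0.5, 1.0])
d = mgda_two_task_direction(g1, g2)
print(d, d @ g1, d @ g2)   # both inner products are nonnegative
```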
arXiv Detail & Related papers (2024-05-29T18:36:59Z)
- DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common stochastic approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose DASA, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z)
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
- Metric Entropy-Free Sample Complexity Bounds for Sample Average Approximation in Convex Stochastic Programming [0.6906005491572401]
This paper studies sample average approximation (SAA) in solving convex or strongly convex stochastic programming (SP) problems.
We show -- perhaps for the first time -- that SAA's sample complexity can be completely free from any quantification of metric entropy.
arXiv Detail & Related papers (2024-01-01T04:35:53Z)
- A Convergence Theory for Federated Average: Beyond Smoothness [28.074273047592065]
Federated learning enables a large number of edge computing devices to jointly learn a model without data sharing.
As a leading algorithm in this setting, Federated Averaging (FedAvg), which runs Stochastic Gradient Descent (SGD) in parallel on local devices, has been widely used.
This paper provides a theoretical convergence study on Federated Learning.
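For reference, here is a minimal FedAvg sketch on synthetic least-squares data (illustrative only; the device data, step sizes, and round counts are assumptions): each device runs a few local SGD steps, then the server averages the resulting models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic local least-squares data for a handful of devices (illustrative only).
n_devices, n_samples, dim = 5, 50, 3
w_true = rng.normal(size=dim)
data = []
for _ in range(n_devices):
    A = rng.normal(size=(n_samples, dim))
    b = A @ w_true + 0.1 * rng.normal(size=n_samples)
    data.append((A, b))

def local_sgd(w, A, b, steps=10, lr=0.01):
    """A few local SGD steps on the device's least-squares loss."""
    for _ in range(steps):
        i = rng.integers(len(b))
        grad = (A[i] @ w - b[i]) * A[i]      # gradient of 0.5 * (a_i^T w - b_i)^2
        w = w - lr * grad
    return w

w = np.zeros(dim)
for round_ in range(200):
    local_models = [local_sgd(w.copy(), A, b) for A, b in data]  # parallel local SGD
    w = np.mean(local_models, axis=0)                            # server averages the models

print("recovered:", np.round(w, 2), "true:", np.round(w_true, 2))
```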
arXiv Detail & Related papers (2022-11-03T04:50:49Z)
- A Single-Timescale Analysis For Stochastic Approximation With Multiple Coupled Sequences [21.50207156675195]
We study the finite-time convergence of nonlinear stochastic approximation with multiple coupled sequences.
At the heart of our analysis is the smoothness property of the fixed points in multi-sequence SA that holds in many applications.
arXiv Detail & Related papers (2022-06-21T14:13:20Z)
- Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a stochastic conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z)
- Variance-Reduced Splitting Schemes for Monotone Stochastic Generalized Equations [0.0]
We consider monotone inclusion problems where the operators may be expectation-valued.
A direct application of splitting schemes is complicated by the need to resolve problems with expectation-valued maps at each step.
We propose an avenue for addressing uncertainty in the mapping: a variance-reduced modified forward-backward splitting scheme.
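To illustrate the flavor of such a scheme (a simplified sketch, not the authors' exact algorithm), consider the inclusion $0 \in \nabla f(x) + \partial g(x)$ with an expectation-valued $\nabla f$: the forward step uses a growing mini-batch gradient estimate (the variance-reduction device), and the backward step applies the proximal operator of $g$.

```python
import numpy as np

rng = np.random.default_rng(0)

c = np.array([1.0, -0.3, 0.0])     # grad f(x) = E[x - (c + xi)] = x - c  (toy choice)
lam, gamma = 0.2, 0.5              # l1 weight and step size

def soft_threshold(v, t):
    """Proximal (backward) step for g(x) = t * ||x||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def grad_estimate(x, batch):
    """Mini-batch estimate of grad f(x); the batch grows over iterations (variance reduction)."""
    xi = rng.normal(scale=1.0, size=(batch, x.size))
    return np.mean(x - (c + xi), axis=0)

x = np.zeros(3)
for k in range(1, 500):
    g_hat = grad_estimate(x, batch=k)                    # forward step with a reduced-variance estimate
    x = soft_threshold(x - gamma * g_hat, gamma * lam)   # backward (proximal) step

print(x)   # approximately the soft-thresholded c: [0.8, -0.1, 0.0]
```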
arXiv Detail & Related papers (2020-08-26T02:33:27Z)
- Convergence of Meta-Learning with Task-Specific Adaptation over Partial Parameters [152.03852111442114]
Although model-agnostic meta-learning (MAML) is a very successful algorithm in meta-learning practice, it can have high computational complexity.
Our paper shows that such complexity can significantly affect the overall convergence performance of ANIL (Almost No Inner Loop), a variant that adapts only a subset of the parameters.
arXiv Detail & Related papers (2020-06-16T19:57:48Z)
- Fast Objective & Duality Gap Convergence for Non-Convex Strongly-Concave Min-Max Problems with PL Condition [52.08417569774822]
This paper focuses on stochastic methods for solving smooth non-convex strongly-concave min-max problems, which have received increasing attention due to their potential applications in deep learning (e.g., deep AUC maximization).
arXiv Detail & Related papers (2020-06-12T00:32:21Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
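For context, asynchronous Q-learning updates one (state, action) entry per step along a Markovian trajectory; the sketch below runs it on a tiny random MDP (illustrative only; the MDP, behavior policy, and step-size schedule are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny random MDP (illustrative only): nS states, nA actions, deterministic rewards.
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over next states
R = rng.uniform(size=(nS, nA))

Q = np.zeros((nS, nA))
s = 0
for t in range(1, 200_000):
    a = rng.integers(nA)                        # behavior policy: uniformly random actions
    s_next = rng.choice(nS, p=P[s, a])
    eta = 1.0 / (1.0 + 0.001 * t)               # diminishing step size
    # Asynchronous update: only the visited (s, a) entry changes at this step.
    Q[s, a] += eta * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.round(Q, 2))   # approximate optimal Q-function of the random MDP
```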
arXiv Detail & Related papers (2020-06-04T17:51:00Z)