Permutation Compressors for Provably Faster Distributed Nonconvex
Optimization
- URL: http://arxiv.org/abs/2110.03300v1
- Date: Thu, 7 Oct 2021 09:38:15 GMT
- Title: Permutation Compressors for Provably Faster Distributed Nonconvex
Optimization
- Authors: Rafał Szlendak, Alexander Tyurin, and Peter Richtárik
- Abstract summary: We show that the MARINA method of Gorbunov et al. (2021) can be considered the state-of-the-art method in terms of theoretical communication complexity. We extend the theory of MARINA to support potentially correlated compressors, taking the method beyond the classical independent compressors setting.
- Score: 68.8204255655161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the MARINA method of Gorbunov et al. (2021) -- the current
state-of-the-art distributed non-convex optimization method in terms of
theoretical communication complexity. The theoretical superiority of this
method can be largely attributed to two sources: the use of a carefully
engineered biased stochastic gradient estimator, which leads to a reduction in
the number of communication rounds, and the reliance on independent stochastic
communication compression operators, which leads to a reduction in the number
of transmitted bits within each communication round. In this paper we i) extend
the theory of MARINA to support a much wider class of potentially correlated
compressors, extending the reach of the method beyond the classical independent
compressors setting, ii) show that a new quantity, for which we coin the name
Hessian variance, allows us to significantly refine the original analysis of
MARINA without any additional assumptions, and iii) identify a special class of
correlated compressors based on the idea of random permutations, for which we
coin the term Perm$K$, whose use leads to an $O(\sqrt{n})$ (resp.
$O(1 + d/\sqrt{n})$) improvement in the theoretical communication complexity of
MARINA in the low Hessian variance regime when $d \geq n$ (resp. $d \leq n$),
where $n$ is the number of workers and $d$ is the number of parameters
describing the model we are learning. We corroborate our theoretical results
with carefully engineered synthetic experiments minimizing the average of
nonconvex quadratics, and with autoencoder training on the MNIST dataset.
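To make the Perm$K$ idea concrete, here is a minimal numpy sketch (not the authors' implementation) of the $d \geq n$ case, under two assumptions: $d$ is divisible by $n$, and all workers share a random seed so they draw the same permutation.

```python
import numpy as np

def perm_k(grads, seed):
    """Sketch of PermK-style correlated compression for n workers.

    grads: list of n gradient vectors, each of dimension d (d divisible by n).
    seed:  shared seed, so every worker draws the SAME permutation --
           the compressors are correlated, not independent.
    """
    n = len(grads)
    d = grads[0].shape[0]
    assert d % n == 0, "this sketch assumes d is divisible by n"
    perm = np.random.default_rng(seed).permutation(d)
    block = d // n
    out = []
    for i, g in enumerate(grads):
        idx = perm[i * block:(i + 1) * block]  # worker i's disjoint coordinate block
        c = np.zeros(d)
        c[idx] = n * g[idx]  # scale by n: each coordinate is kept w.p. 1/n, so E[c] = g
        out.append(c)
    return out  # np.mean(out, axis=0) touches every coordinate exactly once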
Related papers
- Theoretical Convergence Guarantees for Variational Autoencoders [2.8167997311962942]
Variational Autoencoders (VAE) are popular generative models used to sample from complex data distributions.
This paper provides non-asymptotic convergence guarantees for VAEs trained using both the Gradient Descent and Adam algorithms.
Our theoretical analysis applies to both Linear VAE and Deep Gaussian VAE, as well as several VAE variants, including $\beta$-VAE and IWAE.
arXiv Detail & Related papers (2024-10-22T07:12:38Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Improving the Worst-Case Bidirectional Communication Complexity for Nonconvex Distributed Optimization under Function Similarity [92.1840862558718]
We introduce MARINA-P, a novel method for downlink compression, employing a collection of correlated compressors.
We show that MARINA-P with permutation compressors can achieve a server-to-worker communication complexity improving with the number of workers.
We introduce M3, a method combining MARINA-P with uplink compression and a momentum step, achieving bidirectional compression with provable improvements in total communication complexity as the number of workers increases.
arXiv Detail & Related papers (2024-02-09T13:58:33Z)
- Correlated Quantization for Faster Nonconvex Distributed Optimization [1.2744523252873352]
Quantization (Alistarh et al., 2017) is an important (stochastic) compression technique that reduces the volume of transmitted bits during each communication round; a minimal sketch of plain stochastic quantization appears after this list.
We analyze the forefront distributed nonconvex optimization algorithm MARINA (Gorbunov et al., 2022).
We expand the theoretical framework of MARINA to accommodate a substantially broader range of potentially correlated compressors.
arXiv Detail & Related papers (2024-01-10T19:29:17Z)
- Sample Complexity of Variance-reduced Distributionally Robust Q-learning [17.96094201655567]
This paper presents two novel model-free algorithms, namely the distributionally robust Q-learning and its variance-reduced counterpart, that can effectively learn a robust policy despite distributional shifts.
A series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts.
arXiv Detail & Related papers (2023-05-28T19:40:46Z)
- DASHA: Distributed Nonconvex Optimization with Communication Compression, Optimal Oracle Complexity, and No Client Synchronization [77.34726150561087]
We develop and analyze DASHA: a new family of methods for nonconvex distributed optimization problems.
Unlike MARINA, the new methods DASHA and DASHA-MVR send compressed vectors only and never synchronize the nodes, which makes them more practical for federated learning.
arXiv Detail & Related papers (2022-02-02T20:10:40Z)
- Selective Multiple Power Iteration: from Tensor PCA to gradient-based exploration of landscapes [7.648784748888189]
Selective Multiple Power Iterations (SMPI) is a new algorithm for the Tensor PCA problem, which consists in recovering a spike planted in a noisy tensor.
We show that these unexpected performances are due to a powerful mechanism in which the noise plays a key role for the signal recovery.
These results may have strong impact on both practical and theoretical applications.
arXiv Detail & Related papers (2021-12-23T01:46:41Z)
- Quantum Communication Complexity of Distribution Testing [114.31181206328276]
Two players each receive $t$ samples from one distribution over $[n]$.
The goal is to decide whether their two distributions are equal, or are $\varepsilon$-far apart.
We show that the quantum communication complexity of this problem is $\tilde{O}(n/(t\varepsilon^2))$ qubits when the distributions have low $l_2$-norm.
arXiv Detail & Related papers (2020-06-26T09:05:58Z)
- On Biased Compression for Distributed Learning [55.89300593805943]
We show for the first time that biased compressors can lead to linear convergence rates in both the single node and distributed settings.
We propose several new biased compressors with promising theoretical guarantees and practical performance; a sketch of the canonical Top-$K$ biased compressor appears after this list.
arXiv Detail & Related papers (2020-02-27T19:52:24Z)
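For the correlated quantization entry above, here is a minimal sketch of the standard independent stochastic quantizer in the spirit of Alistarh et al. (2017), unbiased by construction. The cited paper's idea of correlating this randomness across workers is not implemented here.

```python
import numpy as np

def stochastic_quantize(x, levels=4, rng=None):
    """Unbiased stochastic quantization (QSGD-style sketch).

    Each coordinate is snapped to one of levels+1 grid points in [0, 1]
    (times norm and sign), rounding up with probability equal to the
    fractional part, so E[Q(x)] = x.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    scaled = np.abs(x) / norm * levels           # values in [0, levels]
    low = np.floor(scaled)
    up = rng.random(x.shape) < (scaled - low)    # unbiased stochastic rounding
    return norm * np.sign(x) * (low + up) / levels
```

Transmitting the norm plus one small integer per coordinate is far cheaper than sending full-precision floats, which is the bit-reduction the entry refers to.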
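For the biased compression entry, Top-$K$ sparsification is the canonical example of a biased compressor; a short illustrative sketch:

```python
import numpy as np

def top_k(x, k):
    """Top-k sparsifier: keeps the k largest-magnitude coordinates.

    E[top_k(x)] != x in general, so this compressor is biased; the entry
    above shows such compressors can still yield linear convergence.
    """
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_j|
    out[idx] = x[idx]
    return out
```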