Smoothness Matrices Beat Smoothness Constants: Better Communication
Compression Techniques for Distributed Optimization
- URL: http://arxiv.org/abs/2102.07245v1
- Date: Sun, 14 Feb 2021 20:55:02 GMT
- Title: Smoothness Matrices Beat Smoothness Constants: Better Communication
Compression Techniques for Distributed Optimization
- Authors: Mher Safaryan, Filip Hanzely, Peter Richtárik
- Abstract summary: Large scale distributed optimization has become the default tool for the training of supervised machine learning models.
We propose a novel communication sparsification strategy that can take full advantage of the smoothness matrices associated with local losses.
- Score: 10.592277756185046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large scale distributed optimization has become the default tool for the
training of supervised machine learning models with a large number of
parameters and training data. Recent advancements in the field provide several
mechanisms for speeding up the training, including {\em compressed
communication}, {\em variance reduction} and {\em acceleration}. However, none
of these methods is capable of exploiting the inherently rich data-dependent
smoothness structure of the local losses beyond standard smoothness constants.
In this paper, we argue that when training supervised models, {\em smoothness
matrices} -- information-rich generalizations of the ubiquitous smoothness
constants -- can and should be exploited for further dramatic gains, both in
theory and practice. In order to further alleviate the communication burden
inherent in distributed optimization, we propose a novel communication
sparsification strategy that can take full advantage of the smoothness matrices
associated with local losses. To showcase the power of this tool, we describe
how our sparsification technique can be adapted to three distributed
optimization algorithms -- DCGD, DIANA and ADIANA -- yielding significant
savings in terms of communication complexity. The new methods always outperform
the baselines, often dramatically so.
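For reference, here is the distinction the abstract leans on, stated as the standard definitions (a minimal sketch, not quoted verbatim from the paper). A differentiable function $f$ is $L$-smooth for a scalar constant $L > 0$ if
\[
f(x+h) \le f(x) + \langle \nabla f(x), h \rangle + \tfrac{L}{2}\,\|h\|^2,
\]
and $\mathbf{L}$-smooth for a symmetric positive semidefinite matrix $\mathbf{L}$ (the smoothness matrix) if
\[
f(x+h) \le f(x) + \langle \nabla f(x), h \rangle + \tfrac{1}{2}\,\langle \mathbf{L} h, h \rangle.
\]
The scalar case is recovered with $\mathbf{L} = L\mathbf{I}$, so the matrix version is never looser and additionally records which coordinates (or directions) of each local loss are flat and therefore cheap to move along.
To make the sparsification idea concrete, the following NumPy sketch runs a DCGD-style loop in which each worker keeps only the coordinates its (diagonal) smoothness information marks as most valuable. The scoring rule, the diagonal-only use of the smoothness matrix, and all names here (smoothness_aware_topk, the toy quadratic losses) are illustrative assumptions, not the operator proposed in the paper.

    import numpy as np

    def smoothness_aware_topk(grad, L_diag, k):
        # Hypothetical sparsifier: rank coordinates by g_i^2 / L_ii, i.e.
        # gradient energy relative to per-coordinate curvature, and keep
        # the k highest-scoring entries. Diagonal-only illustration.
        score = grad ** 2 / np.maximum(L_diag, 1e-12)
        keep = np.argsort(score)[-k:]
        sparse = np.zeros_like(grad)
        sparse[keep] = grad[keep]
        return sparse

    # Assumed toy setup: n workers with quadratic local losses
    # f_i(x) = 0.5 * x^T A_i x - b_i^T x, whose smoothness matrix is A_i.
    rng = np.random.default_rng(0)
    d, n, k, lr = 20, 4, 5, 0.05
    A = [np.diag(rng.uniform(0.1, 10.0, size=d)) for _ in range(n)]
    b = [rng.standard_normal(d) for _ in range(n)]
    x = np.zeros(d)
    for _ in range(300):
        # Each worker communicates a sparsified local gradient; the server
        # averages the messages and takes a plain gradient step (DCGD-style).
        msgs = [smoothness_aware_topk(A[i] @ x - b[i], np.diag(A[i]), k)
                for i in range(n)]
        x -= lr * np.mean(msgs, axis=0)

The only point of the sketch is that a per-coordinate curvature estimate changes which entries are worth communicating; the paper develops this idea rigorously and extends it beyond DCGD to DIANA and ADIANA.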
Related papers
- Efficient Distributed Optimization under Heavy-Tailed Noise [32.96984712007111]
TailOPT is designed to address heavy-tailed noise with potentially unbounded gradient variance and local updates.
$Bi2Clip$ performs coordinate-wise clipping at both the inner and outer optimizers, achieving adaptive-like performance.
$Bi2Clip$ demonstrates superior performance on several language tasks and models, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2025-02-06T15:47:18Z) - Regularized second-order optimization of tensor-network Born machines [2.8834278113855896]
Tensor-network Born machines (TNBMs) are quantum-inspired generative models for learning data distributions.
We present an improved second-order optimization technique for TNBM training, which significantly enhances convergence rates and the quality of the optimized model.
arXiv Detail & Related papers (2025-01-30T19:00:04Z) - High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions to these problems that enable us to learn a communication-efficient distributed logistic regression model.
In our experiments we demonstrate a large improvement in accuracy over distributed algorithms, with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z) - ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step.
Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency.
On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
arXiv Detail & Related papers (2024-06-12T02:57:41Z) - Optimizing the Optimal Weighted Average: Efficient Distributed Sparse Classification [50.406127962933915]
ACOWA allows an extra round of communication to achieve noticeably better approximation quality with minor runtime increases.
Results show that ACOWA obtains solutions that are more faithful to the empirical risk minimizer and attain substantially higher accuracy than other distributed algorithms.
arXiv Detail & Related papers (2024-06-03T19:43:06Z) - MUSIC: Accelerated Convergence for Distributed Optimization With Inexact
and Exact Methods [6.800113478497425]
In this paper, we propose an accelerated framework named MUSIC, which allows each agent to perform multiple local updates and a single combination in each iteration.
We equip inexact and exact distributed optimization methods into this framework, thereby developing two new algorithms that exhibit accelerated linear convergence and high communication efficiency.
arXiv Detail & Related papers (2024-03-05T02:02:00Z) - CORE: Common Random Reconstruction for Distributed Optimization with
Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up the number of machines.
We propose Common Random Reconstruction (CORE), which can be used to compress the information transmitted between machines.
arXiv Detail & Related papers (2023-09-23T08:45:27Z) - Distributed Adversarial Training to Robustify Deep Neural Networks at
Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach, known as adversarial training (AT), has been shown to improve model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
arXiv Detail & Related papers (2022-06-13T15:39:43Z) - Communication-Compressed Adaptive Gradient Method for Distributed
Nonconvex Optimization [21.81192774458227]
One of the major bottlenecks is the large communication cost between the central server and the local workers.
Our proposed distributed learning framework features an effective gradient compression strategy.
arXiv Detail & Related papers (2021-11-01T04:54:55Z) - Optimization on manifolds: A symplectic approach [127.54402681305629]
We propose a dissipative extension of Dirac's theory of constrained Hamiltonian systems as a general framework for solving optimization problems.
Our class of (accelerated) algorithms is not only simple and efficient but also applicable to a broad range of contexts.
arXiv Detail & Related papers (2021-07-23T13:43:34Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.