Smoothness Matrices Beat Smoothness Constants: Better Communication
Compression Techniques for Distributed Optimization
- URL: http://arxiv.org/abs/2102.07245v1
- Date: Sun, 14 Feb 2021 20:55:02 GMT
- Title: Smoothness Matrices Beat Smoothness Constants: Better Communication
Compression Techniques for Distributed Optimization
- Authors: Mher Safaryan, Filip Hanzely, Peter Richtárik
- Abstract summary: Large scale distributed optimization has become the default tool for the training of supervised machine learning models.
We propose a novel communication sparsification strategy that can take full advantage of the smoothness matrices associated with local losses.
- Score: 10.592277756185046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large scale distributed optimization has become the default tool for the
training of supervised machine learning models with a large number of
parameters and training data. Recent advancements in the field provide several
mechanisms for speeding up the training, including compressed communication,
variance reduction and acceleration. However, none
of these methods is capable of exploiting the inherently rich data-dependent
smoothness structure of the local losses beyond standard smoothness constants.
In this paper, we argue that when training supervised models, smoothness
matrices -- information-rich generalizations of the ubiquitous smoothness
constants -- can and should be exploited for further dramatic gains, both in
theory and practice. In order to further alleviate the communication burden
inherent in distributed optimization, we propose a novel communication
sparsification strategy that can take full advantage of the smoothness matrices
associated with local losses. To showcase the power of this tool, we describe
how our sparsification technique can be adapted to three distributed
optimization algorithms -- DCGD, DIANA and ADIANA -- yielding significant
savings in terms of communication complexity. The new methods always outperform
the baselines, often dramatically so.
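For context, the standard notion of matrix smoothness used in this line of work (stated here as a restatement consistent with the abstract, not quoted from the paper) replaces the scalar constant $L$ with a symmetric positive semidefinite matrix $\mathbf{L} \succeq 0$ such that, for all $x$ and $y$,
$$ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} \langle \mathbf{L}(y - x), y - x \rangle. $$
The scalar case is recovered with $\mathbf{L} = L I$, which is why smoothness matrices can be viewed as information-rich generalizations of smoothness constants.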
Related papers
- High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions to problems which enable us to learn a communication-efficient distributed logistic regression model.
In our experiments we demonstrate a large improvement in accuracy over distributed algorithms with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z)
- ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step.
Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency.
On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
arXiv Detail & Related papers (2024-06-12T02:57:41Z)
- Optimizing the Optimal Weighted Average: Efficient Distributed Sparse Classification [50.406127962933915]
ACOWA allows an extra round of communication to achieve noticeably better approximation quality with minor runtime increases.
Results show that ACOWA obtains solutions that are more faithful to the empirical risk minimizer and attain substantially higher accuracy than other distributed algorithms.
arXiv Detail & Related papers (2024-06-03T19:43:06Z)
- Federated Optimization with Doubly Regularized Drift Correction [20.30761752651984]
Federated learning is a distributed optimization paradigm that allows training machine learning models across decentralized devices while keeping the data localized.
Previous works proposed various strategies to mitigate drift, yet none have shown uniformly improved communication-computation trade-offs over vanilla gradient descent.
We show that (i) DANE can achieve the desired communication reduction under Hessian similarity constraints, and (ii) we present an extension, DANE+, which supports arbitrary inexact local solvers.
We propose (iii) a novel method, FedRed, which has improved local computational complexity while retaining the same communication complexity as DANE/DANE+.
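For orientation, the classical DANE local step that this line of work builds on (as formulated by Shamir et al.; the exact variant analyzed in the paper above may differ) has every client $i$ solve
$$ x_i^{t+1} = \arg\min_x \; f_i(x) - \langle \nabla f_i(x^t) - \eta \nabla f(x^t), x \rangle + \frac{\mu}{2} \| x - x^t \|^2, $$
where $\eta$ and $\mu$ are step-size and regularization parameters, after which the server averages the $x_i^{t+1}$; the linear correction term is what counteracts client drift.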
arXiv Detail & Related papers (2024-04-12T12:57:43Z)
- MUSIC: Accelerated Convergence for Distributed Optimization With Inexact and Exact Methods [6.800113478497425]
In this paper, we propose an accelerated framework named MUSIC that allows each agent to perform multiple local updates and a single combination in each iteration.
We incorporate inexact and exact distributed optimization methods into this framework, thereby developing two new algorithms that exhibit accelerated linear convergence and high communication efficiency.
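As a rough, generic illustration of the "multiple local updates, single combination per iteration" pattern described above (not the MUSIC algorithm itself; the agent losses, step size, and number of local steps below are placeholders), consider:

```python
import numpy as np

def local_updates_then_combine(x, local_grads, local_steps=5, step_size=0.1):
    """One round: each agent runs several local gradient steps from the same
    starting point x, then the results are combined by simple averaging."""
    iterates = []
    for grad_i in local_grads:          # one gradient callable per agent
        x_i = x.copy()
        for _ in range(local_steps):    # multiple local updates
            x_i = x_i - step_size * grad_i(x_i)
        iterates.append(x_i)
    return np.mean(iterates, axis=0)    # single combination per iteration

# Toy run: two agents with quadratic losses 0.5*||x - b_i||^2, b_1 = 1, b_2 = -1.
grads = [lambda x: x - np.ones(3), lambda x: x + np.ones(3)]
x = np.zeros(3)
for _ in range(20):
    x = local_updates_then_combine(x, grads)
```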
arXiv Detail & Related papers (2024-03-05T02:02:00Z) - CORE: Common Random Reconstruction for Distributed Optimization with
Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up the number of machines.
We propose Common Random Reconstruction (CORE), which can be used to compress the information transmitted between machines.
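One common way to exploit shared randomness for this kind of compression (a sketch of the general idea under the assumption of a jointly known random seed, not necessarily CORE's exact estimator) is to transmit only low-dimensional random projections and let the receiver rebuild the vector from the same random directions:

```python
import numpy as np

def compress(vec, seed, m):
    """Sender: project vec onto m random Gaussian directions drawn from a shared seed."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, vec.size)) / np.sqrt(m)
    return A @ vec                       # only m numbers are communicated

def reconstruct(coeffs, seed, d):
    """Receiver: regenerate the same directions from the seed and map back to R^d.
    With this scaling the reconstruction is an unbiased estimate of the original vector."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((coeffs.size, d)) / np.sqrt(coeffs.size)
    return A.T @ coeffs

g = np.random.randn(1000)
g_hat = reconstruct(compress(g, seed=42, m=100), seed=42, d=1000)
```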
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
- Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach, known as adversarial training (AT), has been shown to improve model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
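For orientation only, a single-machine adversarial training step (a minimal FGSM-style sketch in PyTorch; the paper's contribution is scaling a loop like this to large batches across many machines, which is not shown here) looks roughly like:

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=8 / 255):
    """Craft a one-step (FGSM) perturbation, then update the model on the perturbed batch."""
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()          # inner maximization (one step)
    with torch.no_grad():
        x_adv = (x + eps * x_adv.grad.sign()).clamp(0.0, 1.0)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)              # outer minimization
    loss.backward()
    optimizer.step()
    return loss.item()
```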
arXiv Detail & Related papers (2021-11-01T04:54:55Z)
- Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization [21.81192774458227]
One of the major bottlenecks is the large communication cost between the central server and the local workers.
Our proposed distributed learning framework features an effective gradient compression strategy.
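As a generic illustration of worker-side gradient compression (a top-$k$ sparsifier with error feedback, a standard building block; it is an assumption that this resembles, rather than reproduces, the compressor proposed in that paper):

```python
import numpy as np

def topk(vec, k):
    """Keep the k largest-magnitude entries of vec, zero out the rest."""
    out = np.zeros_like(vec)
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out[idx] = vec[idx]
    return out

def worker_compress(grad, error, k):
    """Compress the error-corrected gradient; carry what was dropped to the next round."""
    corrected = grad + error
    message = topk(corrected, k)         # sent to the server
    return message, corrected - message  # new local error memory
```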
arXiv Detail & Related papers (2021-07-23T13:43:34Z)
- Optimization on manifolds: A symplectic approach [127.54402681305629]
We propose a dissipative extension of Dirac's theory of constrained Hamiltonian systems as a general framework for solving optimization problems.
Our class of (accelerated) algorithms are not only simple and efficient but also applicable to a broad range of contexts.
arXiv Detail & Related papers (2021-06-07T11:30:05Z)
- Smoothness-Aware Quantization Techniques [0.2578242050187029]
We show that block quantization with $n$ blocks outperforms single block quantization.
We also show that our smoothness-aware quantization strategies outperform existing quantization schemes.
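To make the block idea concrete (an illustrative sketch of plain block quantization with scaled signs; the smoothness-aware weighting that the paper adds on top is not shown), a vector can be split into $n$ blocks, each quantized with its own scale:

```python
import numpy as np

def block_quantize(vec, n_blocks):
    """Map each block to its sign pattern times a per-block scale (here ||v||_2 / sqrt(block size))."""
    out = np.zeros_like(vec, dtype=float)
    for block in np.array_split(np.arange(vec.size), n_blocks):
        v = vec[block]
        if v.size:
            scale = np.linalg.norm(v) / np.sqrt(v.size)
            out[block] = scale * np.sign(v)
    return out  # n_blocks = 1 recovers single-block (standard) quantization
```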
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.