Smoothness Matrices Beat Smoothness Constants: Better Communication
Compression Techniques for Distributed Optimization
- URL: http://arxiv.org/abs/2102.07245v1
- Date: Sun, 14 Feb 2021 20:55:02 GMT
- Title: Smoothness Matrices Beat Smoothness Constants: Better Communication
Compression Techniques for Distributed Optimization
- Authors: Mher Safaryan, Filip Hanzely, Peter Richtárik
- Abstract summary: Large scale distributed optimization has become the default tool for the training of supervised machine learning models.
We propose a novel communication sparsification strategy that can take full advantage of the smoothness matrices associated with local losses.
- Score: 10.592277756185046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large scale distributed optimization has become the default tool for the
training of supervised machine learning models with a large number of
parameters and training data. Recent advancements in the field provide several
mechanisms for speeding up the training, including compressed communication,
variance reduction and acceleration. However, none
of these methods is capable of exploiting the inherently rich data-dependent
smoothness structure of the local losses beyond standard smoothness constants.
In this paper, we argue that when training supervised models, smoothness
matrices -- information-rich generalizations of the ubiquitous smoothness
constants -- can and should be exploited for further dramatic gains, both in
theory and practice. In order to further alleviate the communication burden
inherent in distributed optimization, we propose a novel communication
sparsification strategy that can take full advantage of the smoothness matrices
associated with local losses. To showcase the power of this tool, we describe
how our sparsification technique can be adapted to three distributed
optimization algorithms -- DCGD, DIANA and ADIANA -- yielding significant
savings in terms of communication complexity. The new methods always outperform
the baselines, often dramatically so.
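For context, the standard notion of matrix smoothness used in this line of work (stated here as a restatement consistent with the abstract, not quoted from the paper) replaces the scalar constant $L$ with a symmetric positive semidefinite matrix $\mathbf{L} \succeq 0$ such that, for all $x$ and $y$,
$$ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} \langle \mathbf{L}(y - x), y - x \rangle. $$
The scalar case is recovered with $\mathbf{L} = L I$, which is why smoothness matrices can be viewed as information-rich generalizations of smoothness constants.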
Related papers
- High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions to problems which enable us to learn a communication-efficient distributed logistic regression model.
In our experiments we demonstrate a large improvement in accuracy over distributed algorithms with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z)
- ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step.
Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency.
On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
arXiv Detail & Related papers (2024-06-12T02:57:41Z)
- Optimizing the Optimal Weighted Average: Efficient Distributed Sparse Classification [50.406127962933915]
ACOWA allows an extra round of communication to achieve noticeably better approximation quality with minor runtime increases.
Results show that ACOWA obtains solutions that are more faithful to the empirical risk minimizer and attain substantially higher accuracy than other distributed algorithms.
arXiv Detail & Related papers (2024-06-03T19:43:06Z)
- Federated Optimization with Doubly Regularized Drift Correction [20.30761752651984]
Federated learning is a distributed optimization paradigm that allows training machine learning models across decentralized devices while keeping the data localized.
Previous works proposed various strategies to mitigate drift, yet none have shown uniformly improved communication-computation trade-offs over vanilla gradient descent.
We show that (i) DANE can achieve the desired communication reduction under Hessian similarity constraints, and (ii) we present an extension, DANE+, which supports arbitrary inexact local solvers.
We propose (iii) a novel method, FedRed, which has improved local computational complexity while retaining the same communication complexity as DANE/DANE+.
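For orientation, the classical DANE local step that this line of work builds on (as formulated by Shamir et al.; the exact variant analyzed in the paper above may differ) has every client $i$ solve
$$ x_i^{t+1} = \arg\min_x \; f_i(x) - \langle \nabla f_i(x^t) - \eta \nabla f(x^t), x \rangle + \frac{\mu}{2} \| x - x^t \|^2, $$
where $\eta$ and $\mu$ are step-size and regularization parameters, after which the server averages the $x_i^{t+1}$; the linear correction term is what counteracts client drift.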
arXiv Detail & Related papers (2024-04-12T12:57:43Z)
- MUSIC: Accelerated Convergence for Distributed Optimization With Inexact and Exact Methods [6.800113478497425]
In this paper, we propose an accelerated framework named MUSIC that allows each agent to perform multiple local updates and a single combination in each iteration.
We incorporate inexact and exact distributed optimization methods into this framework, thereby developing two new algorithms that exhibit accelerated linear convergence and high communication efficiency.
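As a rough, generic illustration of the "multiple local updates, single combination per iteration" pattern described above (not the MUSIC algorithm itself; the agent losses, step size, and number of local steps below are placeholders), consider:

```python
import numpy as np

def local_updates_then_combine(x, local_grads, local_steps=5, step_size=0.1):
    """One round: each agent runs several local gradient steps from the same
    starting point x, then the results are combined by simple averaging."""
    iterates = []
    for grad_i in local_grads:          # one gradient callable per agent
        x_i = x.copy()
        for _ in range(local_steps):    # multiple local updates
            x_i = x_i - step_size * grad_i(x_i)
        iterates.append(x_i)
    return np.mean(iterates, axis=0)    # single combination per iteration

# Toy run: two agents with quadratic losses 0.5*||x - b_i||^2, b_1 = 1, b_2 = -1.
grads = [lambda x: x - np.ones(3), lambda x: x + np.ones(3)]
x = np.zeros(3)
for _ in range(20):
    x = local_updates_then_combine(x, grads)
```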
arXiv Detail & Related papers (2024-03-05T02:02:00Z) - CORE: Common Random Reconstruction for Distributed Optimization with
Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up the number of machines.
We propose Common Random Reconstruction (CORE), which can be used to compress the information transmitted between machines.
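One common way to exploit shared randomness for this kind of compression (a sketch of the general idea under the assumption of a jointly known random seed, not necessarily CORE's exact estimator) is to transmit only low-dimensional random projections and let the receiver rebuild the vector from the same random directions:

```python
import numpy as np

def compress(vec, seed, m):
    """Sender: project vec onto m random Gaussian directions drawn from a shared seed."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, vec.size)) / np.sqrt(m)
    return A @ vec                       # only m numbers are communicated

def reconstruct(coeffs, seed, d):
    """Receiver: regenerate the same directions from the seed and map back to R^d.
    With this scaling the reconstruction is an unbiased estimate of the original vector."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((coeffs.size, d)) / np.sqrt(coeffs.size)
    return A.T @ coeffs

g = np.random.randn(1000)
g_hat = reconstruct(compress(g, seed=42, m=100), seed=42, d=1000)
```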
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
- Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach, known as adversarial training (AT), has been shown to improve model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
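For orientation only, a single-machine adversarial training step (a minimal FGSM-style sketch in PyTorch; the paper's contribution is scaling a loop like this to large batches across many machines, which is not shown here) looks roughly like:

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=8 / 255):
    """Craft a one-step (FGSM) perturbation, then update the model on the perturbed batch."""
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()          # inner maximization (one step)
    with torch.no_grad():
        x_adv = (x + eps * x_adv.grad.sign()).clamp(0.0, 1.0)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)              # outer minimization
    loss.backward()
    optimizer.step()
    return loss.item()
```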
arXiv Detail & Related papers (2021-11-01T04:54:55Z)
- Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization [21.81192774458227]
One of the major bottlenecks is the large communication cost between the central server and the local workers.
Our proposed distributed learning framework features an effective gradient compression strategy.
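As a generic illustration of worker-side gradient compression (a top-$k$ sparsifier with error feedback, a standard building block; it is an assumption that this resembles, rather than reproduces, the compressor proposed in that paper):

```python
import numpy as np

def topk(vec, k):
    """Keep the k largest-magnitude entries of vec, zero out the rest."""
    out = np.zeros_like(vec)
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out[idx] = vec[idx]
    return out

def worker_compress(grad, error, k):
    """Compress the error-corrected gradient; carry what was dropped to the next round."""
    corrected = grad + error
    message = topk(corrected, k)         # sent to the server
    return message, corrected - message  # new local error memory
```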
arXiv Detail & Related papers (2021-07-23T13:43:34Z)
- Optimization on manifolds: A symplectic approach [127.54402681305629]
We propose a dissipative extension of Dirac's theory of constrained Hamiltonian systems as a general framework for solving optimization problems.
Our class of (accelerated) algorithms are not only simple and efficient but also applicable to a broad range of contexts.
arXiv Detail & Related papers (2021-06-07T11:30:05Z)
- Smoothness-Aware Quantization Techniques [0.2578242050187029]
We show that block quantization with $n$ blocks outperforms single block quantization.
We also show that our smoothness-aware quantization strategies outperform existing quantization schemes.
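To make the block idea concrete (an illustrative sketch of plain block quantization with scaled signs; the smoothness-aware weighting that the paper adds on top is not shown), a vector can be split into $n$ blocks, each quantized with its own scale:

```python
import numpy as np

def block_quantize(vec, n_blocks):
    """Map each block to its sign pattern times a per-block scale (here ||v||_2 / sqrt(block size))."""
    out = np.zeros_like(vec, dtype=float)
    for block in np.array_split(np.arange(vec.size), n_blocks):
        v = vec[block]
        if v.size:
            scale = np.linalg.norm(v) / np.sqrt(v.size)
            out[block] = scale * np.sign(v)
    return out  # n_blocks = 1 recovers single-block (standard) quantization
```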
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.