Communication-Efficient Adam-Type Algorithms for Distributed Data Mining
- URL: http://arxiv.org/abs/2210.07454v1
- Date: Fri, 14 Oct 2022 01:42:05 GMT
- Title: Communication-Efficient Adam-Type Algorithms for Distributed Data Mining
- Authors: Wenhan Xian, Feihu Huang, Heng Huang
- Abstract summary: We propose a class of novel distributed Adam-type algorithms (\emph{i.e.}, SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
- Score: 93.50424502011626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed data mining is an emerging research topic that aims to effectively and
efficiently address hard data mining tasks over big data, which are
partitioned across and processed on different worker nodes instead of one centralized
server. Nevertheless, distributed learning methods often suffer from the
communication bottleneck when the network bandwidth is limited or the size of
the model is large. To solve this critical issue, many gradient compression methods
have been proposed recently to reduce the communication cost for multiple
optimization algorithms. However, current applications of gradient
compression to adaptive gradient methods, which are widely adopted because of their
excellent performance in training DNNs, do not achieve the same ideal compression
rate or convergence rate as Sketched-SGD. To address this limitation, in this
paper, we propose a class of novel distributed Adam-type algorithms
(\emph{i.e.}, SketchedAMSGrad) utilizing sketching, which is a promising
compression technique that reduces the communication cost from $O(d)$ to
$O(\log(d))$ where $d$ is the parameter dimension. In our theoretical analysis,
we prove that our new algorithm achieves a fast convergence rate of
$O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of
$O(k \log(d))$ at each iteration. Compared with single-machine AMSGrad, our
algorithm can achieve a linear speedup with respect to the number of workers
$n$. The experimental results on training various DNNs in the distributed paradigm
validate the efficiency of our algorithms.
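To make the sketching idea above concrete, here is a minimal Python/NumPy sketch of the communication pattern the abstract describes, assuming a Count Sketch compressor (the standard choice in Sketched-SGD) and hypothetical names (`CountSketch`, `amsgrad_step`, `rows`, `cols`, `k`); it illustrates the mechanism, not the paper's actual SketchedAMSGrad implementation.

```python
import numpy as np

class CountSketch:
    """Hash-based linear sketch: workers send a small (rows x cols) table
    instead of the full d-dimensional gradient."""

    def __init__(self, dim, rows=5, cols=256, seed=0):
        rng = np.random.default_rng(seed)      # shared seed -> identical hashes on all nodes
        self.dim, self.rows, self.cols = dim, rows, cols
        self.bucket = rng.integers(0, cols, size=(rows, dim))  # bucket hash per row
        self.sign = rng.choice([-1.0, 1.0], size=(rows, dim))  # sign hash per row

    def compress(self, g):
        """What a worker communicates: a rows x cols table, O(rows * cols) floats."""
        table = np.zeros((self.rows, self.cols))
        for r in range(self.rows):
            np.add.at(table[r], self.bucket[r], self.sign[r] * g)
        return table

    def decompress(self, table, k):
        """Median-of-rows estimate of each coordinate, then keep the top-k heavy hitters."""
        est = np.median(self.sign * table[np.arange(self.rows)[:, None], self.bucket], axis=0)
        g_hat = np.zeros(self.dim)
        topk = np.argpartition(np.abs(est), -k)[-k:]
        g_hat[topk] = est[topk]
        return g_hat


def amsgrad_step(x, g_hat, state, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One AMSGrad-style update driven by the de-sketched gradient."""
    m, v, v_max = state
    m = beta1 * m + (1 - beta1) * g_hat
    v = beta2 * v + (1 - beta2) * g_hat ** 2
    v_max = np.maximum(v_max, v)               # AMSGrad: monotone second-moment estimate
    x = x - lr * m / (np.sqrt(v_max) + eps)
    return x, (m, v, v_max)


# Toy round: n workers sketch local gradients; the server averages the sketches
# (they are linear), recovers an approximate top-k gradient, and updates once.
d, n, k = 10_000, 4, 100
sketcher = CountSketch(d)
local_grads = [np.random.default_rng(i).standard_normal(d) for i in range(n)]
avg_sketch = sum(sketcher.compress(g) for g in local_grads) / n
g_hat = sketcher.decompress(avg_sketch, k)
x, state = np.zeros(d), (np.zeros(d), np.zeros(d), np.zeros(d))
x, state = amsgrad_step(x, g_hat, state)
```

With `rows` on the order of $\log(d)$ and `cols` on the order of $k$, each worker transmits $O(k \log(d))$ numbers per iteration instead of $O(d)$, which is the per-iteration communication cost quoted in the abstract.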
Related papers
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization [0.0]
DADAO is the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a given network of size $n$.
We show that our algorithm requires $\mathcal{O}(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ local and only $\mathcal{O}(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log($
arXiv Detail & Related papers (2022-07-26T08:47:54Z) - Near-Optimal Sparse Allreduce for Distributed Deep Learning [18.99898181586806]
Communication overhead is one of the major obstacles to train large deep learning models at scale.
This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients.
arXiv Detail & Related papers (2022-01-19T13:56:57Z) - Acceleration in Distributed Optimization Under Similarity [72.54787082152278]
We study distributed (strongly convex) optimization problems over a network of agents, with no centralized nodes.
An $\varepsilon$-solution is achieved in $\tilde{\mathcal{O}}\big(\sqrt{\frac{\beta/\mu}{1-\rho}}\log(1/\varepsilon)\big)$ number of communication steps.
This rate matches, for the first time and up to poly-log factors, the lower communication complexity bounds of distributed gossip algorithms applied to the class of problems of interest.
arXiv Detail & Related papers (2021-10-24T04:03:00Z) - A Distributed Training Algorithm of Generative Adversarial Networks with
Quantized Gradients [8.202072658184166]
We propose a distributed GANs training algorithm with quantized gradient, dubbed DQGAN, which is the first distributed training method with quantized gradient for GANs.
The new method trains GANs based on a specific single-machine algorithm called the Optimistic Mirror Descent (OMD) algorithm, and is applicable to any gradient compression method that satisfies a general $\delta$-approximate compressor (a minimal top-$k$ example appears after this list).
Theoretically, we establish the non-asymptotic convergence of DQGAN algorithm to first-order stationary point, which shows that the proposed algorithm can achieve a linear speedup in the
arXiv Detail & Related papers (2020-10-26T06:06:43Z) - Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge
Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization at large scale with a deep neural network as the predictive model.
Our method requires a much smaller number of communication rounds, with guarantees established in theory.
Our experiments on several datasets demonstrate the effectiveness of our method and corroborate our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
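Several entries above (O$k$-Top$k$, DQGAN, Sparse Communication for Training Deep Networks) compress gradients before communication. The short sketch below illustrates one standard compressor of this kind, top-$k$ sparsification, and empirically checks the $\delta$-approximate compressor property mentioned in the DQGAN summary with $\delta = k/d$; the function name and the dimensions are illustrative assumptions, not code from any of those papers.

```python
import numpy as np

def topk_compress(g, k):
    """Keep the k largest-magnitude entries of g; zero out the rest."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

# Empirical check of the delta-approximate compressor property with
# delta = k/d:  ||g - C(g)||^2 <= (1 - delta) * ||g||^2
rng = np.random.default_rng(0)
d, k = 1_000, 50
g = rng.standard_normal(d)
err = np.linalg.norm(g - topk_compress(g, k)) ** 2
bound = (1 - k / d) * np.linalg.norm(g) ** 2
assert err <= bound
print(f"top-{k} error {err:.2f} <= (1 - k/d) * ||g||^2 = {bound:.2f}")
```

Top-$k$ discards only the smallest-magnitude coordinates, so the squared error it introduces is at most the average energy of the dropped entries, which gives the $(1 - k/d)$ factor checked above.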