Caramel: Accelerating Decentralized Distributed Deep Learning with
Computation Scheduling
- URL: http://arxiv.org/abs/2004.14020v1
- Date: Wed, 29 Apr 2020 08:32:33 GMT
- Title: Caramel: Accelerating Decentralized Distributed Deep Learning with
Computation Scheduling
- Authors: Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, Brighten Godfrey, Roy
Campbell
- Abstract summary: Caramel is a system that accelerates distributed deep learning through model-aware scheduling and communication optimizations for AllReduce.
Caramel maintains the correctness of the dataflow model, is hardware-independent, and does not require any user-level or framework-level changes.
- Score: 1.5785002371773138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The method of choice for parameter aggregation in Deep Neural Network (DNN)
training, a network-intensive task, is shifting from the Parameter Server model
to decentralized aggregation schemes (AllReduce) inspired by theoretical
guarantees of better performance. However, current implementations of AllReduce
overlook the interdependence of communication and computation, resulting in
significant performance degradation. In this paper, we develop Caramel, a
system that accelerates decentralized distributed deep learning through
model-aware computation scheduling and communication optimizations for
AllReduce. Caramel achieves this goal through (a) computation DAG scheduling
that expands the feasible window of transfer for each parameter (transfer
boundaries), and (b) network optimizations for smoothing of the load,
including adaptive batching and pipelining of parameter transfers. Caramel
maintains the correctness of the dataflow model, is hardware-independent, and
does not require any user-level or framework-level changes. We implement
Caramel over TensorFlow and show that the iteration time of DNN training can be
improved by up to 3.62x in a cloud environment.
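
The second optimization is easy to picture in code. Below is a toy Python sketch of adaptive batching for AllReduce, one of the load-smoothing techniques the abstract names. The 25 MB threshold, the reverse-layer ready order, and all function names are illustrative assumptions, not Caramel's actual scheduler.

```python
import numpy as np

# Toy sketch of adaptive batching for AllReduce. NOT Caramel's
# implementation: the bucket threshold and ready order are assumptions.
BUCKET_BYTES = 25 * 1024 * 1024  # assumed fusion threshold

def make_buckets(param_sizes_bytes):
    """Group per-parameter gradients into fused AllReduce buckets.

    Backpropagation produces gradients from the last layer to the first,
    so we walk the parameters in reverse and emit a bucket as soon as it
    is large enough to amortize per-transfer overhead. Fewer, larger
    transfers smooth the network load."""
    buckets, current, current_bytes = [], [], 0
    for idx, size in reversed(list(enumerate(param_sizes_bytes))):
        current.append(idx)
        current_bytes += size
        if current_bytes >= BUCKET_BYTES:
            buckets.append(current)
            current, current_bytes = [], 0
    if current:
        buckets.append(current)  # flush the final partial bucket
    return buckets

def allreduce_sim(worker_grads):
    """Stand-in for a real AllReduce: average one bucket across workers."""
    return np.mean(worker_grads, axis=0)

# Example: six parameters (sizes in bytes, first layer first).
sizes = [40e6, 30e6, 4e6, 1e6, 1e6, 0.5e6]
print(make_buckets(sizes))                                      # [[5, 4, 3, 2, 1], [0]]
print(allreduce_sim([np.full(3, w, float) for w in range(4)]))  # [1.5 1.5 1.5]
```

Pipelining then overlaps these fused transfers with the remaining backward computation, exploiting the wider transfer windows that the DAG scheduling exposes.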
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices [0.0]
Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters.
We have framed our asynchronous SGD loss function as a block structured optimization problem with delayed updates.
arXiv Detail & Related papers (2024-01-03T13:07:07Z)
- A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale [5.206015354543744]
Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks; a toy sketch of its basic update rule appears after this list.
We provide a complete description of the algorithm as well as the performance optimizations that our implementation leverages to train deep networks at-scale in PyTorch.
arXiv Detail & Related papers (2023-09-12T18:11:10Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- A Low-Complexity Approach to Rate-Distortion Optimized Variable Bit-Rate Compression for Split DNN Computing [5.3221129103999125]
Split computing has emerged as a recent paradigm for implementation of DNN-based AI workloads.
We present an approach that addresses the challenge of optimizing the rate-accuracy-complexity trade-off.
Our approach is remarkably lightweight during both training and inference, highly effective, and achieves excellent rate-distortion performance.
arXiv Detail & Related papers (2022-08-24T15:02:11Z)
- Transformer Network-based Reinforcement Learning Method for Power Distribution Network (PDN) Optimization of High Bandwidth Memory (HBM) [4.829921419076774]
We propose a transformer network-based reinforcement learning (RL) method for power distribution network (PDN) optimization of high bandwidth memory (HBM).
The proposed method can provide an optimal decoupling capacitor (decap) design to maximize the reduction of PDN self- and transfer impedances seen at multiple ports.
arXiv Detail & Related papers (2022-03-29T16:27:54Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, the computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
Deep equilibrium models forgo traditional network depth and instead compute the output of a network by finding the fixed point of a single nonlinear layer; a minimal sketch of this forward pass appears after this list.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
- Accelerating Neural Network Training with Distributed Asynchronous and Selective Optimization (DASO) [0.0]
We introduce the Distributed Asynchronous and Selective Optimization (DASO) method to accelerate network training.
DASO uses a hierarchical and asynchronous communication scheme composed of node-local and global networks; a schematic of this two-level averaging pattern appears after this list.
We show that DASO yields a reduction in training time of up to 34% on classical and state-of-the-art networks.
arXiv Detail & Related papers (2021-04-12T16:02:20Z)
- Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
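
As a point of reference for the Shampoo entry above, here is a toy NumPy sketch of the basic single-matrix Shampoo update (Gupta et al., 2018). It omits everything that makes the cited at-scale implementation interesting (blocking, grafting, distributed preconditioner computation); the learning rate and epsilon are illustrative assumptions.

```python
import numpy as np

def matrix_power(M, p):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.maximum(vals, 1e-12) ** p) @ vecs.T

def shampoo_step(W, G, L, R, lr=0.1, eps=1e-4):
    """One Shampoo update for a matrix parameter W with gradient G.
    L and R accumulate row- and column-side second-moment statistics;
    the gradient is preconditioned by L^{-1/4} on the left and R^{-1/4}
    on the right."""
    L += G @ G.T
    R += G.T @ G
    pG = (matrix_power(L + eps * np.eye(L.shape[0]), -0.25)
          @ G
          @ matrix_power(R + eps * np.eye(R.shape[0]), -0.25))
    return W - lr * pG, L, R

# Toy usage: minimize ||W - T||_F^2, whose gradient is 2 (W - T).
rng = np.random.default_rng(0)
W, T = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
L, R = np.zeros((4, 4)), np.zeros((3, 3))
for _ in range(200):
    W, L, R = shampoo_step(W, 2 * (W - T), L, R)
print(np.linalg.norm(W - T))  # distance to the target shrinks
```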
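
The equilibrium-networks entry rests on one idea: the output is a fixed point z* = f(z*, x) of a single layer. A minimal sketch of that forward pass, assuming a tanh layer and plain fixed-point iteration (real DEQs use faster root solvers such as Anderson acceleration or Broyden's method, and differentiate implicitly through the fixed point):

```python
import numpy as np

def deq_forward(x, W, U, b, tol=1e-6, max_iter=500):
    """Find z* with z* = tanh(W z* + U x + b) by fixed-point iteration.
    Converges when the layer is a contraction (here, W is scaled small)."""
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

rng = np.random.default_rng(1)
d, d_in = 8, 4
W = 0.3 * rng.normal(size=(d, d)) / np.sqrt(d)  # scaled to keep ||W|| < 1
U, b = rng.normal(size=(d, d_in)), rng.normal(size=d)
x = rng.normal(size=d_in)
z_star = deq_forward(x, W, U, b)
print(np.linalg.norm(z_star - np.tanh(W @ z_star + U @ x + b)))  # ~0
```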
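
Finally, the DASO entry describes communication split between node-local and global networks. The schematic below simulates one synchronous round of that two-level averaging pattern in a single process; it illustrates the hierarchy only, not DASO's asynchronous and selective synchronization schedule.

```python
import numpy as np

def hierarchical_average(grads, gpus_per_node):
    """Two-level gradient averaging, simulated in one process.

    Step 1: GPUs within a node average locally (fast interconnect).
    Step 2: one representative per node averages across nodes (slow network).
    Step 3: the global result is broadcast back to every GPU.
    With equal node sizes, the mean of node means equals the global mean."""
    nodes = [grads[i:i + gpus_per_node]
             for i in range(0, len(grads), gpus_per_node)]
    local_means = [np.mean(node, axis=0) for node in nodes]  # node-local reduce
    global_mean = np.mean(local_means, axis=0)               # inter-node reduce
    return [global_mean.copy() for _ in grads]               # broadcast

# 2 nodes x 2 GPUs with scalar "gradients" 1..4: every GPU ends with 2.5.
grads = [np.array([float(g)]) for g in range(1, 5)]
print(hierarchical_average(grads, gpus_per_node=2))
```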
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.