Empowering Distributed Training with Sparsity-driven Data Synchronization
- URL: http://arxiv.org/abs/2309.13254v2
- Date: Sat, 14 Dec 2024 00:20:13 GMT
- Title: Empowering Distributed Training with Sparsity-driven Data Synchronization
- Authors: Zhuang Wang, Zhaozhuo Xu, Jingyi Xi, Yuke Wang, Anshumali Shrivastava, T. S. Eugene Ng
- Abstract summary: Distributed training is the de facto standard to scale up the training of deep learning models with multiple GPUs.
We first analyze the characteristics of sparse tensors in popular models to understand the fundamentals of sparsity.
We then systematically explore the design space of communication schemes for sparse tensors and find the optimal ones.
We demonstrate that Zen can achieve up to 5.09x speedup in communication time and up to 2.48x speedup in training throughput.
- Score: 33.95040042348349
- Abstract: Distributed training is the de facto standard to scale up the training of deep learning models with multiple GPUs. Its performance bottleneck lies in communications for gradient synchronization. Although high tensor sparsity is widely observed, the optimal communication scheme to fully leverage sparsity is still missing. This paper aims to bridge this gap. We first analyze the characteristics of sparse tensors in popular models to understand the fundamentals of sparsity. We then systematically explore the design space of communication schemes for sparse tensors and find the optimal ones. These findings give a new understanding and inspire us to develop a holistic gradient synchronization system called Zen for sparse tensors. We demonstrate that Zen can achieve up to 5.09x speedup in communication time and up to 2.48x speedup in training throughput compared to the state-of-the-art methods.
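As background for the abstract above, the snippet below is a minimal, single-process illustration of sparsity-driven synchronization in general, not Zen's actual communication scheme: each worker's gradient is reduced to (index, value) pairs, the exchange is simulated by a plain loop standing in for an allgather, and the dense sum is rebuilt with a scatter-add. PyTorch is assumed; the ~99%-sparse gradients, tensor shape, and worker count are arbitrary choices for illustration.

```python
# Minimal sketch (not Zen's algorithm): synchronize sparse gradients by
# exchanging (index, value) pairs instead of dense tensors. The multi-worker
# exchange is simulated in a single process; PyTorch is assumed.
import torch

def to_sparse(grad: torch.Tensor):
    """Flatten a gradient and keep only its non-zero entries."""
    flat = grad.reshape(-1)
    idx = flat.nonzero(as_tuple=True)[0]
    return idx, flat[idx]

def sparse_sync(grads):
    """Sum per-worker sparse gradients into one dense tensor (loop stands in for allgather)."""
    out = torch.zeros_like(grads[0]).reshape(-1)
    for g in grads:
        idx, vals = to_sparse(g)
        out.index_add_(0, idx, vals)  # scatter-add each worker's contribution
    return out.reshape(grads[0].shape)

# Toy setup: 4 workers, ~99%-sparse gradients of one layer.
torch.manual_seed(0)
workers = [torch.randn(1024, 1024) * (torch.rand(1024, 1024) > 0.99) for _ in range(4)]
synced = sparse_sync(workers)

dense_bytes  = workers[0].numel() * 4                                # fp32 dense payload per worker
sparse_bytes = sum(int((g != 0).sum()) * (8 + 4) for g in workers)   # int64 index + fp32 value, all workers
print(f"dense per worker: {dense_bytes / 2**20:.2f} MiB, "
      f"sparse total: {sparse_bytes / 2**20:.2f} MiB")
```

Even in this toy setting the sparse payload is a small fraction of the dense one; which exchange primitive to use for it is exactly the design-space question the paper studies.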
Related papers
- An Efficient Sparse Kernel Generator for O(3)-Equivariant Deep Networks [0.5737287537823071]
Rotation equivariant graph neural networks yield state-of-the-art performance on spatial deep learning tasks.
Key to these models is the Clebsch-Gordan (CG) tensor product, a kernel that contracts two dense feature vectors with a highly structured sparse tensor to produce a dense output vector.
We introduce a GPU sparse kernel generator for the CG tensor product that provides significant speedup over the best existing open and closed-source implementations.
arXiv Detail & Related papers (2025-01-23T08:20:47Z)
- Coarse-To-Fine Tensor Trains for Compact Visual Representations [19.216356079910533]
'Prolongation Upsampling Train' is a novel method for learning tensor train representations in a coarse-to-fine manner.
We evaluate our representation along three axes: (1) compression, (2) denoising capability, and (3) image completion capability.
arXiv Detail & Related papers (2024-06-06T17:59:23Z)
- Communication-Free Distributed GNN Training with Vertex Cut [63.22674903170953]
CoFree-GNN is a novel distributed GNN training framework that significantly speeds up the training process by implementing communication-free training.
We demonstrate that CoFree-GNN speeds up the GNN training process by up to 10 times over the existing state-of-the-art GNN training approaches.
arXiv Detail & Related papers (2023-08-06T21:04:58Z)
- Dynamic Sparsity Is Channel-Level Sparsity Learner [91.31071026340746]
Dynamic sparse training (DST) is a leading sparse training approach.
Channel-aware dynamic sparse (Chase) seamlessly translates the promise of unstructured dynamic sparsity to channel-level sparsity.
Our approach translates unstructured sparsity to channel-wise sparsity.
arXiv Detail & Related papers (2023-05-30T23:33:45Z)
- Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments.
In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network.
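As a quick illustration of the N:M pattern itself (not of this paper's from-scratch training method), the sketch below builds a 2:4 mask that keeps the two largest-magnitude weights in every group of four consecutive weights; PyTorch and a toy weight matrix are assumptions for the example.

```python
# Minimal sketch of N:M fine-grained structured sparsity (here 2:4): in every
# group of m consecutive weights, keep only the n largest-magnitude entries.
# Illustrative mask construction only, not the paper's training recipe.
import torch

def nm_sparsify(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Zero out all but the n largest-magnitude values in each group of m weights."""
    flat = weight.reshape(-1, m)                        # group consecutive weights
    keep = flat.abs().topk(n, dim=1).indices            # indices of the n survivors per group
    mask = torch.zeros_like(flat).scatter_(1, keep, 1.0)
    return (flat * mask).reshape(weight.shape)

w = torch.randn(8, 16)                 # toy weight matrix; 16 is divisible by m=4
w_24 = nm_sparsify(w)
print((w_24.reshape(-1, 4) != 0).sum(dim=1))  # every group of 4 keeps exactly 2 weights
```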
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
- DeepReduce: A Sparse-tensor Communication Framework for Distributed Deep Learning [79.89085533866071]
This paper introduces DeepReduce, a versatile framework for the compressed communication of sparse tensors.
DeepReduce decomposes tensors into two sets, values and indices, and allows both independent and combined compression of these sets.
Our experiments with large real models demonstrate that DeepReduce transmits less data and imposes lower computational overhead than existing methods.
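The values/indices split lends itself to a short illustration. The sketch below treats the two sets as independent streams and compresses them separately (half-precision values, delta-encoded indices); it is a generic example under those assumptions, not DeepReduce's actual codecs or API.

```python
# Minimal sketch of the values/indices split: compress the two streams
# independently (fp16 values, delta-encoded int32 indices). Generic example,
# not DeepReduce's actual codecs.
import torch

def decompose(grad: torch.Tensor):
    flat = grad.reshape(-1)
    idx = flat.nonzero(as_tuple=True)[0]
    return idx, flat[idx]

def compress(idx: torch.Tensor, vals: torch.Tensor):
    deltas = torch.diff(idx, prepend=idx.new_zeros(1))  # gaps between indices are small
    return deltas.to(torch.int32), vals.to(torch.float16)

def decompress(deltas: torch.Tensor, vals16: torch.Tensor, numel: int) -> torch.Tensor:
    idx = torch.cumsum(deltas.to(torch.int64), dim=0)   # undo the delta encoding
    out = torch.zeros(numel)
    out[idx] = vals16.float()
    return out

g = torch.randn(10_000) * (torch.rand(10_000) > 0.99)   # ~99%-sparse toy gradient
deltas, v16 = compress(*decompose(g))
g_hat = decompress(deltas, v16, g.numel())
print("max reconstruction error:", (g - g_hat).abs().max().item())  # fp16 rounding only
```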
arXiv Detail & Related papers (2021-02-05T11:31:24Z)
- Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks [78.47459801017959]
Sparsity can reduce the memory footprint of regular networks so they fit on mobile devices.
We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice.
arXiv Detail & Related papers (2021-01-31T22:48:50Z)
- Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters [30.4449309904155]
We propose a new top-k sparsification communication library for distributed training.
We show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformers.
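For context, the sketch below shows generic top-k gradient sparsification with error feedback, a common ingredient of such communication libraries; it is a simplified illustration under that assumption, not the cited library's implementation, with PyTorch used for convenience.

```python
# Minimal sketch of top-k gradient sparsification with error feedback; a
# simplified, generic compressor, not the cited library's implementation.
import torch

class TopKCompressor:
    """Send only the k largest-magnitude gradient entries; carry the rest forward."""
    def __init__(self, k: int):
        self.k = k
        self.residual = None               # error feedback: gradient mass not yet sent

    def compress(self, grad: torch.Tensor):
        flat = grad.reshape(-1)
        if self.residual is None:
            self.residual = torch.zeros_like(flat)
        acc = flat + self.residual                   # add back previously dropped values
        idx = acc.abs().topk(self.k).indices         # k largest-magnitude entries
        vals = acc[idx]
        self.residual = acc.clone()
        self.residual[idx] = 0.0                     # sent values are no longer owed
        return idx, vals                             # this (index, value) pair is communicated

comp = TopKCompressor(k=100)
idx, vals = comp.compress(torch.randn(10_000))
print(idx.shape, vals.shape)  # torch.Size([100]) torch.Size([100])
```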
arXiv Detail & Related papers (2020-10-20T17:16:29Z)
- ShadowSync: Performing Synchronization in the Background for Highly Scalable Distributed Training [10.73956838502053]
We present ShadowSync, a distributed framework specifically tailored to modern-scale recommendation system training.
In contrast to previous works where synchronization happens as part of the training process, ShadowSync separates the synchronization from training and runs it in the background.
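The idea of taking synchronization off the training critical path can be sketched in a few lines. The single-process toy below uses a background thread that periodically averages a local parameter vector with an in-process stand-in for global state; it is only an assumption-laden illustration of background synchronization, not ShadowSync's actual algorithm or system.

```python
# Minimal single-process sketch of background synchronization: the training
# loop only applies local updates, while a separate thread periodically blends
# the local parameters with an in-process stand-in for global state.
# Illustrative only, not ShadowSync's actual algorithm.
import threading
import time
import torch

params = torch.randn(1000)          # local model parameters
global_copy = params.clone()        # stand-in for the globally shared copy
lock = threading.Lock()
stop = threading.Event()

def background_sync(period_s: float = 0.05):
    """Runs outside the training loop, in the spirit of background synchronization."""
    while not stop.is_set():
        time.sleep(period_s)
        with lock:
            avg = 0.5 * (params + global_copy)   # simple averaging as the sync rule
            params.copy_(avg)
            global_copy.copy_(avg)

syncer = threading.Thread(target=background_sync, daemon=True)
syncer.start()

for step in range(200):                          # "training": local updates only
    grad = torch.randn_like(params)
    with lock:
        params.add_(grad, alpha=-0.01)           # the step never waits on synchronization I/O

stop.set()
syncer.join()
print("local vs. global drift:", (params - global_copy).norm().item())
```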
arXiv Detail & Related papers (2020-03-07T00:26:26Z)