FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
- URL: http://arxiv.org/abs/2406.06858v5
- Date: Wed, 23 Oct 2024 18:45:33 GMT
- Title: FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
- Authors: Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu
- Abstract summary: This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations on GPUs.
Flux can potentially overlap up to 96% of communication given a fused kernel.
Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects.
- Score: 9.5114389643299
- License:
- Abstract: Large deep learning models have demonstrated a strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique that partitions the computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that can contribute a significant portion of the overall runtime, which limits the scalability of the technique within a group of devices with high-speed interconnects, such as NVLink-connected GPUs in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations on GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster of 8 GPUs with various GPU generations and interconnects.
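To make the decompose-and-overlap idea concrete, here is a minimal PyTorch sketch of a row-parallel linear layer whose all-reduce is issued chunk by chunk, so the reduction of one output chunk overlaps the matmul of the next. This only illustrates the general principle at coarse, stream-level granularity; Flux itself performs the decomposition at a much finer grain inside a single fused kernel, and all function and parameter names below are illustrative, not taken from the paper.

```python
# Hedged sketch: chunked row-parallel GEMM with overlapped all-reduce.
# Assumes a tensor-parallel group launched via torchrun; names are illustrative.
import os
import torch
import torch.distributed as dist

def row_parallel_linear_overlapped(x_shard, w_shard, num_chunks=4):
    """x_shard: (M, K/world) local activations, w_shard: (K/world, N) local weights.

    Each rank computes a partial product; partial outputs are summed across
    ranks. Issuing the all-reduce per output chunk lets NCCL reduce chunk i
    while the matmul for chunk i+1 still runs on the compute stream.
    """
    out = torch.empty(x_shard.size(0), w_shard.size(1),
                      device=x_shard.device, dtype=x_shard.dtype)
    handles = []
    for xc, oc in zip(x_shard.chunk(num_chunks, dim=0),
                      out.chunk(num_chunks, dim=0)):
        torch.matmul(xc, w_shard, out=oc)                    # partial result for this chunk
        handles.append(dist.all_reduce(oc, async_op=True))   # start summing it right away
    for h in handles:
        h.wait()                                             # reductions finish behind later matmuls
    return out

if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=<N> flux_overlap_sketch.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    x = torch.randn(8192, 1024, device="cuda", dtype=torch.float16)
    w = torch.randn(1024, 4096, device="cuda", dtype=torch.float16)
    y = row_parallel_linear_overlapped(x, w)
    dist.destroy_process_group()
```

Note that this stream-level version still pays per-chunk launch and synchronization overheads; fusing the fine-grained compute and communication into one larger kernel, as the abstract describes, is what lets Flux hide communication without compromising kernel efficiency.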
Related papers
- Distributed Convolutional Neural Network Training on Mobile and Edge Clusters [0.9421843976231371]
Recent efforts have emerged to localize machine learning tasks fully on the edge.
This brings advantages in reduced latency and increased privacy, but necessitates working with resource-constrained devices.
We describe an approach for distributed CNN training exclusively on mobile and edge devices.
arXiv Detail & Related papers (2024-09-11T02:44:28Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability introduced by peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - SPEED: Streaming Partition and Parallel Acceleration for Temporal
Interaction Graph Embedding [22.68416593780539]
We introduce a novel training approach, namely Streaming Edge Partitioning and Parallel Acceleration, for Temporal Interaction Graph Embedding.
Our method can achieve a good balance in computing resources, computing time, and downstream task performance.
Empirical validation across 7 real-world datasets demonstrates the potential to expedite training speeds by a factor of up to 19.29x.
arXiv Detail & Related papers (2023-08-27T15:11:44Z) - FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z) - Collaborative Learning over Wireless Networks: An Introductory Overview [84.09366153693361]
We will mainly focus on collaborative training across wireless devices.
Many distributed optimization algorithms have been developed over the last decades.
They provide data locality; that is, a joint model can be trained collaboratively while the data available at each participating device remains local.
arXiv Detail & Related papers (2021-12-07T20:15:39Z) - AxoNN: An asynchronous, message-driven parallel framework for
extreme-scale deep learning [1.5301777464637454]
AxoNN is a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU.
By using the CPU memory as a scratch space for offloading data periodically during training, AxoNN is able to reduce GPU memory consumption by a factor of four.
arXiv Detail & Related papers (2021-10-25T14:43:36Z) - Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z) - DistGNN: Scalable Distributed Training for Large-Scale Graph Neural
Networks [58.48833325238537]
Full-batch training of Graph Neural Networks (GNNs) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z) - Large Graph Convolutional Network Training with GPU-Oriented Data
Communication Architecture [19.2129567657739]
Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems.
Current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features.
This approach, however, puts tremendous pressure on host memory bandwidth and the CPU.
We propose a novel GPU-oriented data communication approach for GCN training, where GPU threads directly access sparse features in host memory.
arXiv Detail & Related papers (2021-03-04T21:00:17Z)
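The key idea in the last entry, GPU threads reading sparse features directly out of host memory instead of having the CPU gather them, corresponds to zero-copy access to mapped, page-locked memory. The snippet below is a generic, hedged illustration of that technique using Numba's CUDA bindings rather than the paper's system; the table sizes, batch size, and kernel name are made up for the example.

```python
# Hedged sketch: zero-copy feature gather, with the feature table kept in
# mapped (page-locked, GPU-visible) host memory so GPU threads read it directly.
import numpy as np
from numba import cuda

@cuda.jit
def gather_features(node_ids, feature_table, out):
    # One thread per requested node; each row is fetched straight from host memory.
    i = cuda.grid(1)
    if i < node_ids.shape[0]:
        nid = node_ids[i]
        for j in range(feature_table.shape[1]):
            out[i, j] = feature_table[nid, j]

num_nodes, feat_dim, batch = 100_000, 64, 1024

# Feature table lives in pinned host memory mapped into the GPU address space.
feature_table = cuda.mapped_array((num_nodes, feat_dim), dtype=np.float32)
feature_table[:] = np.random.rand(num_nodes, feat_dim).astype(np.float32)

node_ids = cuda.to_device(np.random.randint(0, num_nodes, size=batch))
out = cuda.device_array((batch, feat_dim), dtype=np.float32)

threads = 256
gather_features[(batch + threads - 1) // threads, threads](node_ids, feature_table, out)
cuda.synchronize()
features = out.copy_to_host()   # gathered rows, ready for the GNN layer
```

Whether zero-copy reads beat CPU-side gathering plus bulk copies depends on the interconnect and access pattern; the point made in the summary above is that gathering directly from GPU threads relieves host memory bandwidth and CPU pressure.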
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.