GPU-centric Communication Schemes for HPC and ML Applications
- URL: http://arxiv.org/abs/2503.24230v1
- Date: Mon, 31 Mar 2025 15:43:18 GMT
- Title: GPU-centric Communication Schemes for HPC and ML Applications
- Authors: Naveen Namashivayam
- Abstract summary: GPU-aware communication schemes move the GPU-attached communication buffers in the application directly from the GPU to the NIC without staging through the host memory. A CPU thread is required to orchestrate the communication operations even with support for such GPU-awareness. This survey discusses various available GPU-centric communication schemes that move the control path of the communication operations from the CPU to the GPU.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compute nodes on modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects (NICs). Parallelization is identified as a technique for effectively utilizing these systems to execute scalable simulation and deep learning workloads. The resulting inter-process communication from the distributed execution of these parallel workloads is one of the key factors contributing to their performance bottlenecks. Most programming models and runtime systems enabling the communication requirements on these systems support GPU-aware communication schemes that move the GPU-attached communication buffers in the application directly from the GPU to the NIC without staging through the host memory. A CPU thread is required to orchestrate the communication operations even with support for such GPU-awareness. This survey discusses various available GPU-centric communication schemes that move the control path of the communication operations from the CPU to the GPU. This work presents the need for the new communication schemes, various GPU and NIC capabilities required to implement the schemes, and the potential use-cases addressed. Based on these discussions, challenges involved in supporting the exhibited GPU-centric communication schemes are discussed.
Related papers
- FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation [6.284874558004134]
We propose FlashOverlap, a lightweight design characterized by tile-wise overlapping, interference-free computation, and communication agnosticism.
Experiments show that such a lightweight design achieves up to 1.65x speedup, outperforming existing works in most cases.
arXiv Detail & Related papers (2025-04-28T06:37:57Z)
- Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning [2.685330831042324]
We propose a collection of communication and optimization strategies for ZeRO++ to reduce communication costs and improve memory utilization.
For a 20B GPT model, we observe a 1.71x increase in TFLOPS per GPU when compared with ZeRO++ up to 384 GCDs and a scaling efficiency of 0.94 for up to 384 GCDs.
arXiv Detail & Related papers (2025-01-08T04:19:57Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion [9.5114389643299]
This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs.
Flux can potentially overlap up to 96% of communication given a fused kernel.
Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects.
arXiv Detail & Related papers (2024-06-11T00:17:39Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Communication-Efficient Graph Neural Networks with Probabilistic Neighborhood Expansion Analysis and Caching [59.8522166385372]
Training and inference with graph neural networks (GNNs) on massive graphs has been actively studied since the inception of GNNs.
This paper is concerned with minibatch training and inference with GNNs that employ node-wise sampling in distributed settings.
We present SALIENT++, which extends the prior state-of-the-art SALIENT system to work with partitioned feature data.
arXiv Detail & Related papers (2023-05-04T21:04:01Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Heterogeneous Acceleration Pipeline for Recommendation System Training [1.8457649813040096]
Recommendation models rely on deep learning networks and large embedding tables.
These models are typically trained using hybrid CPU-GPU or GPU-only configurations.
This paper introduces Hotline, a heterogeneous CPU acceleration pipeline.
arXiv Detail & Related papers (2022-04-11T23:10:41Z)
- High Performance Hyperspectral Image Classification using Graphics Processing Units [0.0]
Real-time remote sensing applications require onboard real-time processing capabilities.
Lightweight, small, low-power hardware is essential for onboard real-time processing systems.
arXiv Detail & Related papers (2021-05-30T09:26:03Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
- Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) On the CPU, a very fast optical flow method that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)