Efficient MPI-based Communication for GPU-Accelerated Dask Applications
- URL: http://arxiv.org/abs/2101.08878v1
- Date: Thu, 21 Jan 2021 22:59:08 GMT
- Title: Efficient MPI-based Communication for GPU-Accelerated Dask Applications
- Authors: Aamir Shafi, Jahanzeb Maqbool Hashmi, Hari Subramoni and Dhabaleswar
K. Panda
- Abstract summary: This paper presents the design and implementation of a new communication backend for Dask -- called MPI4Dask.
MPI4Dask exploits mpi4py over MVAPICH2-GDR, which is a GPU-aware implementation of the Message Passing Interface standard.
Our latency and throughput comparisons suggest that MPI4Dask outperforms UCX by 6x for 1-Byte messages and 4x for large messages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dask is a popular parallel and distributed computing framework, which rivals
Apache Spark to enable task-based scalable processing of big data. The Dask
Distributed library forms the basis of this computing engine and provides
support for adding new communication devices. It currently has two
communication devices: one for TCP and the other for high-speed networks using
UCX-Py -- a Cython wrapper to UCX. This paper presents the design and
implementation of a new communication backend for Dask -- called MPI4Dask --
that is targeted for modern HPC clusters built with GPUs. MPI4Dask exploits
mpi4py over MVAPICH2-GDR, which is a GPU-aware implementation of the Message
Passing Interface (MPI) standard. MPI4Dask provides point-to-point asynchronous
I/O communication coroutines, which are non-blocking concurrent operations
defined using the async/await keywords from Python's asyncio framework. Our
latency and throughput comparisons suggest that MPI4Dask outperforms UCX by 6x
for 1-Byte messages and 4x for large messages (2 MBytes and beyond),
respectively. We also conduct comparative performance evaluation of MPI4Dask
with UCX using two benchmark applications: 1) sum of cuPy array with its
transpose, and 2) cuDF merge. MPI4Dask speeds up the overall execution time of
the two applications by an average of 3.47x and 3.11x respectively on an
in-house cluster built with NVIDIA Tesla V100 GPUs for 1-6 Dask workers. We
also perform scalability analysis of MPI4Dask against UCX for these
applications on TACC's Frontera (GPU) system with up to 32 Dask workers on 32
NVIDIA Quadro RTX 5000 GPUs and 256 CPU cores. MPI4Dask speeds up the execution
time for cuPy and cuDF applications by an average of 1.71x and 2.91x
respectively for 1-32 Dask workers on the Frontera (GPU) system.
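The abstract describes MPI4Dask's core mechanism as point-to-point communication coroutines that poll non-blocking MPI operations while yielding to Python's asyncio event loop. A minimal sketch of that polling-plus-yield pattern follows; since mpi4py and a GPU cluster are not assumed available here, `FakeRequest` is a stand-in for the `MPI.Request` handle that mpi4py's `comm.Isend`/`comm.Irecv` would return, and `wait_for_request` and `isend` are hypothetical helpers, not MPI4Dask's actual API.

```python
import asyncio

class FakeRequest:
    """Stand-in for mpi4py's MPI.Request: Test() reports completion
    after a few polls, mimicking a non-blocking send in flight."""
    def __init__(self, polls_needed=3):
        self._polls = polls_needed

    def Test(self):
        self._polls -= 1
        return self._polls <= 0

async def wait_for_request(req):
    """Poll the request without blocking: between Test() calls,
    yield control to the asyncio event loop so other coroutines
    (e.g. other Dask worker transfers) can make progress."""
    while not req.Test():
        await asyncio.sleep(0)  # cooperative yield, not a busy wait

async def isend(data):
    # With mpi4py this would be: req = comm.Isend(buf, dest=rank, tag=tag)
    req = FakeRequest()
    await wait_for_request(req)
    return len(data)

sent = asyncio.run(isend(b"hello"))
print(sent)  # 5
```

The key design point this illustrates is that the coroutine never blocks inside MPI: completion is detected by repeated `Test()` calls interleaved with event-loop yields, which is what lets a single-threaded Dask worker overlap many concurrent transfers.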
Related papers
- FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion [9.743943561871825]
This paper proposes a novel method, Flux, to significantly hide communication latency by overlapping it with dependent computation on GPUs.
Flux can potentially overlap up to 96% of communication given a fused kernel.
Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects.
arXiv Detail & Related papers (2024-06-11T00:17:39Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - HDCC: A Hyperdimensional Computing compiler for classification on
embedded systems and high-performance computing [58.720142291102135]
This work introduces HDCC, the first open-source compiler that translates high-level descriptions of HDC classification methods into optimized C code.
HDCC is designed like a modern compiler, featuring an intuitive and descriptive input language, an intermediate representation (IR), and a retargetable backend.
To substantiate these claims, we conducted experiments with HDCC on several of the most popular datasets in the HDC literature.
arXiv Detail & Related papers (2023-04-24T19:16:03Z) - HEAT: A Highly Efficient and Affordable Training System for
Collaborative Filtering Based Recommendation on CPUs [11.007606356081435]
Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation.
No prior work has optimized SimpleX on multi-core CPUs, leaving its performance limited.
We propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs.
arXiv Detail & Related papers (2023-04-14T18:07:26Z) - PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z) - Distributed Out-of-Memory NMF on CPU/GPU Architectures [1.0051474951635875]
We propose an efficient out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for HPC systems.
Benchmark results show significant improvement of 32x to 76x speedup with the new implementation using GPU over the CPU-based NMFk.
arXiv Detail & Related papers (2022-02-19T03:49:21Z) - OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI
Libraries on HPC Systems [1.066106854070245]
OMB-Py is the first communication benchmark suite for parallel Python applications.
OMB-Py consists of a variety of point-to-point and collective communication benchmark tests.
We report up to 106x speedup on 224 CPU cores compared to sequential execution.
arXiv Detail & Related papers (2021-10-20T16:59:14Z) - Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z) - DistGNN: Scalable Distributed Training for Large-Scale Graph Neural
Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z) - Efficient Execution of Quantized Deep Learning Models: A Compiler
Approach [6.616902691349208]
A growing number of applications implement predictive functions using deep learning models.
Deep learning frameworks such as TFLite, MXNet, and PyTorch enable developers to quantize models with only a small drop in accuracy.
However, these frameworks are not well suited to executing quantized models on a variety of hardware platforms.
arXiv Detail & Related papers (2020-06-18T01:38:10Z)