Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs
- URL: http://arxiv.org/abs/2312.10351v2
- Date: Thu, 30 May 2024 08:01:06 GMT
- Title: Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs
- Authors: Aodong Chen, Fei Xu, Li Han, Yuan Dong, Li Chen, Zhi Zhou, Fangming Liu
- Abstract summary: Opara is a resource- and interference-aware scheduling framework to accelerate Deep Neural Network (DNN) inference on GPUs.
We implement and open-source a prototype of Opara based on PyTorch in a non-intrusive manner.
Prototype experiments with representative DNN and Transformer-based models demonstrate that Opara outperforms the default sequential CUDA Graph in PyTorch.
- Score: 20.506357657234755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GPUs have become the \emph{de facto} hardware devices for accelerating Deep Neural Network (DNN) inference workloads. However, the conventional \emph{sequential execution mode of DNN operators} in mainstream deep learning frameworks cannot fully utilize GPU resources, even with operator fusion enabled, due to the increasing complexity of model structures and a greater diversity of operators. Moreover, an \emph{inadequate operator launch order} in parallelized execution scenarios can lead to GPU resource wastage and unexpected performance interference among operators. In this paper, we propose \emph{Opara}, a resource- and interference-aware DNN \underline{Op}erator \underline{para}llel scheduling framework to accelerate DNN inference on GPUs. Specifically, \emph{Opara} first employs \texttt{CUDA Streams} and \texttt{CUDA Graph} to \emph{parallelize} the execution of multiple operators automatically. To further expedite DNN inference, \emph{Opara} leverages the resource demands of operators to judiciously adjust the operator launch order on GPUs, overlapping the execution of compute-intensive and memory-intensive operators. We implement and open-source a prototype of \emph{Opara} based on PyTorch in a \emph{non-intrusive} manner. Extensive prototype experiments with representative DNN and Transformer-based models demonstrate that \emph{Opara} outperforms the default sequential \texttt{CUDA Graph} in PyTorch and the state-of-the-art operator parallelism systems by up to $1.68\times$ and $1.29\times$, respectively, yet with acceptable runtime overhead.
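To make the mechanism concrete, the following is a minimal sketch of the multi-stream CUDA Graph capture pattern the abstract describes, written against public PyTorch APIs. The matmul and elementwise add are illustrative stand-ins for independent compute-intensive and memory-intensive DNN operators; this is not Opara's actual code or scheduling policy.

```python
import torch

# Requires a CUDA GPU; CUDA Graph capture needs PyTorch >= 1.10.
assert torch.cuda.is_available()

x  = torch.randn(1024, 1024, device="cuda")
w1 = torch.randn(1024, 1024, device="cuda")
w2 = torch.randn(1024, 1024, device="cuda")
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

# Warm up outside capture so lazy cuBLAS/cuDNN initialization
# does not happen mid-capture.
warmup = torch.cuda.Stream()
with torch.cuda.stream(warmup):
    x @ w1
    x + w2
torch.cuda.current_stream().wait_stream(warmup)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    main = torch.cuda.current_stream()  # the stream doing the capture
    s1.wait_stream(main)                # fork both branches off the capture stream
    s2.wait_stream(main)
    with torch.cuda.stream(s1):
        y1 = x @ w1                     # compute-intensive operator
    with torch.cuda.stream(s2):
        y2 = x + w2                     # memory-intensive operator
    main.wait_stream(s1)                # join both branches before capture ends
    main.wait_stream(s2)
    out = y1 + y2

g.replay()  # one launch replays the whole multi-stream DAG
```

Because the fork/join dependencies are recorded as graph edges, the two operators can run concurrently on every replay; Opara's contribution on top of this mechanism is deciding automatically which operators go on which streams and in what launch order.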
Related papers
- Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs [1.8911962184174564]
We propose a cheaper bilinear operator as an alternative to matrix multiplication in deep neural networks (DNNs).
We show that replacing all linear layers with STL and training from scratch yields a 2.7x reduction in FLOPs with a 0.5% accuracy improvement.
Fine-tuning TinyLlama [tinyllama24] with STL layers on the SlimPajama dataset achieves accuracy similar to 2:4 sparsity, with a 2.2x FLOP speedup compared to 1.7x for the latter.
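The summary does not spell out the STL operator itself, so the sketch below only illustrates the mechanics of the "replace all linear layers" step: it walks a PyTorch model and swaps every nn.Linear for a drop-in module, with a hypothetical low-rank stand-in in place of the paper's actual bilinear operator.

```python
import torch
import torch.nn as nn

class STLStandIn(nn.Module):
    # Placeholder for the paper's STL operator (its definition is not given in
    # this summary); a low-rank factorization so the sketch runs end to end
    # and uses fewer FLOPs than the dense matmul it replaces.
    def __init__(self, in_features: int, out_features: int, rank: int = 64):
        super().__init__()
        self.down = nn.Parameter(torch.randn(in_features, rank) / in_features ** 0.5)
        self.up = nn.Parameter(torch.randn(rank, out_features) / rank ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.down) @ self.up

def replace_linears(module: nn.Module) -> None:
    """Recursively swap every nn.Linear for the stand-in operator."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, STLStandIn(child.in_features, child.out_features))
        else:
            replace_linears(child)

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
replace_linears(model)
print(model)  # both Linear layers replaced; the paper then trains from scratch
```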
arXiv Detail & Related papers (2025-03-15T17:31:36Z)
- AMPLE: Event-Driven Accelerator for Mixed-Precision Inference of Graph Neural Networks [6.4509395505998235]
Graph Neural Networks (GNNs) have recently gained attention due to their performance on non-Euclidean data.
We introduce AMPLE (Accelerated Message Passing Logic Engine), an FPGA accelerator leveraging a new event-driven programming flow.
We develop a mixed-arithmetic architecture, enabling GNN inference to be quantized at a node-level granularity.
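As a rough software illustration of node-level mixed precision (AMPLE itself is an FPGA datapath), the sketch below fake-quantizes each node's feature row to its own bit-width; the degree-based bit policy is an invented example, not the paper's.

```python
import torch

def quantize_per_node(feats: torch.Tensor, bits: torch.Tensor) -> torch.Tensor:
    """Symmetric fake-quantization with a separate scale and bit-width per node."""
    qmax = (2.0 ** (bits.float() - 1) - 1).unsqueeze(1)               # [N, 1]
    scale = feats.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(feats / scale)
    q = torch.maximum(torch.minimum(q, qmax), -qmax)                  # clip to range
    return q * scale

feats = torch.randn(100, 16)                      # 100 nodes, 16 features each
degree = torch.randint(1, 10, (100,))
# Hypothetical policy: give high-degree (influential) nodes more bits.
bits = torch.where(degree > 5, torch.tensor(8), torch.tensor(4))
qfeats = quantize_per_node(feats, bits)
```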
arXiv Detail & Related papers (2025-02-28T16:14:16Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
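In geo-distributed training, adaptive compression usually means shrinking gradients before they cross slow wide-area links. Below is a generic top-k gradient sparsification sketch, not FusionLLM's actual scheme; the ratio is where an adaptive policy, driven by measured bandwidth, would plug in.

```python
import torch

def topk_compress(grad: torch.Tensor, ratio: float):
    """Keep only the largest-magnitude entries; ship (values, indices)."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, idx = torch.topk(flat.abs(), k)
    return flat[idx], idx, grad.shape

def decompress(vals, idx, shape):
    out = torch.zeros(shape).flatten()
    out[idx] = vals
    return out.reshape(shape)

g = torch.randn(1024, 1024)
ratio = 0.01  # an adaptive policy would raise/lower this with link bandwidth
vals, idx, shape = topk_compress(g, ratio)
g_hat = decompress(vals, idx, shape)   # ~1% of the bytes cross the network
```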
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for multicore NPUs, called Tensor Slicing Optimization (TSO).
Experimental results show that TSO identifies the tensor slicing that minimizes execution time for a set of CNN models.
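A toy version of that search, assuming a stand-in cost model in place of TSO's measured timings: enumerate candidate slice counts, discard slicings that overflow per-core on-chip memory, and keep the fastest.

```python
import math

def tso_like_search(tensor_bytes, cores, sram_bytes, time_of):
    """Pick the slice count with the lowest modeled execution time,
    subject to each slice fitting in per-core SRAM."""
    best = None
    for slices in range(cores, 16 * cores + 1, cores):   # multiples of core count
        per_slice = math.ceil(tensor_bytes / slices)
        if per_slice > sram_bytes:
            continue                                     # slice must fit on chip
        t = time_of(slices, per_slice)
        if best is None or t < best[1]:
            best = (slices, t)
    return best

# Toy cost model: per-slice transfer shrinks with more slices, launch overhead grows.
print(tso_like_search(tensor_bytes=8 << 20, cores=4, sram_bytes=1 << 20,
                      time_of=lambda s, b: b / 1e9 + 5e-6 * s))
```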
arXiv Detail & Related papers (2023-04-06T12:03:03Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
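For context, this is what snnTorch's basic leaky integrate-and-fire (LIF) neuron looks like in use; the hyperparameters are illustrative and nothing below is IPU-specific.

```python
import torch
import snntorch as snn

lif = snn.Leaky(beta=0.9)            # beta = membrane potential decay per step
mem = lif.init_leaky()               # initial membrane state
inputs = torch.rand(100, 1) * 0.5    # 100 time steps of input current

spikes = []
for t in range(inputs.shape[0]):
    spk, mem = lif(inputs[t], mem)   # one step of the LIF dynamics
    spikes.append(spk)

print(torch.stack(spikes).sum())     # total spike count over the run
```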
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Dynamic Split Computing for Efficient Deep Edge Intelligence [78.4233915447056]
We introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel.
We show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time.
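A minimal sketch of that selection step, under a hypothetical latency model with illustrative numbers: the device re-runs the search whenever the measured channel rate or server load changes.

```python
def best_split(device_ms, server_ms, out_bytes, rate_bps, server_load):
    """Pick k so that layers [0, k) run on-device, the activation at the
    split is uploaded, and layers [k, n) run on the (loaded) server."""
    n = len(device_ms)
    best_k, best_t = 0, float("inf")
    for k in range(n + 1):
        t = (sum(device_ms[:k])
             + out_bytes[k] * 8 / rate_bps * 1000        # upload time in ms
             + server_load * sum(server_ms[k:]))
        if t < best_t:
            best_k, best_t = k, t
    return best_k, best_t

k, t = best_split(device_ms=[5, 5, 8, 12], server_ms=[1, 1, 2, 3],
                  out_bytes=[602112, 401408, 200704, 100352, 4000],
                  rate_bps=20e6, server_load=1.5)
print(f"split after layer {k}: {t:.1f} ms end to end")
```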
arXiv Detail & Related papers (2022-05-23T12:35:18Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and compressing bits via soft policy iterations.
With a latency- and accuracy-aware reward design, the scheme adapts well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average a 3712x speedup with 1301.25x energy reduction over CPUs, and a 35.4x speedup with 17.66x energy reduction over GPUs.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
- IOS: Inter-Operator Scheduler for CNN Acceleration [17.509887924568435]
We propose Inter-Operator Scheduler (IOS) to automatically schedule multiple operators' parallel execution.
IOS consistently outperforms state-of-the-art libraries (e.g., TensorRT) by 1.1x to 1.5x on modern CNN benchmarks.
arXiv Detail & Related papers (2020-11-02T20:42:26Z)
- Efficient Algorithms for Device Placement of DNN Graph Operators [12.871398348743591]
Modern machine learning workloads use large models with complex structures that are very expensive to execute.
The devices that execute these models are becoming increasingly heterogeneous, as domain-specific accelerators flourish alongside CPUs.
Recent work has shown that significant gains can be obtained with model parallelism, i.e., partitioning a neural network's computational graph onto multiple devices.
In this paper, we identify and isolate the structured optimization problem at the core of device placement of DNN operators, for both inference and training, especially in modern pipelined settings.
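One classic instance of such a structured problem is splitting a layer chain into contiguous pipeline stages so that the slowest stage, the pipeline's steady-state bottleneck, is as fast as possible. The dynamic program below is a textbook formulation under toy assumptions (no communication or memory terms), not the paper's algorithm.

```python
from functools import lru_cache

def min_bottleneck_partition(layer_ms, devices):
    """Minimize the max per-stage time over contiguous partitions."""
    n = len(layer_ms)

    @lru_cache(maxsize=None)
    def solve(i, d):                        # place layers i..n-1 on d devices
        if d == 1:
            return sum(layer_ms[i:])
        best = float("inf")
        for j in range(i + 1, n - d + 2):   # first stage = layers i..j-1
            best = min(best, max(sum(layer_ms[i:j]), solve(j, d - 1)))
        return best

    return solve(0, devices)

print(min_bottleneck_partition([4, 9, 3, 7, 6, 2], devices=3))  # -> 13
```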
arXiv Detail & Related papers (2020-06-29T22:45:01Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency.
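A rough sketch of the pattern idea: each 3x3 kernel keeps the mask from a small pattern bank that preserves the most weight magnitude, so all kernels share one of a few sparsity structures the compiler can specialize for. The pattern bank here is hypothetical, not PatDNN's designed set.

```python
import torch

# A tiny bank of hypothetical 3x3 pruning patterns (1 = keep the weight).
PATTERNS = torch.tensor([
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[1, 1, 0], [1, 1, 0], [0, 0, 1]],
    [[0, 0, 1], [0, 1, 1], [1, 1, 0]],
], dtype=torch.float32)

def apply_best_pattern(weight: torch.Tensor) -> torch.Tensor:
    """Mask every 3x3 kernel with the pattern keeping the most |weight|."""
    out_c, in_c, kh, kw = weight.shape
    kernels = weight.reshape(-1, kh, kw)                        # [K, 3, 3]
    scores = torch.einsum("kij,pij->kp", kernels.abs(), PATTERNS)
    choice = scores.argmax(dim=1)                               # best pattern per kernel
    return (kernels * PATTERNS[choice]).reshape(out_c, in_c, kh, kw)

w = torch.randn(16, 8, 3, 3)       # a conv layer's weights
w_pruned = apply_best_pattern(w)   # every kernel now matches one bank pattern
```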
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.