IOS: Inter-Operator Scheduler for CNN Acceleration
- URL: http://arxiv.org/abs/2011.01302v2
- Date: Sat, 6 Mar 2021 16:32:25 GMT
- Title: IOS: Inter-Operator Scheduler for CNN Acceleration
- Authors: Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko and Song Han
- Abstract summary: We propose Inter-Operator Scheduler (IOS) to automatically schedule multiple operators' parallel execution.
IOS consistently outperforms state-of-the-art libraries (e.g., TensorRT) by 1.1 to 1.5x on modern CNN benchmarks.
- Score: 17.509887924568435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To accelerate CNN inference, existing deep learning frameworks focus on
optimizing intra-operator parallelization. However, a single operator can no
longer fully utilize the available parallelism given the rapid advances in
high-performance hardware, resulting in a large gap between the peak
performance and the real performance. This performance gap is more severe under
smaller batch sizes. In this work, we extensively study the parallelism between
operators and propose Inter-Operator Scheduler (IOS) to automatically schedule
multiple operators' parallel execution through a novel dynamic programming
algorithm. IOS consistently outperforms state-of-the-art libraries (e.g.,
TensorRT) by 1.1 to 1.5x on modern CNN benchmarks. The code to reproduce each
experiment is available at:
https://github.com/mit-han-lab/inter-operator-scheduler.
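The core of IOS is a dynamic program over how to group operators into concurrently executed stages. The snippet below is a minimal, simplified sketch of that idea, not the authors' implementation (which is in the repository above): it partitions a toy operator DAG into parallel stages and memoizes the best schedule for each set of already-executed operators. The graph, the latency table, and the max-latency stage cost model are illustrative assumptions; IOS instead profiles candidate stages on the target GPU.

```python
# Sketch of dynamic-programming-based inter-operator scheduling, in the
# spirit of IOS (not the authors' code). It partitions a small operator DAG
# into sequential "stages" whose operators run in parallel, minimizing an
# estimated total latency. Graph, latencies, and cost model are assumptions.
from functools import lru_cache
from itertools import combinations

# Toy DAG: op -> set of predecessors (parallel branches of an
# Inception-style block). Latencies are made-up per-operator costs.
PREDS = {
    "conv1x1": set(),
    "conv3x3": set(),
    "conv5x5": set(),
    "concat":  {"conv1x1", "conv3x3", "conv5x5"},
}
LATENCY = {"conv1x1": 1.0, "conv3x3": 2.0, "conv5x5": 3.0, "concat": 0.5}
ALL_OPS = frozenset(PREDS)

def stage_latency(stage):
    # Assumed cost model: a parallel stage is bounded by its slowest operator
    # (IOS instead measures candidate stages on the actual GPU).
    return max(LATENCY[op] for op in stage)

def ready_ops(done):
    # Operators whose predecessors have all been scheduled already.
    return [op for op in ALL_OPS - done if PREDS[op] <= done]

@lru_cache(maxsize=None)
def best_schedule(done):
    # done: frozenset of operators already executed.
    # Returns (latency of the remaining operators, list of stages).
    if done == ALL_OPS:
        return 0.0, []
    ready = ready_ops(done)
    best = (float("inf"), None)
    # Enumerate every non-empty subset of ready ops as the next stage.
    for k in range(1, len(ready) + 1):
        for stage in combinations(ready, k):
            rest_cost, rest_stages = best_schedule(done | frozenset(stage))
            cost = stage_latency(stage) + rest_cost
            if cost < best[0]:
                best = (cost, [list(stage)] + rest_stages)
    return best

if __name__ == "__main__":
    total, stages = best_schedule(frozenset())
    print("estimated latency:", total)
    print("stages:", stages)  # e.g. one stage running the three convs together
```

On this toy block the schedule that runs the three convolutions in a single stage wins, which mirrors the paper's observation that inter-operator parallelism matters most when individual operators cannot saturate the device, e.g., at small batch sizes.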
Related papers
- RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything [117.02741621686677]
This work explores a novel real-time segmentation setting called real-time multi-purpose segmentation.
It contains three fundamental sub-tasks: interactive segmentation, panoptic segmentation, and video instance segmentation.
We present a novel dynamic convolution-based method, Real-Time Multi-Purpose SAM (RMP-SAM).
It contains an efficient encoder and an efficient decoupled adapter to perform prompt-driven decoding.
arXiv Detail & Related papers (2024-01-18T18:59:30Z) - Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs [20.506357657234755]
Opara is a resource- and interference-aware scheduling framework to accelerate Deep Neural Network (DNN) inference on GPUs.
We implement and open-source a prototype of Opara based on PyTorch in a non-intrusive manner.
Prototype experiments with representative DNN and Transformer-based models demonstrate that Opara outperforms the default sequential CUDA Graph in PyTorch (a minimal stream-based sketch of this kind of operator parallelism appears after this list).
arXiv Detail & Related papers (2023-12-16T06:48:11Z) - Automatic Task Parallelization of Dataflow Graphs in ML/DL models [0.0]
We present a Linear Clustering approach to exploit inherent parallel paths in ML dataflow graphs.
We generate readable and executable parallel PyTorch+Python code from input ML models in ONNX format.
Preliminary results on several ML graphs demonstrate up to 1.9x speedup over serial execution.
arXiv Detail & Related papers (2023-08-22T04:54:30Z) - Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z) - Parallel Algorithms Align with Neural Execution [7.535219325248997]
Parallel algorithms, however, may exploit their full computational power and therefore require fewer layers to be executed.
This drastically reduces training times, as we observe when comparing parallel implementations of searching, sorting and finding strongly connected components to their sequential counterparts on the CLRS framework.
arXiv Detail & Related papers (2023-07-08T21:28:20Z) - PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z) - RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves a better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z) - Distributed Deep Learning Inference Acceleration using Seamless Collaboration in Edge Computing [93.67044879636093]
This paper studies inference acceleration using distributed convolutional neural networks (CNNs) in collaborative edge computing.
We design a novel task collaboration scheme, named HALP, in which the overlapping zone of the sub-tasks on secondary edge servers (ESs) is executed on the host ES.
Experimental results show that HALP can accelerate CNN inference in VGG-16 by 1.7-2.0x for a single task and 1.7-1.8x for 4 tasks per batch on GTX 1080TI and JETSON AGX Xavier.
arXiv Detail & Related papers (2022-07-22T18:39:09Z) - AEGNN: Asynchronous Event-based Graph Neural Networks [54.528926463775946]
Event-based Graph Neural Networks generalize standard GNNs to process events as "evolving" spatio-temporal graphs.
AEGNNs are easily trained on synchronous inputs and can be converted to efficient, "asynchronous" networks at test time.
arXiv Detail & Related papers (2022-03-31T16:21:12Z) - Dynamic Multi-Branch Layers for On-Device Neural Machine Translation [53.637479651600586]
We propose to improve the performance of on-device neural machine translation (NMT) systems with dynamic multi-branch layers.
Specifically, we design a layer-wise dynamic multi-branch network with only one branch activated during training and inference.
At almost the same computational cost, our method achieves improvements of up to 1.7 BLEU points on the WMT14 English-German translation task and 1.8 BLEU points on the WMT20 Chinese-English translation task.
arXiv Detail & Related papers (2021-05-14T07:32:53Z) - Parallel, Self Organizing, Consensus Neural Networks [0.2578242050187029]
A new neural network architecture (PSCNN) is developed to improve the performance and speed of neural networks.
PSCNN shows superior performance in all cases studied.
arXiv Detail & Related papers (2020-07-30T21:02:10Z) - Efficient Algorithms for Device Placement of DNN Graph Operators [12.871398348743591]
Modern machine learning workloads use large models with complex structures that are very expensive to execute.
The devices that execute complex models are becoming increasingly heterogeneous, with a flourishing of domain-specific hardware accelerators offered in addition to CPUs.
Recent work has shown that significant gains can be obtained with model parallelism, i.e., partitioning a neural network's computational graph onto multiple devices.
In this paper, we identify and isolate the structured optimization problem at the core of device placement of DNN operators, for both inference and training, especially in modern pipelined settings.
arXiv Detail & Related papers (2020-06-29T22:45:01Z)
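As referenced in the Opara entry above, the snippet below is a minimal, hypothetical illustration of the idea shared by IOS and Opara: launching independent operators on separate CUDA streams so the GPU can overlap them, rather than executing them one after another. It uses only standard PyTorch stream APIs; the layer shapes, batch size, and branch structure are made up for illustration, and neither system reduces to this (IOS searches schedules with dynamic programming, while Opara additionally builds CUDA Graphs and accounts for resource interference).

```python
# Hypothetical sketch of inter-operator parallelism on a GPU: two independent
# CNN branches are enqueued on separate CUDA streams so they may overlap,
# instead of running back-to-back on the default stream. Not the IOS or
# Opara implementation; shapes and sizes are illustrative assumptions.
import torch

assert torch.cuda.is_available(), "requires a CUDA-capable GPU"

# Two independent branches, e.g. parallel paths of an Inception-style block.
branch_a = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda().eval()
branch_b = torch.nn.Conv2d(64, 64, kernel_size=1).cuda().eval()
x = torch.randn(1, 64, 56, 56, device="cuda")  # small batch under-utilizes the GPU

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

torch.cuda.synchronize()  # ensure x is materialized before forking streams
with torch.no_grad():
    with torch.cuda.stream(stream_a):
        out_a = branch_a(x)   # enqueued on stream_a
    with torch.cuda.stream(stream_b):
        out_b = branch_b(x)   # enqueued on stream_b, may overlap with stream_a
torch.cuda.synchronize()      # join both streams before consuming the results

result = torch.cat([out_a, out_b], dim=1)
print(result.shape)
```

Whether the two branches actually overlap depends on how much of the GPU each kernel occupies, which is precisely why IOS relies on profiling-guided search and Opara on interference awareness rather than assuming concurrency always helps.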