Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor
Programs
- URL: http://arxiv.org/abs/2210.09603v1
- Date: Tue, 18 Oct 2022 05:32:13 GMT
- Title: Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor
Programs
- Authors: Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, Gennady
Pekhimenko
- Abstract summary: We propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering.
With the proposed paradigm, we implement a deep learning compiler - Hidet.
- Score: 11.338285393619042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As deep learning models nowadays are widely adopted by both cloud services
and edge devices, the latency of deep learning model inferences becomes crucial
to provide efficient model serving. However, it is challenging to develop
efficient tensor programs for deep learning operators due to the high
complexity of modern accelerators (e.g., NVIDIA GPUs and Google TPUs) and the
rapidly growing number of operators. Deep learning compilers, such as Apache
TVM, adopt declarative scheduling primitives to lower the bar of developing
tensor programs. However, we show that this approach is insufficient to cover
state-of-the-art tensor program optimizations (e.g., double buffering). In this
paper, we propose to embed the scheduling process into tensor programs and use
dedicated mappings, called task mappings, to define the computation assignment
and ordering directly in the tensor programs. This new approach greatly
enriches the expressible optimizations by allowing developers to manipulate
tensor programs at a much finer granularity (e.g., allowing program
statement-level optimizations). We call the proposed method the
task-mapping-oriented programming paradigm. With the proposed paradigm, we
implement a deep learning compiler - Hidet. Extensive experiments on modern
convolution and transformer models show that Hidet outperforms the state-of-the-art
DNN inference framework ONNX Runtime and the compiler TVM equipped with the
schedulers AutoTVM and Ansor by up to 1.48x (1.22x on average) with enriched
optimizations. It also reduces the tuning time by 20x and 11x compared with
AutoTVM and Ansor, respectively.
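To make the paradigm concrete, below is a minimal, self-contained Python sketch of what a task mapping could look like. It is not Hidet's actual API: the `TaskMapping`, `spatial`, and `repeat` names and the composition rule are illustrative assumptions, whereas the real compiler embeds such mappings in GPU tensor programs where the workers are threads, warps, and thread blocks.

```python
# Conceptual sketch only (not the Hidet API): a task mapping assigns tasks
# (index tuples in a task grid) to workers and fixes the order in which each
# worker executes its tasks, directly inside the program.
from itertools import product


class TaskMapping:
    """Maps a worker id to an ordered list of task index tuples."""

    def __init__(self, num_workers, task_shape, assign_fn):
        self.num_workers = num_workers
        self.task_shape = tuple(task_shape)
        self._assign = assign_fn  # worker id -> ordered list of task tuples

    def tasks_of(self, worker):
        return self._assign(worker)

    def __mul__(self, other):
        # Compose: the outer mapping chooses a tile of the task grid, the
        # inner mapping enumerates the tasks within that tile.
        shape = tuple(a * b for a, b in zip(self.task_shape, other.task_shape))

        def composed(worker):
            outer_w, inner_w = divmod(worker, other.num_workers)
            return [tuple(o * s + i for o, i, s in zip(ot, it, other.task_shape))
                    for ot in self.tasks_of(outer_w)
                    for it in other.tasks_of(inner_w)]

        return TaskMapping(self.num_workers * other.num_workers, shape, composed)


def spatial(*shape):
    """Each worker handles exactly one task of an n-D task grid."""
    def assign(worker):
        idx = []
        for dim in reversed(shape):
            worker, r = divmod(worker, dim)
            idx.append(r)
        return [tuple(reversed(idx))]

    num_workers = 1
    for dim in shape:
        num_workers *= dim
    return TaskMapping(num_workers, shape, assign)


def repeat(*shape):
    """A single worker iterates over every task of an n-D grid, in order."""
    return TaskMapping(1, shape, lambda _w: list(product(*map(range, shape))))


if __name__ == "__main__":
    # 8 workers (think: 8 threads) cover a 4x8 task grid: spatial(2, 4) lays
    # the workers over 2x4 tiles, and repeat(2, 2) makes each worker walk a
    # 2x2 sub-tile in a fixed order.
    mapping = spatial(2, 4) * repeat(2, 2)
    for w in range(mapping.num_workers):
        print(f"worker {w}: {mapping.tasks_of(w)}")
```

Because the worker-to-task assignment and execution order are explicit values inside the program rather than the output of a separate declarative schedule, statement-level transformations such as the double buffering mentioned in the abstract can be written directly where they apply.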
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers [6.194917248699324]
This paper proposes a new technique for deep learning compilers called FTuner.
Experiments show that FTuner can achieve operator and end-to-end performance comparable to vendor libraries.
arXiv Detail & Related papers (2024-07-31T08:05:33Z)
- Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z)
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which, in contrast, optimizes task-specific decoder networks on the output side.
By gradient-based optimization, DecT can be trained within several seconds and requires only one PLM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
- HARL: Hierarchical Adaptive Reinforcement Learning Based Auto Scheduler for Neural Networks [51.71682428015139]
We propose HARL, a reinforcement learning-based auto-scheduler for efficient tensor program exploration.
HARL improves the tensor operator performance by 22% and the search speed by 4.3x compared to the state-of-the-art auto-scheduler.
Inference performance and search speed are also significantly improved on end-to-end neural networks.
arXiv Detail & Related papers (2022-11-21T04:15:27Z)
- TensorIR: An Abstraction for Automatic Tensorized Program Optimization [22.812702519665617]
We present TensorIR, a compiler abstraction for optimizing programs with tensor computation primitives.
We build an end-to-end framework on top of our compilation to automatically optimize deep learning models for given tensor computation primitives.
arXiv Detail & Related papers (2022-07-09T16:28:57Z)
- A Learned Performance Model for Tensor Processing Units [5.733911161090224]
We demonstrate a method of learning performance models from a corpus of graph programs for Tensor Processing Unit (TPU) instances.
We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks.
It helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
arXiv Detail & Related papers (2020-08-03T17:24:52Z)
- Ansor: Generating High-Performance Tensor Programs for Deep Learning [45.437816016043534]
We present Ansor, a tensor program generation framework for deep learning applications.
Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space.
Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches.
arXiv Detail & Related papers (2020-06-11T19:40:09Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)