ALT: Breaking the Wall between Graph and Operator Level Optimizations
for Deep Learning Compilation
- URL: http://arxiv.org/abs/2210.12415v2
- Date: Tue, 25 Oct 2022 05:28:51 GMT
- Title: ALT: Breaking the Wall between Graph and Operator Level Optimizations
for Deep Learning Compilation
- Authors: Zhiying Xu, Jiafan Xu, Hongding Peng, Wei Wang, Xiaoliang Wang, Haoran
Wan, Haipeng Dai, Yixu Xu, Hao Cheng, Kun Wang, Guihai Chen
- Abstract summary: ALT is a compiler that performs joint graph- and operator-level optimizations for deep models.
ALT significantly outperforms state-of-the-art compilers (e.g., Ansor) in terms of both single-operator performance and end-to-end inference performance.
- Score: 38.8918502461244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models rely on highly optimized tensor libraries for efficient
inference on heterogeneous hardware. Current deep compilers typically
predetermine layouts of tensors and then optimize loops of operators. However,
such a unidirectional and one-off workflow strictly separates graph-level
optimization and operator-level optimization into different system layers,
missing opportunities for unified tuning. This paper proposes ALT, a compiler
that performs joint graph- and operator-level optimizations for deep models.
ALT provides a generic transformation module to manipulate layouts and loops
with easy-to-use primitive functions. ALT further integrates an auto-tuning
module that jointly optimizes graph-level data layouts and operator-level loops
while guaranteeing efficiency. Experimental results show that ALT significantly
outperforms state-of-the-art compilers (e.g., Ansor) in terms of both
single-operator performance (e.g., 1.5x speedup on average) and end-to-end
inference performance (e.g., 1.4x speedup on average).
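To make the joint-tuning idea above concrete, here is a minimal, hypothetical sketch (not ALT's actual API): it co-searches a graph-level data layout and an operator-level loop tile for a small NumPy matrix multiply instead of fixing the layout first and tuning loops afterwards. The function names, layout options, and tile sizes are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical illustration of joint layout/loop tuning (not ALT's actual API).
import itertools
import time

import numpy as np


def matmul_with_layout(a, b, layout, tile):
    """Compute C = A @ B with B stored in the chosen layout and a given tile size."""
    n = a.shape[0]
    # Graph-level choice: the data layout of B ("NN" keeps it as-is,
    # "NT" stores B transposed).
    b_stored = b.T.copy() if layout == "NT" else b.copy()
    c = np.zeros((n, n), dtype=a.dtype)
    # Operator-level choice: tiling of the reduction loop.
    for k0 in range(0, n, tile):
        k1 = min(k0 + tile, n)
        if layout == "NT":
            c += a[:, k0:k1] @ b_stored[:, k0:k1].T
        else:
            c += a[:, k0:k1] @ b_stored[k0:k1, :]
    return c


def joint_tune(n=512, trials=3):
    """Co-tune (layout, tile) together instead of predetermining the layout."""
    rng = np.random.default_rng(0)
    a, b = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    best = None
    for layout, tile in itertools.product(["NN", "NT"], [32, 64, 128]):
        start = time.perf_counter()
        for _ in range(trials):
            matmul_with_layout(a, b, layout, tile)
        cost = (time.perf_counter() - start) / trials
        if best is None or cost < best[0]:
            best = (cost, layout, tile)
    return best


if __name__ == "__main__":
    cost, layout, tile = joint_tune()
    print(f"best joint choice: layout={layout}, tile={tile}, {cost * 1e3:.2f} ms")
```

The only point of the sketch is that the layout choice and the loop tiling are evaluated together, which is the coupling the abstract argues current compilers miss by splitting the two decisions across system layers.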
Related papers
- Two Optimizers Are Better Than One: LLM Catalyst Empowers Gradient-Based Optimization for Prompt Tuning [69.95292905263393]
We show that gradient-based optimization and large language models (LLMs) are complementary to each other, suggesting a collaborative optimization approach.
Our code is released at https://www.guozix.com/guozix/LLM-catalyst.
arXiv Detail & Related papers (2024-05-30T06:24:14Z) - Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural bandits Coupled with Transformers [66.823588073584]
Large language models (LLMs) have shown remarkable instruction-following capabilities and achieved impressive performances in various applications.
Recent work has used the query-efficient Bayesian optimization (BO) algorithm to automatically optimize the instructions given to black-box LLMs.
We propose a neural bandit algorithm which replaces the GP in BO by an NN surrogate to optimize instructions for black-box LLMs.
arXiv Detail & Related papers (2023-10-02T02:01:16Z) - Performance Embeddings: A Similarity-based Approach to Automatic
Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - Learning to Generalize Provably in Learning to Optimize [185.71326306329678]
Learning to optimize (L2O) has gained increasing popularity; it automates the design of optimizers through data-driven approaches.
Current L2O methods often suffer from poor generalization performance in at least two respects.
We propose to incorporate these two metrics as flatness-aware regularizers into the L2O framework.
arXiv Detail & Related papers (2023-02-22T01:17:31Z) - Slapo: A Schedule Language for Progressive Optimization of Large Deep
Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z) - oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep
Learning Compilation [8.64220475114214]
oneDNN Graph Compiler employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high performance code generation.
Experimental results demonstrate significant performance gains over existing tensor compilers and primitives libraries on performance-critical computation graphs.
arXiv Detail & Related papers (2023-01-03T19:52:17Z) - AGO: Boosting Mobile AI Inference Performance by Removing Constraints on
Graph Optimization [6.4284258345779435]
AGO is a framework for graph optimization with arbitrary structures to boost the inference performance of deep models.
We propose intensive operator fusion to stitch multiple complex operators together for better performance.
We show that our system can improve the inference performance by up to 3.3x when compared with state-of-the-art deep compilers.
arXiv Detail & Related papers (2022-12-02T07:16:49Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates (a minimal sketch of this interface appears after this list).
We open source our learned optimizers, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - Static Neural Compiler Optimization via Deep Reinforcement Learning [1.458855293397494]
In this paper, we employ a deep reinforcement learning approach to the phase-ordering problem.
Provided with sub-sequences constituting LLVM's O3 sequence, our agent learns to outperform the O3 sequence on the set of source codes used for training.
We believe that the models trained using our approach can be integrated into modern compilers as neural optimization agents.
arXiv Detail & Related papers (2020-08-20T13:16:29Z) - Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware
Multifaceted Optimizations [15.659251804042748]
Woodpecker-DL (WPK) is a hardware-aware deep learning framework.
WPK uses graph optimization, automated searches, a domain-specific language (DSL), and system-level exploration to accelerate inference.
We show that on a P100 GPU, WPK achieves speedups of 5.40x over cuDNN and 1.63x over TVM on individual operators, and runs up to 1.18x faster than TensorRT for end-to-end model inference.
arXiv Detail & Related papers (2020-08-11T07:50:34Z)
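Below is a minimal, hypothetical sketch of the learned-optimizer interface described in the VeLO entry above: a small per-parameter MLP ingests gradients and momentum and emits parameter updates. The class name, the feature choice, and the hand-added descent term are illustrative assumptions; in VeLO the network weights come from large-scale meta-training rather than the random initialization used here.

```python
# Hypothetical sketch of a learned-optimizer interface (not VeLO's code).
import numpy as np


class TinyLearnedOptimizer:
    def __init__(self, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        # Two-layer MLP applied independently to every parameter.
        # In VeLO these weights would come from meta-training; here they are random.
        self.w1 = rng.standard_normal((2, hidden)) * 0.1
        self.w2 = rng.standard_normal((hidden, 1)) * 0.1
        self.momentum = None

    def step(self, params, grads, beta=0.9, scale=1e-2):
        if self.momentum is None:
            self.momentum = np.zeros_like(params)
        self.momentum = beta * self.momentum + (1 - beta) * grads
        # Per-parameter features: the current gradient and its momentum.
        feats = np.stack([grads, self.momentum], axis=-1)   # (..., 2)
        hidden = np.tanh(feats @ self.w1)                    # (..., hidden)
        update = (hidden @ self.w2)[..., 0]                  # (...,)
        # A hand-added plain-descent term keeps this toy run stable even though
        # the MLP is untrained; a meta-trained optimizer would not need it.
        return params - scale * (grads + update)


if __name__ == "__main__":
    opt = TinyLearnedOptimizer()
    x = np.array([3.0, -2.0])
    for _ in range(200):
        grad = 2.0 * x          # gradient of the quadratic f(x) = ||x||^2
        x = opt.step(x, grad)
    print("final iterate:", x)  # should end up close to the origin
```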