Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware
Multifaceted Optimizations
- URL: http://arxiv.org/abs/2008.04567v1
- Date: Tue, 11 Aug 2020 07:50:34 GMT
- Title: Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware
Multifaceted Optimizations
- Authors: Yongchao Liu, Yue Jin, Yong Chen, Teng Teng, Hang Ou, Rui Zhao, Yao
Zhang
- Abstract summary: Woodpecker-DL (WPK) is a hardware-aware deep learning framework.
WPK uses graph optimization, automated searches, domain-specific language (DSL) compiler techniques, and system-level exploration to accelerate inference.
We show that on a Tesla P100 GPU, we can achieve a maximum speedup of 5.40 over cuDNN and 1.63 over TVM on individual operators, and run up to 1.18 times faster than TensorRT for end-to-end model inference.
- Score: 15.659251804042748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accelerating deep model training and inference is crucial in practice.
Existing deep learning frameworks usually concentrate on optimizing training
speed and pay less attention to inference-specific optimizations. In fact,
model inference differs from training in terms of computation: for example,
parameters are refreshed at each gradient update step during training but kept
invariant during inference. These special characteristics of model inference open new
opportunities for its optimization. In this paper, we propose a hardware-aware
optimization framework, namely Woodpecker-DL (WPK), to accelerate inference by
taking advantage of multiple joint optimizations from the perspectives of graph
optimization, automated searches, domain-specific language (DSL) compiler
techniques, and system-level exploration. In WPK, we investigate two new
automated search approaches, based on a genetic algorithm and on reinforcement
learning respectively, to hunt for the best operator code configurations for
the target hardware. A customized DSL compiler is attached to these search
algorithms to generate efficient code. To create an optimized inference plan,
WPK systematically explores high-speed operator implementations from
third-party libraries in addition to our automatically generated code and
singles out the best implementation of each operator for use. Extensive
experiments demonstrate that on a Tesla P100 GPU, we can achieve a maximum speedup of
5.40 over cuDNN and 1.63 over TVM on individual convolution operators, and run
up to 1.18 times faster than TensorRT for end-to-end model inference.
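The two mechanisms the abstract highlights, an automated search over per-operator code configurations and a system-level pass that keeps the fastest implementation of each operator, can be pictured with a short sketch. The snippet below is a minimal illustration under stated assumptions, not WPK's implementation: the Config fields, the measure_latency cost model, and the candidate backend latencies are hypothetical stand-ins for compiling a DSL-generated kernel and timing it on the target GPU, and the paper's reinforcement-learning search variant is not modeled here.

```python
# Sketch only: a toy genetic-algorithm search over operator tiling
# configurations plus a per-operator "best implementation" selection step.
# All names and numbers below are illustrative assumptions, not WPK's code.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    tile_x: int   # thread-block tile width (hypothetical knob)
    tile_y: int   # thread-block tile height (hypothetical knob)
    unroll: int   # inner-loop unroll factor (hypothetical knob)

SEARCH_SPACE = {
    "tile_x": [8, 16, 32, 64],
    "tile_y": [8, 16, 32, 64],
    "unroll": [1, 2, 4, 8],
}

def measure_latency(cfg: Config) -> float:
    """Placeholder cost model; a real system would compile the generated
    kernel and time it on the target GPU instead."""
    return abs(cfg.tile_x * cfg.tile_y - 512) / 512 + 0.1 / cfg.unroll

def random_config() -> Config:
    return Config(*(random.choice(v) for v in SEARCH_SPACE.values()))

def mutate(cfg: Config) -> Config:
    fields = dict(tile_x=cfg.tile_x, tile_y=cfg.tile_y, unroll=cfg.unroll)
    key = random.choice(list(fields))
    fields[key] = random.choice(SEARCH_SPACE[key])
    return Config(**fields)

def crossover(a: Config, b: Config) -> Config:
    return Config(tile_x=random.choice([a.tile_x, b.tile_x]),
                  tile_y=random.choice([a.tile_y, b.tile_y]),
                  unroll=random.choice([a.unroll, b.unroll]))

def ga_search(generations: int = 20, pop_size: int = 16) -> Config:
    """Evolve operator configurations toward the lowest measured latency."""
    population = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=measure_latency)
        parents = population[: pop_size // 2]          # keep the fittest half
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return min(population, key=measure_latency)

# System-level exploration: compare the tuned generated kernel against
# third-party candidates and keep the fastest implementation per operator.
best_cfg = ga_search()
CANDIDATE_BACKENDS = {
    "generated_dsl_kernel": measure_latency(best_cfg),
    "vendor_library": 0.42,   # hypothetical measured latencies for
    "graph_compiler": 0.37,   # third-party implementations
}
winner = min(CANDIDATE_BACKENDS, key=CANDIDATE_BACKENDS.get)
print(f"best config: {best_cfg}, chosen backend: {winner}")
```

The point the sketch tries to convey is that both stages reduce to empirical measurement: the genetic algorithm evolves configurations against measured latency, and the inference plan simply keeps whichever candidate, generated or third-party, runs fastest for each operator.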
Related papers
- Leveraging Reinforcement Learning and Large Language Models for Code
Optimization [14.602997316032706]
This paper introduces a new framework to decrease the complexity of code optimization.
The proposed framework builds on large language models (LLMs) and reinforcement learning (RL).
We run several experiments on the PIE dataset using a CodeT5 language model and RRHF, a new reinforcement learning algorithm.
arXiv Detail & Related papers (2023-12-09T19:50:23Z) - Slapo: A Schedule Language for Progressive Optimization of Large Deep
Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z) - Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of over 77,000 competitive C++ programming submission pairs with performance-improving edits made by human programmers.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought; for finetuning, we use performance-conditioned generation and synthetic data augmentation based on self-play.
arXiv Detail & Related papers (2023-02-15T18:59:21Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizers, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - Learning to Optimize Quasi-Newton Methods [22.504971951262004]
This paper introduces a novel machine learning optimizer called LODO, which tries to online meta-learn the best preconditioner during optimization.
Unlike other L2O methods, LODO does not require any meta-training on a training task distribution.
We show that our learned preconditioner approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians.
arXiv Detail & Related papers (2022-10-11T03:47:14Z) - dPRO: A Generic Profiling and Optimization System for Expediting
Distributed DNN Training [12.413533491501548]
This paper proposes dPRO, a tool to identify performance bottlenecks in distributed training systems.
We implement dPRO on multiple deep learning frameworks (PyTorch, MXNet) and representative communication schemes (AllReduce and Parameter Server architectures).
Extensive experiments show that dPRO predicts the performance of distributed training in various settings with 5% error in most cases and finds optimization strategies that yield up to 87.1% speed-up over the baselines.
arXiv Detail & Related papers (2022-05-05T07:15:25Z) - Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time
Mobile Acceleration [71.80326738527734]
We propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations.
We show that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework.
arXiv Detail & Related papers (2021-11-22T23:53:14Z) - Learning to Superoptimize Real-world Programs [79.4140991035247]
We propose a framework to learn to superoptimize real-world programs by using neural sequence-to-sequence models.
We introduce the Big Assembly benchmark, a dataset consisting of over 25K real-world functions mined from open-source projects in x86-64 assembly.
arXiv Detail & Related papers (2021-09-28T05:33:21Z) - Learning to Optimize: A Primer and A Benchmark [94.29436694770953]
Learning to optimize (L2O) is an emerging approach that leverages machine learning to develop optimization methods.
This article is poised to be the first comprehensive survey and benchmark of L2O for continuous optimization.
arXiv Detail & Related papers (2021-03-23T20:46:20Z) - Self-Directed Online Machine Learning for Topology Optimization [58.920693413667216]
Self-directed Online Learning Optimization integrates Deep Neural Network (DNN) with Finite Element Method (FEM) calculations.
Our algorithm was tested on four types of problems including compliance minimization, fluid-structure optimization, heat transfer enhancement and truss optimization.
It reduced the computational time by 2 to 5 orders of magnitude compared with directly using these methods, and outperformed all state-of-the-art algorithms tested in our experiments.
arXiv Detail & Related papers (2020-02-04T20:00:28Z)