A Learned Performance Model for Tensor Processing Units
- URL: http://arxiv.org/abs/2008.01040v2
- Date: Thu, 18 Mar 2021 04:49:15 GMT
- Title: A Learned Performance Model for Tensor Processing Units
- Authors: Samuel J. Kaufman, Phitchaya Mangpo Phothilimthana, Yanqi Zhou,
Charith Mendis, Sudip Roy, Amit Sabne, and Mike Burrows
- Abstract summary: We demonstrate a method of learning performance models from a corpus of graph programs for Processing Unit (TPU) instances.
We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks.
It helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
- Score: 5.733911161090224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate hardware performance models are critical to efficient code
generation. They can be used by compilers to make heuristic decisions, by
superoptimizers as a minimization objective, or by autotuners to find an
optimal configuration for a specific program. However, they are difficult to
develop because contemporary processors are complex, and the recent
proliferation of deep learning accelerators has increased the development
burden. We demonstrate a method of learning performance models from a corpus of
tensor computation graph programs for Tensor Processing Unit (TPU) instances.
We show that our learned model outperforms a heavily-optimized analytical
performance model on two tasks -- tile-size selection and operator fusion --
and that it helps an autotuner discover faster programs in a setting where
access to TPUs is limited or expensive.
Related papers
- Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI.
As the number of model parameters reaches to hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z) - Machine Learning Insides OptVerse AI Solver: Design Principles and
Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problem.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z) - Leveraging Reinforcement Learning and Large Language Models for Code
Optimization [14.602997316032706]
This paper introduces a new framework to decrease the complexity of code optimization.
The proposed framework builds on large language models (LLMs) and reinforcement learning (RL)
We run several experiments on the PIE dataset using a CodeT5 language model and RRHF, a new reinforcement learning algorithm.
arXiv Detail & Related papers (2023-12-09T19:50:23Z) - Cheaply Evaluating Inference Efficiency Metrics for Autoregressive
Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z) - Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance.
Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
arXiv Detail & Related papers (2023-04-25T21:49:09Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) Expressing the computational core using Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - Slapo: A Schedule Language for Progressive Optimization of Large Deep
Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z) - oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep
Learning Compilation [8.64220475114214]
oneDNN Graph Compiler employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high performance code generation.
Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical computation graphs.
arXiv Detail & Related papers (2023-01-03T19:52:17Z) - Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor
Programs [11.338285393619042]
We propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering.
With the proposed paradigm, we implement a deep learning compiler - Hidet.
arXiv Detail & Related papers (2022-10-18T05:32:13Z) - TensorIR: An Abstraction for Automatic Tensorized Program Optimization [22.812702519665617]
We presentIR, a compiler for optimizing programs with tensor computation primitives.
We build an end-to-end framework on top of our compilation to automatically optimize deep learning models for given tensor computation primitives.
arXiv Detail & Related papers (2022-07-09T16:28:57Z) - Towards Optimal VPU Compiler Cost Modeling by using Neural Networks to
Infer Hardware Performances [58.720142291102135]
'VPUNN' is a neural network-based cost model trained on low-level task profiling.
It consistently outperforms the state-of-the-art cost modeling in Intel's line of VPU processors.
arXiv Detail & Related papers (2022-05-09T22:48:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.