Efficient Execution of Quantized Deep Learning Models: A Compiler
Approach
- URL: http://arxiv.org/abs/2006.10226v1
- Date: Thu, 18 Jun 2020 01:38:10 GMT
- Title: Efficient Execution of Quantized Deep Learning Models: A Compiler
Approach
- Authors: Animesh Jain, Shoubhik Bhattacharya, Masahiro Masuda, Vin Sharma and
Yida Wang
- Abstract summary: A growing number of applications implement predictive functions using deep learning models.
Deep learning frameworks such as TFLite, MXNet, and PyTorch enable developers to quantize models with only a small drop in accuracy.
However, these frameworks are not well suited to executing quantized models on a variety of hardware platforms.
- Score: 6.616902691349208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A growing number of applications implement predictive functions using deep
learning models, which require heavy use of compute and memory. One popular
technique for increasing resource efficiency is 8-bit integer quantization, in
which 32-bit floating point numbers (fp32) are represented using shorter 8-bit
integer numbers. Although deep learning frameworks such as TensorFlow, TFLite,
MXNet, and PyTorch enable developers to quantize models with only a small drop
in accuracy, they are not well suited to execute quantized models on a variety
of hardware platforms. For example, TFLite is optimized to run inference on ARM
CPU edge devices but it does not have efficient support for Intel CPUs and
Nvidia GPUs. In this paper, we address the challenges of executing quantized
deep learning models on diverse hardware platforms by proposing an augmented
compiler approach. A deep learning compiler such as Apache TVM can enable the
efficient execution of models from various frameworks on various targets. Many
deep learning compilers today, however, are designed primarily for fp32
computation and cannot optimize a pre-quantized INT8 model. To address this
issue, we created a new dialect called Quantized Neural Network (QNN) that
extends the compiler's internal representation with a quantization context.
With this quantization context, the compiler can generate efficient code for
pre-quantized models on various hardware platforms. As implemented in Apache
TVM, we observe that the QNN-augmented deep learning compiler achieves speedups
of 2.35x, 2.15x, 1.35x, and 1.40x on Intel Xeon Cascade Lake CPUs, Nvidia Tesla
T4 GPUs, and ARM Raspberry Pi 3 and Pi 4, respectively, against well-optimized fp32
execution, and comparable performance to the state-of-the-art
framework-specific solutions.
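For readers unfamiliar with the quantization context mentioned above, the mapping between fp32 and int8 values is an affine one defined by a scale and a zero point. The NumPy sketch below only illustrates that arithmetic under standard int8 quantization assumptions; it is not the QNN dialect or its actual lowering in Apache TVM.

```python
import numpy as np

def quantize(x_fp32, scale, zero_point):
    # Affine int8 quantization: q = clip(round(x / scale) + zero_point, -128, 127)
    q = np.round(x_fp32 / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q_int8, scale, zero_point):
    # Inverse mapping back to fp32: x ~= (q - zero_point) * scale
    return (q_int8.astype(np.int32) - zero_point).astype(np.float32) * scale

# Toy tensor roughly covering [-1, 1], quantized symmetrically (zero_point = 0).
x = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
scale, zero_point = 1.0 / 127.0, 0
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
print(q, x_hat)  # x_hat matches x up to rounding error
```

The QNN dialect described in the abstract attaches exactly this kind of context (scales and zero points) to operators in the compiler's internal representation, which is what lets the compiler lower pre-quantized operators into efficient integer code for each target.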
Related papers
- DeepliteRT: Computer Vision at the Edge [40.44316688055993]
DeepliteRT is an end-to-end solution for compilation, tuning, and inference of ultra low-bit models on ARM devices.
We analyze the performance of DeepliteRT on classification and detection models against optimized 32-bit floating-point, 8-bit integer, and 2-bit baselines.
arXiv Detail & Related papers (2023-09-19T18:58:38Z)
- Compressed Real Numbers for AI: a case-study using a RISC-V CPU [2.0516276923852415]
We focus on two families of formats that have achieved interesting results in compressing binary32 numbers in machine learning applications.
We propose a way to decompress a tensor of bfloat/posits just before computations.
arXiv Detail & Related papers (2023-09-11T07:54:28Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- HDCC: A Hyperdimensional Computing compiler for classification on embedded systems and high-performance computing [58.720142291102135]
This work introduces HDCC, the first open-source compiler that translates high-level descriptions of HDC classification methods into optimized C code.
HDCC is designed like a modern compiler, featuring an intuitive and descriptive input language, an intermediate representation (IR), and a retargetable backend.
To substantiate these claims, we conducted experiments with HDCC on several of the most popular datasets in the HDC literature.
arXiv Detail & Related papers (2023-04-24T19:16:03Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings: E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa).
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states (see the block-wise quantization sketch after this list).
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
- Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More [26.748770505062378]
SLIDE is a C++ implementation of a sparse hash table based back-propagation algorithm.
We show how SLIDE's computations allow for a unique possibility of vectorization via AVX-512 (Advanced Vector Extensions).
Our experiments are focused on large (hundreds of millions of parameters) recommendation and NLP models.
arXiv Detail & Related papers (2021-03-06T02:13:43Z)
- FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference [1.1292678337479967]
FBGEMM is a high-performance kernel library for quantized inference on current generation CPUs.
FBGEMM achieves efficiency by fusing common quantization operations with a high-performance GEMM implementation and by shape- and size-specific kernel code generation at runtime.
The library has been deployed at Facebook, where it delivers greater than 2x performance gains with respect to our current production baseline.
arXiv Detail & Related papers (2021-01-13T00:34:04Z)
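As a follow-up to the 8-bit optimizer entry above, the sketch below illustrates block-wise quantization of an optimizer state tensor in NumPy. It is a simplified, assumption-based illustration using plain per-block absmax int8 scaling; the paper itself uses a more elaborate dynamic quantization scheme.

```python
import numpy as np

def blockwise_quantize(state, block_size=2048):
    # Quantize a 1-D fp32 state tensor to int8 in independent blocks.
    # Each block keeps its own absmax scale, so an outlier only hurts
    # the precision of its own block rather than the whole tensor.
    pad = (-len(state)) % block_size
    blocks = np.pad(state.astype(np.float32), (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales * 127), -127, 127).astype(np.int8)
    return q, scales

def blockwise_dequantize(q, scales, length):
    # Recover an approximate fp32 state from int8 blocks and per-block scales.
    return (q.astype(np.float32) / 127.0 * scales).reshape(-1)[:length]

state = np.random.randn(5000).astype(np.float32)  # stand-in for an optimizer state tensor
q, scales = blockwise_quantize(state)
approx = blockwise_dequantize(q, scales, len(state))
print(np.max(np.abs(state - approx)))  # small per-block quantization error
```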