DeepliteRT: Computer Vision at the Edge
- URL: http://arxiv.org/abs/2309.10878v1
- Date: Tue, 19 Sep 2023 18:58:38 GMT
- Title: DeepliteRT: Computer Vision at the Edge
- Authors: Saad Ashfaq, Alexander Hoffman, Saptarshi Mitra, Sudhakar Sah,
MohammadHossein AskariHemmat, Ehsan Saboori
- Abstract summary: DeepliteRT is an end-to-end solution for compilation, tuning, and inference of ultra low-bit models on ARM devices.
We analyze the performance of DeepliteRT on classification and detection models against optimized 32-bit floating-point, 8-bit integer, and 2-bit baselines.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The proliferation of edge devices has unlocked unprecedented opportunities
for deep learning model deployment in computer vision applications. However,
these complex models require considerable power, memory and compute resources
that are typically not available on edge platforms. Ultra low-bit quantization
presents an attractive solution to this problem by scaling down the model
weights and activations from 32-bit to less than 8-bit. We implement highly
optimized ultra low-bit convolution operators for ARM-based targets that
outperform existing methods by up to 4.34x. Our operator is implemented within
Deeplite Runtime (DeepliteRT), an end-to-end solution for the compilation,
tuning, and inference of ultra low-bit models on ARM devices. Compiler passes
in DeepliteRT automatically convert a fake-quantized model in full precision to
a compact ultra low-bit representation, easing the process of quantized model
deployment on commodity hardware. We analyze the performance of DeepliteRT on
classification and detection models against optimized 32-bit floating-point,
8-bit integer, and 2-bit baselines, achieving significant speedups of up to
2.20x, 2.33x and 2.17x, respectively.
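As a rough illustration of the conversion such compiler passes perform (a minimal NumPy sketch, not DeepliteRT's actual implementation): float32 weights that already sit on a 2-bit grid after fake quantization can be re-derived as 2-bit codes and bit-packed four to a byte.

```python
import numpy as np

def fake_quantize_2bit(w, scale):
    # Simulate 2-bit quantization in float32: weights stay on a 4-level grid.
    q = np.clip(np.round(w / scale), -2, 1)   # signed 2-bit range {-2, -1, 0, 1}
    return q * scale                          # still float32 ("fake" quantized)

def pack_2bit(w_fake, scale):
    # Recover the 2-bit codes and bit-pack four of them into each byte.
    q = (np.clip(np.round(w_fake / scale), -2, 1) + 2).astype(np.uint8)  # map to {0..3}
    q = q.reshape(-1, 4)
    return (q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)).astype(np.uint8)

w = np.random.randn(64).astype(np.float32)
scale = np.abs(w).max() / 2
w_fq = fake_quantize_2bit(w, scale)
print(pack_2bit(w_fq, scale).nbytes, "bytes packed vs", w.nbytes, "bytes in float32")
```

The packed buffer is the kind of compact representation a low-bit convolution kernel would consume; the actual layout and bitwidths in DeepliteRT are chosen by its compiler passes and tuned kernels.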
Related papers
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions as low as 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
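A rough sketch of the Dense-and-Sparse idea (thresholding and storage details are simplified relative to the paper): the small fraction of largest-magnitude weights is pulled out into a sparse full-precision matrix, and only the remaining dense part is quantized to ultra-low precision.

```python
import numpy as np
from scipy import sparse

def dense_and_sparse_split(W, outlier_pct=0.5):
    # Keep the top outlier_pct% of weights (by magnitude) in a sparse fp matrix;
    # the dense remainder is what would be quantized to 3-4 bits.
    thresh = np.percentile(np.abs(W), 100 - outlier_pct)
    outlier_mask = np.abs(W) >= thresh
    sparse_outliers = sparse.csr_matrix(np.where(outlier_mask, W, 0.0))
    dense_part = np.where(outlier_mask, 0.0, W)
    return dense_part, sparse_outliers

W = np.random.randn(128, 128).astype(np.float32)
dense, outliers = dense_and_sparse_split(W)
assert np.allclose(dense + outliers.toarray(), W)  # exact decomposition
```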
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Binary and Ternary Natural Language Generation [24.295815261826153]
Ternary and binary neural networks enable multiplication-free computation.
They promise multiple orders of magnitude efficiency gains over full-precision networks.
However, such networks have proven very difficult to optimize.
We show the first ternary and binary transformer models on the downstream tasks of summarization and machine translation.
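A toy example of why such networks are multiplication-free (not the paper's transformer kernels): with weights restricted to {-1, 0, +1}, a dot product reduces to additions, subtractions, and skips.

```python
import numpy as np

def ternary_dot(w_ternary, x):
    # Additions for +1 weights, subtractions for -1 weights, skips for 0.
    return x[w_ternary == 1].sum() - x[w_ternary == -1].sum()

x = np.random.randn(256).astype(np.float32)
w = np.random.choice([-1, 0, 1], size=256).astype(np.int8)
assert np.isclose(ternary_dot(w, x), np.dot(w.astype(np.float32), x), atol=1e-3)
```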
arXiv Detail & Related papers (2023-06-02T18:01:02Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
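A simplified scalar illustration of the lookup-table idea (DeepGEMM's actual kernels vectorize this with SIMD table lookups): with 2-bit weights and 4-bit activations there are only 4 x 16 possible products, so they can be precomputed once and the inner loop becomes lookups and additions.

```python
import numpy as np

# Illustrative LUT-based dot product for 2-bit weights x 4-bit activations.
W_LEVELS = np.array([-2, -1, 0, 1], dtype=np.int32)            # 2-bit weight values
LUT = np.array([[w * a for a in range(16)] for w in W_LEVELS],
               dtype=np.int32)                                  # shape (4, 16): all products

def lut_dot(w_codes, a_codes):
    # w_codes in [0, 4), a_codes in [0, 16): accumulate precomputed products.
    return LUT[w_codes, a_codes].sum()

w_codes = np.random.randint(0, 4, size=128)
a_codes = np.random.randint(0, 16, size=128)
assert lut_dot(w_codes, a_codes) == np.dot(W_LEVELS[w_codes], a_codes)
```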
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
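The size/precision trade-off behind these experiments is simple arithmetic (this sketch ignores quantization metadata such as scales and zero-points):

```python
def model_size_mb(num_params, bits_per_param):
    # Final model size scales with parameter count and bits per parameter.
    return num_params * bits_per_param / 8 / 1e6

for bits in (16, 8, 4, 3):   # e.g. a 1B-parameter model at different precisions
    print(f"{bits}-bit: {model_size_mb(1_000_000_000, bits):.0f} MB")
```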
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime [57.5143536744084]
The high performance of deep learning models comes at the expense of high computational, storage, and power requirements.
We introduce Deeplite Neutrino for production-ready optimization of models and Deeplite Runtime for deployment of ultra-low bit quantized models on Arm-based platforms.
arXiv Detail & Related papers (2022-07-18T15:05:17Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
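A simplified sketch of block-wise absmax quantization of an optimizer state (the paper additionally uses dynamic quantization maps and a stable embedding layer):

```python
import numpy as np

def blockwise_quantize(state, block_size=256):
    # Quantize each block independently with its own absmax scale.
    blocks = state.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12   # per-block scale
    q = np.round(blocks / scales * 127).astype(np.int8)          # 8-bit codes
    return q, scales

def blockwise_dequantize(q, scales):
    return (q.astype(np.float32) / 127 * scales).reshape(-1)

adam_v = np.abs(np.random.randn(4096)).astype(np.float32)        # e.g. Adam's squared-sum state
q, s = blockwise_quantize(adam_v)
err = np.abs(blockwise_dequantize(q, s) - adam_v).max()
print("max abs error:", err, "| memory:", q.nbytes + s.nbytes, "vs", adam_v.nbytes, "bytes")
```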
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference [1.1292678337479967]
FBGEMM is a high-performance kernel library for quantized inference on current-generation CPUs.
FBGEMM achieves efficiency by fusing common quantization operations with a high-performance GEMM implementation and by shape- and size-specific kernel code generation at runtime.
The library has been deployed at Facebook, where it delivers greater than 2x performance gains with respect to our current production baseline.
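As an illustration of the kind of operation that gets fused (a NumPy sketch of the math only; FBGEMM performs this inside its JIT-generated kernels rather than as separate passes): the int32 GEMM accumulator is requantized back to int8 as an epilogue.

```python
import numpy as np

def int8_gemm_with_fused_requant(A_q, B_q, a_scale, b_scale, out_scale):
    # int8 GEMM accumulated in int32, with requantization applied to the accumulator.
    acc = A_q.astype(np.int32) @ B_q.astype(np.int32)
    requant_scale = a_scale * b_scale / out_scale
    return np.clip(np.round(acc * requant_scale), -128, 127).astype(np.int8)

A_q = np.random.randint(-128, 128, size=(8, 64), dtype=np.int8)
B_q = np.random.randint(-128, 128, size=(64, 16), dtype=np.int8)
C_q = int8_gemm_with_fused_requant(A_q, B_q, a_scale=0.02, b_scale=0.01, out_scale=0.1)
print(C_q.shape, C_q.dtype)
```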
arXiv Detail & Related papers (2021-01-13T00:34:04Z)
- Efficient Execution of Quantized Deep Learning Models: A Compiler Approach [6.616902691349208]
A growing number of applications implement predictive functions using deep learning models.
Deep learning frameworks such as TFLite, MXNet, and PyTorch enable developers to quantize models with only a small drop in accuracy.
However, these frameworks are not well suited to executing quantized models on a variety of hardware platforms.
arXiv Detail & Related papers (2020-06-18T01:38:10Z)