Scaled Quantization for the Vision Transformer
- URL: http://arxiv.org/abs/2303.13601v1
- Date: Thu, 23 Mar 2023 18:31:21 GMT
- Title: Scaled Quantization for the Vision Transformer
- Authors: Yangyang Chang and Gerald E. Sobelman
- Abstract summary: Quantization using a small number of bits shows promise for reducing latency and memory usage in deep neural networks.
This paper proposes a robust method for the full integer quantization of vision transformer networks without requiring any intermediate floating-point computations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Quantization using a small number of bits shows promise for reducing latency
and memory usage in deep neural networks. However, most quantization methods
cannot readily handle complicated functions such as exponential and square
root, and prior approaches involve complex training processes that must
interact with floating-point values. This paper proposes a robust method for
the full integer quantization of vision transformer networks without requiring
any intermediate floating-point computations. The quantization techniques can
be applied in various hardware or software implementations, including
processor/memory architectures and FPGAs.
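The claim that exponentials and square roots can be handled without intermediate floating-point computation can be illustrated with ordinary fixed-point arithmetic. The sketch below is not the paper's algorithm; it is a minimal, generic example (the Q16.16 format, lookup-table size, and function names are illustrative assumptions) showing an integer-only Newton iteration for the square root and a shift-plus-table approximation of the exponential.

```python
# Minimal integer-only sketch (illustrative, not the paper's method):
# evaluating sqrt and exp with no floating-point arithmetic at run time.

FRAC_BITS = 16                  # Q16.16 fixed-point format
ONE = 1 << FRAC_BITS
# Table of 2**(k/256) in Q16.16, built once offline (the only place floats appear).
EXP2_LUT = [round((2.0 ** (k / 256.0)) * ONE) for k in range(256)]
LOG2E = round(1.4426950408889634 * ONE)   # log2(e) in Q16.16


def int_sqrt(n: int) -> int:
    """floor(sqrt(n)) for non-negative integers via Newton's method."""
    if n < 2:
        return n
    x = 1 << ((n.bit_length() + 1) // 2)  # initial guess >= sqrt(n)
    while True:
        y = (x + n // x) // 2
        if y >= x:
            return x
        x = y


def fxp_exp(x: int) -> int:
    """Approximate e**x for x >= 0 given in Q16.16; result is in Q16.16."""
    z = (x * LOG2E) >> FRAC_BITS                   # convert to a base-2 exponent
    int_part = z >> FRAC_BITS                      # whole powers of two -> bit shift
    frac_idx = (z & (ONE - 1)) >> (FRAC_BITS - 8)  # top 8 fractional bits -> table
    return EXP2_LUT[frac_idx] << int_part


print(int_sqrt(1_000_000))        # 1000
print(fxp_exp(ONE) / ONE)         # ~2.72 (float division only for display)
```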
Related papers
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevalent backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
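As a rough illustration of the non-uniform, logarithm-based quantization idea mentioned above: the NumPy sketch below is a plain fixed-base power-of-two quantizer, not the adaptive-base AdaLog algorithm, and the function names and bit-width are assumptions.

```python
import numpy as np

def log2_quantize(x, n_bits=4):
    """Quantize positive activations to powers of two: x ~= scale * 2**(-q).
    Fixed base-2 version; AdaLog additionally adapts the logarithm base."""
    scale = float(x.max())
    q_max = 2 ** n_bits - 1
    q = np.round(-np.log2(np.maximum(x, 1e-12) / scale))
    return np.clip(q, 0, q_max).astype(np.int32), scale

def log2_dequantize(q, scale):
    return scale * 2.0 ** (-q.astype(np.float64))

# usage on softmax-like activations, the typical target of log quantizers
x = np.random.dirichlet(np.ones(197), size=8)   # positive, long-tailed rows
q, s = log2_quantize(x)
print(np.abs(x - log2_dequantize(q, s)).max())
```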
- PIPE: Parallelized Inference Through Post-Training Quantization Ensembling of Residual Expansions [23.1120983784623]
PIPE is a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization.
It achieves superior performance on every benchmarked application (from vision to NLP tasks), architecture (ConvNets, transformers) and bit-width.
arXiv Detail & Related papers (2023-11-27T13:29:34Z)
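The residual-expansion idea in the PIPE summary above can be sketched generically: quantize the weights, quantize the error that is left over, and keep adding quantized correction terms, each of which can run as an independent parallel branch. The NumPy code below shows only that core expansion under assumed bit-widths and shapes; it omits the group sparsity and ensembling machinery the paper adds.

```python
import numpy as np

def quantize_sym(w, n_bits=4):
    """Symmetric uniform quantization: returns integer codes and the scale."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / q_max
    q = np.clip(np.round(w / scale), -q_max, q_max).astype(np.int32)
    return q, scale

def residual_expansion(w, n_bits=4, order=3):
    """Expand w as a sum of quantized terms; each term quantizes the residual
    error left by the previous ones."""
    terms, residual = [], w.copy()
    for _ in range(order):
        q, s = quantize_sym(residual, n_bits)
        terms.append((q, s))
        residual = residual - q * s
    return terms

# usage: the reconstruction error shrinks as the expansion order grows
w = np.random.randn(256, 256).astype(np.float32)
for k in range(1, 4):
    approx = sum(q * s for q, s in residual_expansion(w, order=k))
    print(k, np.abs(w - approx).mean())
```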
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we build an On-Chip Quantization Aware pipeline that lets the quantization process observe the actual hardware efficiency of each quantization operator.
For accuracy metrics, we propose a Mask-Guided Quantization Estimation technique to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- A Practical Mixed Precision Algorithm for Post-Training Quantization [15.391257986051249]
Mixed-precision quantization is a promising solution to find a better performance-efficiency trade-off than homogeneous quantization.
We present a simple post-training mixed precision algorithm that only requires a small unlabeled calibration dataset.
We show that we can find mixed precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
arXiv Detail & Related papers (2023-02-10T17:47:54Z)
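One way to picture a post-training mixed-precision search like the one summarized above: score how much each layer suffers at each candidate bit-width and greedily lower the precision of the least sensitive layers until a budget is met. The sketch below uses plain weight-reconstruction error as a stand-in for the calibration-data sensitivity the paper measures; the names, candidate bit-widths, and budget are illustrative assumptions.

```python
import numpy as np

def quant_error(w, n_bits):
    """Mean error of symmetric uniform quantization at a given bit-width."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / q_max
    return np.abs(w - np.clip(np.round(w / scale), -q_max, q_max) * scale).mean()

def assign_bit_widths(layers, budget_bits, candidates=(8, 6, 4)):
    """Greedy mixed-precision assignment: repeatedly lower the layer whose
    next step down in precision adds the least error, until the average
    bit-width meets the budget."""
    bits = {name: candidates[0] for name in layers}
    while sum(bits.values()) / len(bits) > budget_bits:
        best_name, best_delta = None, None
        for name, w in layers.items():
            idx = candidates.index(bits[name])
            if idx + 1 == len(candidates):
                continue                      # already at the lowest precision
            delta = quant_error(w, candidates[idx + 1]) - quant_error(w, bits[name])
            if best_delta is None or delta < best_delta:
                best_name, best_delta = name, delta
        if best_name is None:
            break                             # every layer is already minimal
        bits[best_name] = candidates[candidates.index(bits[best_name]) + 1]
    return bits

# usage with random stand-ins for per-layer weights
layers = {f"layer{i}": np.random.randn(64, 64) for i in range(6)}
print(assign_bit_widths(layers, budget_bits=6))
```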
- REx: Data-Free Residual Quantization Error Expansion [32.87131159997359]
Deep neural networks (DNNs) are ubiquitous in computer vision and natural language processing, but suffer from high inference cost.
With the growing concerns on privacy rights, we focus our efforts on data-free methods.
We propose REx, a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization.
arXiv Detail & Related papers (2022-03-28T11:04:45Z)
- ZippyPoint: Fast Interest Point Detection, Description, and Matching through Mixed Precision Discretization [71.91942002659795]
We investigate and adapt network quantization techniques to accelerate inference and enable its use on compute limited platforms.
ZippyPoint, our efficient quantized network with binary descriptors, improves network runtime speed and descriptor matching speed while reducing the 3D model size.
These improvements come at the cost of a minor degradation in performance, as evaluated on the tasks of homography estimation, visual localization, and map-free visual relocalization.
arXiv Detail & Related papers (2022-03-07T18:59:03Z)
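The matching speed-up that binary descriptors enable, mentioned in the ZippyPoint summary above, comes from replacing floating-point dot products with XOR-and-bit-count Hamming distances. The NumPy sketch below illustrates only that brute-force matching step under assumed descriptor sizes; it is not ZippyPoint's detection or description pipeline.

```python
import numpy as np

def hamming_match(desc_a, desc_b):
    """Brute-force match of packed binary descriptors (uint8, shape [N, D/8]).
    XOR plus bit counting stands in for the float dot products used with
    real-valued descriptors, which is where the matching speed-up comes from."""
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]          # pairwise XOR
    dist = np.unpackbits(xor, axis=-1).sum(axis=-1)        # differing bits
    return dist.argmin(axis=1), dist.min(axis=1)

# usage: two sets of 256-bit descriptors packed into 32 bytes each
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(500, 32), dtype=np.uint8)
b = rng.integers(0, 256, size=(500, 32), dtype=np.uint8)
matches, distances = hamming_match(a, b)
```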
- Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We can obtain 81.29% top-1 accuracy using the DeiT-B model on the ImageNet dataset with approximately 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
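The encoding idea in the summary above can be pictured with a generic bit-plane identity: any odd integer code in [-(2^M - 1), 2^M - 1] equals a weighted sum of M tensors whose entries are all -1 or +1, so one M-bit matrix product splits into M binary ones. The sketch below demonstrates only this decomposition identity; the paper's training and acceleration scheme is not reproduced, and the shapes and bit-width are assumptions.

```python
import numpy as np

def decompose_pm1(q, n_bits):
    """Split odd integer codes q in [-(2**n_bits - 1), 2**n_bits - 1] into
    n_bits tensors with entries in {-1, +1} such that q = sum_i 2**i * b_i."""
    offset = (q + (2 ** n_bits - 1)) // 2        # shift to unsigned bit codes
    return [2 * ((offset >> i) & 1) - 1 for i in range(n_bits)]

# usage: a matmul with 3-bit weights becomes three binary matmuls
n_bits = 3
q = 2 * np.random.randint(-4, 4, size=(8, 8)) + 1   # odd codes in [-7, 7]
x = np.random.randint(-8, 8, size=(4, 8))
branches = decompose_pm1(q, n_bits)
full = x @ q
multi = sum((1 << i) * (x @ b) for i, b in enumerate(branches))
assert np.array_equal(full, multi)               # the two paths agree exactly
```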
- Optimal qubit assignment and routing via integer programming [0.22940141855172028]
We consider the problem of mapping a logical quantum circuit onto a given hardware with limited two-qubit connectivity.
We model this problem as an integer linear program, using a network flow formulation with binary variables.
We consider several cost functions: an approximation of the fidelity of the circuit, its total depth, and a measure of cross-talk.
arXiv Detail & Related papers (2021-06-11T15:02:26Z)
- Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
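A minimal way to picture the pruning-plus-quantization interplay studied above: mask out small-magnitude weights, then fake-quantize what remains, and use that tensor in the forward pass. The NumPy sketch below shows a single such forward step with assumed sparsity and bit-width; a real quantization-aware pruning run repeats this every training step and backpropagates through the rounding with a straight-through estimator.

```python
import numpy as np

def prune_mask(w, sparsity=0.5):
    """Keep the largest-magnitude weights; zero out the given fraction."""
    threshold = np.quantile(np.abs(w), sparsity)
    return (np.abs(w) >= threshold).astype(w.dtype)

def fake_quantize(w, n_bits=6):
    """Simulate integer quantization in the forward pass (round, clip, rescale)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / q_max
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

# one forward pass through jointly pruned and quantized weights
w = np.random.randn(128, 64).astype(np.float32)
x = np.random.randn(32, 128).astype(np.float32)
w_eff = fake_quantize(w * prune_mask(w, sparsity=0.5), n_bits=6)
y = x @ w_eff
```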
- Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation [4.638764944415326]
Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput.
We focus on quantization techniques that are amenable to acceleration by processors with high-throughput integer math pipelines.
We present a workflow for 8-bit quantization that is able to maintain accuracy within 1% of the floating-point baseline on all networks studied.
arXiv Detail & Related papers (2020-04-20T19:59:22Z)
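To make the 8-bit workflow above concrete, here is a minimal NumPy sketch of symmetric, max-calibrated int8 quantization with an int32 accumulator and a single rescale, the standard pattern such integer pipelines accelerate. Calibration choices, shapes, and function names are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def calibrate_scale(x, n_bits=8):
    """Max calibration: map the largest magnitude to the top integer level."""
    return np.abs(x).max() / (2 ** (n_bits - 1) - 1)

def quantize(x, scale, n_bits=8):
    q_max = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale), -q_max, q_max).astype(np.int8)

# int8 matmul: integer multiply-accumulate in int32, one rescale at the end
w = np.random.randn(64, 32).astype(np.float32)
x = np.random.randn(16, 64).astype(np.float32)
s_w, s_x = calibrate_scale(w), calibrate_scale(x)
y_int = quantize(x, s_x).astype(np.int32) @ quantize(w, s_w).astype(np.int32)
y = y_int * (s_x * s_w)                    # dequantize for the next layer
print(np.abs(y - x @ w).max())             # error introduced by 8-bit quantization
```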
This list is automatically generated from the titles and abstracts of the papers on this site.