PIPE : Parallelized Inference Through Post-Training Quantization
Ensembling of Residual Expansions
- URL: http://arxiv.org/abs/2311.15806v1
- Date: Mon, 27 Nov 2023 13:29:34 GMT
- Title: PIPE : Parallelized Inference Through Post-Training Quantization
Ensembling of Residual Expansions
- Authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
- Abstract summary: PIPE is a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization.
It achieves superior performance on every benchmarked application (from vision to NLP tasks), architecture (ConvNets, transformers) and bit-width.
- Score: 23.1120983784623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks (DNNs) are ubiquitous in computer vision and natural
language processing, but suffer from high inference cost. This problem can be
addressed by quantization, which consists in converting floating point
operations into a lower bit-width format. With growing concerns over privacy
rights, we focus our efforts on data-free methods. However, such techniques
suffer from a lack of adaptability to the target devices, as hardware
typically only supports specific bit widths. Thus, to adapt to a variety of
devices, a quantization method should be flexible enough to find good accuracy
vs. speed trade-offs for every bit width and target device. To achieve this,
we propose PIPE, a quantization method that leverages residual error expansion,
along with group sparsity and an ensemble approximation for better
parallelization. PIPE is backed by strong theoretical guarantees and
achieves superior performance on every benchmarked application (from vision to
NLP tasks), architecture (ConvNets, transformers) and bit-width (from int8 to
ternary quantization).
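To make the residual error expansion concrete, below is a minimal NumPy sketch of the general idea (an illustration under simplifying assumptions, not the authors' implementation; the group-sparsity component and the theoretical guarantees are not reproduced). Each expansion order quantizes the error left by the previous orders, and at inference the orders can be evaluated as parallel branches whose outputs are summed, as in an ensemble.
```python
import numpy as np

def quantize(w, n_bits):
    """Symmetric uniform quantizer: returns integer codes and their scale."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    q = np.clip(np.round(w / scale), -(2 ** (n_bits - 1) - 1), 2 ** (n_bits - 1) - 1)
    return q.astype(np.int32), scale

def residual_expansion(w, n_bits, order):
    """Quantize w, then repeatedly quantize the error left by previous orders."""
    terms, residual = [], w.astype(np.float64)
    for _ in range(order):
        q, s = quantize(residual, n_bits)
        terms.append((q, s))
        residual = residual - q * s          # error handed to the next order
    return terms

# Toy usage: a random weight matrix, 4-bit codes, expansion order 3.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
x = rng.normal(size=64)

terms = residual_expansion(w, n_bits=4, order=3)
# Each branch (q * s) @ x can run in parallel; the ensemble output is their sum.
y = sum((q * s) @ x for q, s in terms)
print("max expansion error:", np.abs(w @ x - y).max())
```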
Related papers
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
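For context, here is a minimal sketch of the logarithmic (power-of-two) quantizer family that AdaLog generalizes; the fixed base 2 is an assumption for illustration, and the adaptive base that gives AdaLog its name is not reproduced.
```python
import numpy as np

def log2_quantize(x, n_bits=4):
    """Round magnitudes to the nearest power of two, keeping only
    2**n_bits exponent levels below the largest observed value."""
    exp = np.round(np.log2(np.clip(np.abs(x), 1e-12, None)))
    exp = np.clip(exp, exp.max() - (2 ** n_bits - 1), exp.max())
    return np.sign(x) * np.power(2.0, exp)

acts = np.random.default_rng(1).normal(size=1000)
print("max log-quantization error:", np.abs(acts - log2_quantize(acts)).max())
```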
arXiv Detail & Related papers (2024-07-17T18:38:48Z) - Efficient and Mathematically Robust Operations for Certified Neural
Networks Inference [3.666326242924816]
Concerns about certification have arisen for machine learning (ML) and neural networks (NNs).
This paper delves into the inference stage and the requisite hardware, highlighting the challenges associated with IEEE 754 floating-point arithmetic.
By evaluating diverse summation and dot product algorithms, we aim to mitigate issues related to non-associativity.
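A quick demonstration of the non-associativity in question, with Python's math.fsum standing in for one of the more robust summation algorithms such work evaluates:
```python
import math

a, b, c = 1e20, -1e20, 1.0
print((a + b) + c)           # 1.0
print(a + (b + c))           # 0.0 -- c is absorbed before the cancellation
print(math.fsum([a, b, c]))  # 1.0 -- exactly rounded summation
```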
arXiv Detail & Related papers (2024-01-16T09:22:38Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Scaled Quantization for the Vision Transformer [0.0]
Quantization using a small number of bits shows promise for reducing latency and memory usage in deep neural networks.
This paper proposes a robust method for the full integer quantization of vision transformer networks without requiring any intermediate floating-point computations.
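As a hedged sketch of the kind of integer-only requantization such methods rely on (standard fixed-point scale folding, not necessarily this paper's exact procedure): the floating-point rescaling factor is approximated by an integer multiplier and a bit shift, so int32 accumulators can be brought back to int8 without any float arithmetic.
```python
import numpy as np

def requantize(acc_int32, multiplier, shift):
    """Rescale int32 accumulators to int8 with an integer multiply and shift."""
    scaled = (acc_int32.astype(np.int64) * multiplier) >> shift
    return np.clip(scaled, -128, 127).astype(np.int8)

# Fold a (hypothetical) float scale of 0.0037 into multiplier/shift form.
scale, shift = 0.0037, 31
multiplier = int(round(scale * (1 << shift)))

acc = np.array([12000, -52000, 3000], dtype=np.int32)  # int32 matmul outputs
print(requantize(acc, multiplier, shift))               # int8 activations
```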
arXiv Detail & Related papers (2023-03-23T18:31:21Z) - REx: Data-Free Residual Quantization Error Expansion [32.87131159997359]
Deep neural networks (DNNs) are ubiquitous in computer vision and natural language processing, but suffer from high inference cost.
With the growing concerns on privacy rights, we focus our efforts on data-free methods.
We propose REx, a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization.
arXiv Detail & Related papers (2022-03-28T11:04:45Z) - Bitwidth Heterogeneous Federated Learning with Progressive Weight
Dequantization [58.31288475660333]
We introduce a pragmatic Federated Learning scenario with bitwidth heterogeneity, dubbed Bitwidth Heterogeneous Federated Learning (BHFL).
BHFL brings in a new challenge: the aggregation of model parameters with different bitwidths could result in severe performance degeneration.
We propose the ProWD framework, which has a trainable weight dequantizer at the central server that progressively reconstructs low-bitwidth weights into higher-bitwidth weights, and finally into full-precision weights.
arXiv Detail & Related papers (2022-02-23T12:07:02Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Understanding and Overcoming the Challenges of Efficient Transformer
Quantization [17.05322956052278]
Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks.
However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices.
We show that transformers have unique quantization challenges -- namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format.
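A small, self-contained illustration of that challenge (the activation distribution and the outlier value below are invented for the example): a single large activation inflates the scale of a uniform 8-bit quantizer and wipes out the resolution available to the bulk of the values.
```python
import numpy as np

def uniform_quantize(x, n_bits=8):
    """Symmetric uniform fixed-point quantization."""
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
acts = rng.normal(scale=0.1, size=4096)
acts_with_outlier = np.append(acts, 60.0)   # one outlier stretches the range

err_plain = np.abs(acts - uniform_quantize(acts)).mean()
err_outlier = np.abs(acts - uniform_quantize(acts_with_outlier)[:-1]).mean()
print(err_plain, err_outlier)  # the outlier raises the mean error by orders of magnitude
```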
arXiv Detail & Related papers (2021-09-27T10:57:18Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
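As a rough sketch of the idea (not the paper's exact scheme): an unsigned k-bit weight code can be rewritten as a fixed offset plus k scaled {-1, +1} branches, so each branch's dot products can use cheap binary kernels and the branches can be accelerated in parallel.
```python
import numpy as np

def to_pm1_branches(q, n_bits):
    """Decompose integer codes q in [0, 2**n_bits - 1] into {-1, +1} branches:
    q = offset + sum_i 2**(i-1) * s_i  with  s_i in {-1, +1}."""
    offset = (2 ** n_bits - 1) / 2
    branches = []
    for i in range(n_bits):
        bit = (q >> i) & 1                               # 0/1 digit of the code
        branches.append((2.0 ** (i - 1), 2 * bit - 1))   # (scale, {-1,+1} tensor)
    return offset, branches

# Check the reconstruction on random 4-bit codes.
q = np.random.default_rng(0).integers(0, 16, size=(8, 8))
offset, branches = to_pm1_branches(q, n_bits=4)
recon = offset + sum(scale * s for scale, s in branches)
print(np.abs(recon - q).max())  # 0.0
```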
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)