PIPE : Parallelized Inference Through Post-Training Quantization
Ensembling of Residual Expansions
- URL: http://arxiv.org/abs/2311.15806v1
- Date: Mon, 27 Nov 2023 13:29:34 GMT
- Title: PIPE : Parallelized Inference Through Post-Training Quantization
Ensembling of Residual Expansions
- Authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
- Abstract summary: PIPE is a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization.
It achieves superior performance on every benchmarked application (from vision to NLP tasks), architecture (ConvNets, transformers) and bit-width.
- Score: 23.1120983784623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks (DNNs) are ubiquitous in computer vision and natural
language processing, but suffer from high inference cost. This problem can be
addressed by quantization, which consists in converting floating point
operations into a lower bit-width format. With growing concerns over privacy
rights, we focus our efforts on data-free methods. However, such techniques
suffer from a lack of adaptability to the target devices, as hardware
typically only supports specific bit widths. Thus, to adapt to a variety of
devices, a quantization method should be flexible enough to find good accuracy
vs. speed trade-offs for every bit width and target device. To achieve this,
we propose PIPE, a quantization method that leverages residual error expansion,
along with group sparsity and an ensemble approximation for better
parallelization. PIPE is backed by strong theoretical guarantees and
achieves superior performance on every benchmarked application (from vision to
NLP tasks), architecture (ConvNets, transformers) and bit-width (from int8 to
ternary quantization).
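To make the residual error expansion concrete, below is a minimal NumPy sketch of the general idea (an illustration under simplifying assumptions, not the authors' implementation; the group-sparsity component and the theoretical guarantees are not reproduced). Each expansion order quantizes the error left by the previous orders, and at inference the orders can be evaluated as parallel branches whose outputs are summed, as in an ensemble.
```python
import numpy as np

def quantize(w, n_bits):
    """Symmetric uniform quantizer: returns integer codes and their scale."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    q = np.clip(np.round(w / scale), -(2 ** (n_bits - 1) - 1), 2 ** (n_bits - 1) - 1)
    return q.astype(np.int32), scale

def residual_expansion(w, n_bits, order):
    """Quantize w, then repeatedly quantize the error left by previous orders."""
    terms, residual = [], w.astype(np.float64)
    for _ in range(order):
        q, s = quantize(residual, n_bits)
        terms.append((q, s))
        residual = residual - q * s          # error handed to the next order
    return terms

# Toy usage: a random weight matrix, 4-bit codes, expansion order 3.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
x = rng.normal(size=64)

terms = residual_expansion(w, n_bits=4, order=3)
# Each branch (q * s) @ x can run in parallel; the ensemble output is their sum.
y = sum((q * s) @ x for q, s in terms)
print("max expansion error:", np.abs(w @ x - y).max())
```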
Related papers
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
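For context, here is a minimal sketch of the logarithmic (power-of-two) quantizer family that AdaLog generalizes; the fixed base 2 is an assumption for illustration, and the adaptive base that gives AdaLog its name is not reproduced.
```python
import numpy as np

def log2_quantize(x, n_bits=4):
    """Round magnitudes to the nearest power of two, keeping only
    2**n_bits exponent levels below the largest observed value."""
    exp = np.round(np.log2(np.clip(np.abs(x), 1e-12, None)))
    exp = np.clip(exp, exp.max() - (2 ** n_bits - 1), exp.max())
    return np.sign(x) * np.power(2.0, exp)

acts = np.random.default_rng(1).normal(size=1000)
print("max log-quantization error:", np.abs(acts - log2_quantize(acts)).max())
```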
arXiv Detail & Related papers (2024-07-17T18:38:48Z) - Efficient and Mathematically Robust Operations for Certified Neural
Networks Inference [3.666326242924816]
Concerns about certification have arisen for machine learning (ML) and neural networks (NNs).
This paper delves into the inference stage and the requisite hardware, highlighting the challenges associated with IEEE 754 floating-point arithmetic.
By evaluating diverse summation and dot product algorithms, we aim to mitigate issues related to non-associativity.
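A quick demonstration of the non-associativity in question, with Python's math.fsum standing in for one of the more robust summation algorithms such work evaluates:
```python
import math

a, b, c = 1e20, -1e20, 1.0
print((a + b) + c)           # 1.0
print(a + (b + c))           # 0.0 -- c is absorbed before the cancellation
print(math.fsum([a, b, c]))  # 1.0 -- exactly rounded summation
```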
arXiv Detail & Related papers (2024-01-16T09:22:38Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Scaled Quantization for the Vision Transformer [0.0]
Quantization using a small number of bits shows promise for reducing latency and memory usage in deep neural networks.
This paper proposes a robust method for the full integer quantization of vision transformer networks without requiring any intermediate floating-point computations.
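As a hedged sketch of the kind of integer-only requantization such methods rely on (standard fixed-point scale folding, not necessarily this paper's exact procedure): the floating-point rescaling factor is approximated by an integer multiplier and a bit shift, so int32 accumulators can be brought back to int8 without any float arithmetic.
```python
import numpy as np

def requantize(acc_int32, multiplier, shift):
    """Rescale int32 accumulators to int8 with an integer multiply and shift."""
    scaled = (acc_int32.astype(np.int64) * multiplier) >> shift
    return np.clip(scaled, -128, 127).astype(np.int8)

# Fold a (hypothetical) float scale of 0.0037 into multiplier/shift form.
scale, shift = 0.0037, 31
multiplier = int(round(scale * (1 << shift)))

acc = np.array([12000, -52000, 3000], dtype=np.int32)  # int32 matmul outputs
print(requantize(acc, multiplier, shift))               # int8 activations
```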
arXiv Detail & Related papers (2023-03-23T18:31:21Z) - REx: Data-Free Residual Quantization Error Expansion [32.87131159997359]
Deep neural networks (DNNs) are ubiquitous in computer vision and natural language processing, but suffer from high inference cost.
With the growing concerns on privacy rights, we focus our efforts on data-free methods.
We propose REx, a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization.
arXiv Detail & Related papers (2022-03-28T11:04:45Z) - Bitwidth Heterogeneous Federated Learning with Progressive Weight
Dequantization [58.31288475660333]
We introduce a pragmatic Federated Learning scenario with bitwidth heterogeneity, dubbed Bitwidth Heterogeneous Federated Learning (BHFL).
BHFL brings in a new challenge: the aggregation of model parameters with different bitwidths could result in severe performance degeneration.
We propose the ProWD framework, which has a trainable weight dequantizer at the central server that progressively reconstructs low-bitwidth weights into higher-bitwidth weights, and finally into full-precision weights.
arXiv Detail & Related papers (2022-02-23T12:07:02Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Understanding and Overcoming the Challenges of Efficient Transformer
Quantization [17.05322956052278]
Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks.
However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices.
We show that transformers have unique quantization challenges -- namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format.
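A small, self-contained illustration of that challenge (the activation distribution and the outlier value below are invented for the example): a single large activation inflates the scale of a uniform 8-bit quantizer and wipes out the resolution available to the bulk of the values.
```python
import numpy as np

def uniform_quantize(x, n_bits=8):
    """Symmetric uniform fixed-point quantization."""
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
acts = rng.normal(scale=0.1, size=4096)
acts_with_outlier = np.append(acts, 60.0)   # one outlier stretches the range

err_plain = np.abs(acts - uniform_quantize(acts)).mean()
err_outlier = np.abs(acts - uniform_quantize(acts_with_outlier)[:-1]).mean()
print(err_plain, err_outlier)  # the outlier raises the mean error by orders of magnitude
```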
arXiv Detail & Related papers (2021-09-27T10:57:18Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
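As a rough sketch of the idea (not the paper's exact scheme): an unsigned k-bit weight code can be rewritten as a fixed offset plus k scaled {-1, +1} branches, so each branch's dot products can use cheap binary kernels and the branches can be accelerated in parallel.
```python
import numpy as np

def to_pm1_branches(q, n_bits):
    """Decompose integer codes q in [0, 2**n_bits - 1] into {-1, +1} branches:
    q = offset + sum_i 2**(i-1) * s_i  with  s_i in {-1, +1}."""
    offset = (2 ** n_bits - 1) / 2
    branches = []
    for i in range(n_bits):
        bit = (q >> i) & 1                               # 0/1 digit of the code
        branches.append((2.0 ** (i - 1), 2 * bit - 1))   # (scale, {-1,+1} tensor)
    return offset, branches

# Check the reconstruction on random 4-bit codes.
q = np.random.default_rng(0).integers(0, 16, size=(8, 8))
offset, branches = to_pm1_branches(q, n_bits=4)
recon = offset + sum(scale * s for scale, s in branches)
print(np.abs(recon - q).max())  # 0.0
```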
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)