R2 Loss: Range Restriction Loss for Model Compression and Quantization
- URL: http://arxiv.org/abs/2303.08253v2
- Date: Sun, 11 Feb 2024 19:04:48 GMT
- Title: R2 Loss: Range Restriction Loss for Model Compression and Quantization
- Authors: Arnav Kundu, Chungkuk Yoo, Srijan Mishra, Minsik Cho, Saurabh Adya
- Abstract summary: We propose Range Restriction Loss (R2-Loss) for building models friendly to lower-bit quantization and compression by removing outliers from weights during pre-training.
R2-Loss improves lower bit quantization accuracy with state-of-the-art post-training quantization (PTQ), quantization-aware training (QAT), and model compression techniques.
- Score: 6.218599842159466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model quantization and compression are widely used techniques to reduce
computing-resource usage at inference time. While state-of-the-art works have
achieved reasonable accuracy at higher bit widths such as 4-bit or 8-bit,
quantizing or compressing a model further, e.g., to 1-bit or 2-bit, remains
challenging. To overcome this challenge, we focus on outliers in the weights of
a pre-trained model, which disrupt effective lower-bit quantization and
compression. In this work, we propose Range Restriction Loss (R2-Loss) for
building lower-bit quantization- and compression-friendly models by removing
outliers from weights during pre-training. By effectively restricting the range
of weights, we mold the overall distribution into a tight shape that ensures
high quantization bit resolution, allowing model compression and quantization
techniques to better utilize their limited numeric representation power. We
introduce three variants, L-inf R2-Loss, its extension Margin R2-Loss, and a
new Soft-Min-Max R2-Loss, to be used as auxiliary losses during full-precision
model training. These variants suit different use cases: L-inf and Margin
R2-Loss are effective for symmetric quantization, while Soft-Min-Max R2-Loss
performs better for model compression. In our experiments, R2-Loss improves
lower-bit quantization accuracy with state-of-the-art post-training
quantization (PTQ), quantization-aware training (QAT), and model compression
techniques. With R2-Loss, MobileNet-V2 2-bit-weight/8-bit-activation PTQ
improves from 50.66% to 59.49%, MobileNet-V1 2-bit weight-and-activation QAT
from 55.96% to 59.05%, and ResNet18 1-bit weight compression from 45.54% to
52.58%.
Related papers
- SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [58.5019443418822]
Diffusion models have been proven highly effective at generating high-quality images.
As these models grow larger, they require significantly more memory and suffer from higher latency.
In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits.
arXiv Detail & Related papers (2024-11-07T18:59:58Z)
- SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression [7.6131620435684875]
SLIM is a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation.
SLIM improves model accuracy by up to 5.66% (LLaMA-2-7B) for 2:4 sparsity with 4-bit weight quantization, outperforming prior methods.
We also propose an optional PEFT recipe that further improves accuracy by up to 1.66% (LLaMA-2-13B) compared to SLIM without fine-tuning.
arXiv Detail & Related papers (2024-10-12T18:36:07Z)
- 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment.
Low-bit quantization is known to degrade the accuracy of SR models relative to their full-precision (FP) counterparts.
We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
arXiv Detail & Related papers (2024-06-10T06:06:11Z)
- Dynamic Dual Trainable Bounds for Ultra-low Precision Super-Resolution Networks [82.18396309806577]
We propose a novel activation quantizer, referred to as Dynamic Dual Trainable Bounds (DDTB).
Our DDTB exhibits significant performance improvements in ultra-low precision.
For example, our DDTB achieves a 0.70dB PSNR increase on Urban100 benchmark when quantizing EDSR to 2-bit and scaling up output images to x4.
arXiv Detail & Related papers (2022-03-08T04:26:18Z)
- Pruning Ternary Quantization [32.32812780843498]
Inference time, model size, and accuracy are three key factors in deep model compression.
We propose pruning ternary quantization (PTQ): a simple, effective, symmetric ternary quantization method.
Our method is verified on image classification, object detection/segmentation tasks with different network structures.
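The blurb above does not describe the method's details. A minimal sketch of generic symmetric ternary quantization with pruning, where the threshold rule (a magnitude quantile) and the shared scale (mean of surviving magnitudes) are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def ternarize(w, sparsity=0.5):
    # Generic ternary quantization sketch (assumed, not the paper's exact
    # method): zero out the smallest-magnitude fraction of weights, then
    # map the survivors to a single shared scale +/- alpha.
    thresh = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) > thresh
    alpha = float(np.abs(w[mask]).mean()) if mask.any() else 0.0
    return np.sign(w) * mask * alpha
```

Every output weight lands in {-alpha, 0, +alpha}, which is what makes ternary models cheap to store and to run with additions instead of multiplications.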
arXiv Detail & Related papers (2021-07-23T02:18:00Z)
- Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which jointly applies channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
- PAMS: Quantized Super-Resolution via Parameterized Max Scale [84.55675222525608]
Deep convolutional neural networks (DCNNs) have shown dominant performance in the task of super-resolution (SR).
We propose a new quantization scheme termed PArameterized Max Scale (PAMS), which applies the trainable truncated parameter to explore the upper bound of the quantization range adaptively.
Experiments demonstrate that the proposed PAMS scheme can well compress and accelerate the existing SR models such as EDSR and RDN.
arXiv Detail & Related papers (2020-11-09T06:16:05Z)
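A "trainable truncated parameter for the upper bound of the quantization range," as PAMS is summarized above, can be sketched as a clip-then-round forward pass. Here `alpha` stands in for the learned parameter as a plain float, and the unsigned uniform scheme is an assumption for illustration:

```python
import numpy as np

def max_scale_quantize(x, alpha, bits=4):
    # Sketch of a PAMS-style quantizer (assumed details): truncate
    # activations to the trainable upper bound alpha, then quantize
    # uniformly onto (2^bits - 1) levels.
    levels = 2 ** bits - 1
    clipped = np.clip(x, 0.0, alpha)
    scale = alpha / levels
    return np.round(clipped / scale) * scale
```

In training, `alpha` would be a learnable tensor updated by gradients (with a straight-through estimator for the rounding), so the quantization range adapts per layer instead of being fixed from calibration statistics.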