RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization
- URL: http://arxiv.org/abs/2505.03803v1
- Date: Fri, 02 May 2025 08:47:49 GMT
- Title: RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization
- Authors: Chen Xu, Yuxuan Yue, Zukang Xu, Xing Hu, Jiangyong Yu, Zhixuan Chen, Sifan Zhou, Zhihang Yuan, Dawei Yang
- Abstract summary: RWKV is a modern RNN architecture with performance comparable to Transformers, but it still faces challenges when deployed on resource-constrained devices. We propose RWKVQuant, a PTQ framework tailored for RWKV models, consisting of two novel techniques. Experiments show that RWKVQuant can quantize RWKV-6-14B to about 3 bits with less than 1% accuracy loss and a 2.14x speedup.
- Score: 10.42496371916904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: RWKV is a modern RNN architecture with performance comparable to Transformers, but it still faces challenges when deployed on resource-constrained devices. Post-Training Quantization (PTQ), an essential technique for reducing model size and inference latency, has been widely used in Transformer models. However, it suffers significant performance degradation when applied to RWKV. This paper investigates and identifies two key constraints inherent in the properties of RWKV: (1) Non-linear operators hinder the parameter fusion of both smooth- and rotation-based quantization, introducing extra computation overhead. (2) The larger amount of uniformly distributed weights poses challenges for cluster-based quantization, leading to reduced accuracy. To this end, we propose RWKVQuant, a PTQ framework tailored for RWKV models, consisting of two novel techniques: (1) a coarse-to-fine proxy capable of adaptively selecting different quantization approaches by assessing the uniformity of, and identifying outliers in, the weights, and (2) a codebook optimization algorithm that enhances the performance of cluster-based quantization methods for the element-wise multiplications in RWKV. Experiments show that RWKVQuant can quantize RWKV-6-14B to about 3 bits with less than 1% accuracy loss and a 2.14x speedup.
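The abstract does not spell out the proxy formulas, but the routing idea can be illustrated. Below is a minimal Python sketch, assuming normalized histogram entropy as the coarse uniformity measure and excess kurtosis as the fine outlier measure; both measures, the thresholds, and the function names are hypothetical stand-ins for the paper's actual proxies.

```python
import numpy as np

def coarse_uniformity(w: np.ndarray, bins: int = 256) -> float:
    # Normalized histogram entropy in [0, 1]; values near 1 indicate
    # near-uniformly distributed weights. Hypothetical coarse proxy.
    hist, _ = np.histogram(w, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum() / np.log2(bins))

def fine_outlier_score(w: np.ndarray) -> float:
    # Excess kurtosis as an outlier indicator: heavy-tailed weight
    # distributions score high. Hypothetical fine proxy.
    z = (w - w.mean()) / (w.std() + 1e-8)
    return float((z ** 4).mean() - 3.0)

def select_quantizer(w: np.ndarray, tau_u: float = 0.9, tau_o: float = 1.0) -> str:
    # Coarse-to-fine routing: the cheap uniformity screen runs first;
    # only near-uniform tensors get the finer outlier check.
    # Thresholds tau_u / tau_o are illustrative placeholders.
    if coarse_uniformity(w) < tau_u:
        return "vector"  # clearly non-uniform: cluster-based VQ fits
    if fine_outlier_score(w) > tau_o:
        return "vector"  # near-uniform overall, but outlier-heavy
    return "scalar"      # uniform and outlier-free: SQ suffices
```

Under this reading, near-uniform, outlier-free weight tensors fall back to scalar quantization, while clusterable or outlier-heavy tensors keep the cluster-based vector quantizer; this matches the abstract's observation that uniformly distributed weights are a poor fit for clustering.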
Related papers
- QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution [53.13952833016505]
We propose a low-bit quantization model for real-world video super-resolution (VSR). We use a calibration dataset to measure both spatial and temporal complexity for each layer. We refine the FP and low-bit branches to achieve simultaneous optimization.
arXiv Detail & Related papers (2025-08-06T14:35:59Z)
- LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Text-to-Image Generation [34.14174796390669]
Post-training quantization (PTQ) is a promising solution to reduce memory usage and accelerate inference. Existing PTQ methods suffer from severe performance degradation under extreme low-bit settings. We propose LRQ-DiT, an efficient and accurate PTQ framework.
arXiv Detail & Related papers (2025-08-05T14:16:11Z)
- SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models [12.716956318428652]
SegQuant is a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
arXiv Detail & Related papers (2025-07-20T04:00:53Z)
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
- HadaNorm: Diffusion Transformer Quantization through Mean-Centered Transformations [17.975720202894905]
Post-Training Quantization (PTQ) offers a promising solution by reducing the bitwidth of matrix operations. We propose HadaNorm, a novel linear transformation that extends existing approaches by both normalizing channel activations and applying Hadamard transforms. We demonstrate that HadaNorm consistently reduces quantization error across the various components of transformer blocks, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2025-06-11T16:54:34Z)
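Taken at face value, HadaNorm composes per-channel mean-centering with a Hadamard rotation. A minimal sketch of that composition, assuming activations shaped (tokens, channels) and centering across tokens; this illustrates the idea, not the authors' implementation:

```python
import numpy as np
from scipy.linalg import hadamard

def hadanorm_transform(x: np.ndarray) -> np.ndarray:
    # Mean-center each channel, then rotate with an orthonormal
    # Hadamard matrix so that outlier energy is spread across
    # channels before quantization. x: (tokens, channels); the
    # channel count must be a power of two for this construction.
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "channels must be a power of two"
    h = hadamard(n).astype(x.dtype) / np.sqrt(n)    # orthonormal rotation
    x_centered = x - x.mean(axis=0, keepdims=True)  # per-channel centering
    return x_centered @ h
```

Because the rotation is orthonormal, it can in principle be folded into adjacent linear layers or inverted after quantization; the centering statistics here are taken from the input itself, whereas a deployed version would likely use calibration data.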
- RWKV-Lite: Deeply Compressed RWKV for Resource-Constrained Devices [15.969537866628517]
We propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture. Our techniques reduce the memory footprint of RWKV models by 3.4x-5x with only negligible degradation in accuracy.
arXiv Detail & Related papers (2024-12-14T15:11:07Z)
- Exploring Real&Synthetic Dataset and Linear Attention in Image Restoration [47.26304397935705]
Image restoration aims to recover high-quality images from degraded inputs. Existing methods lack a unified training benchmark for iterations and configurations. We introduce a large-scale IR dataset called ReSyn, which employs a novel image filtering method based on image complexity.
arXiv Detail & Related papers (2024-12-05T02:11:51Z)
- PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution [95.98801201266099]
Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps. We propose PassionSR, a novel post-training quantization approach with adaptive scale in one-step diffusion (OSD) image SR. Our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR.
arXiv Detail & Related papers (2024-11-26T04:49:42Z)
- 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment.
Low-bit quantization, however, notoriously degrades the accuracy of SR models compared to their full-precision (FP) counterparts.
We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
arXiv Detail & Related papers (2024-06-10T06:06:11Z)
- Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models [33.372947082734946]
This paper introduces a series of architectures adapted from the RWKV model used in NLP, with requisite modifications tailored for diffusion models applied to image generation tasks.
Our model is designed to efficiently handle patchnified inputs in a sequence with extra conditions, while also scaling up effectively.
Its distinctive advantage is its reduced spatial aggregation complexity, making it exceptionally adept at processing high-resolution images.
arXiv Detail & Related papers (2024-04-06T02:54:35Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) at ultra-low cost. We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a reconstruction scheme that establishes long-range dependencies across multiple blocks to minimize error accumulation.
arXiv Detail & Related papers (2023-12-13T07:56:27Z)
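One plausible reading of the cross-block scheme, sketched below with hypothetical fp_blocks / q_blocks module lists (not CBQ's actual interface): run a calibration input through several consecutive blocks of both the full-precision and quantized models and match the joint outputs, so that errors compounding across block boundaries are penalized together rather than block by block.

```python
import torch
import torch.nn.functional as F

def cross_block_loss(fp_blocks, q_blocks, x, span: int = 2):
    # Propagate the same calibration input through `span` consecutive
    # full-precision and quantized blocks, then match the joint
    # outputs. Matching after several blocks, rather than per block,
    # exposes error accumulation to the optimizer.
    y_fp = x
    with torch.no_grad():  # the full-precision reference is fixed
        for blk in fp_blocks[:span]:
            y_fp = blk(y_fp)
    y_q = x
    for blk in q_blocks[:span]:
        y_q = blk(y_q)
    return F.mse_loss(y_q, y_fp)
```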
- Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR.
Our FQSR with low-bit quantization achieves on-par performance with its full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z)