Related papers: Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

URL: http://arxiv.org/abs/2503.22879v2
Date: Thu, 03 Apr 2025 15:04:19 GMT
Title: Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
Authors: Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu,
Abstract summary: State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance.<n>To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration.<n>We present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones.
Score: 11.472831744634156
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or limited-resource devices is challenging due to their storage requirements and computational power. To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. As SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, distinct bit-width configurations are essential for different scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for enhancing generation speed in short prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel order preserving and activation persistence of SSMs, we propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering for input $x$, combined with a per-state-group quantization for input-dependent parameters $B$ and $C$. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show that Quamba2-8B outperforms several state-of-the-art SSM quantization methods and delivers 1.3$\times$ and 3$\times$ speed-ups in the pre-filling and generation stages, respectively, while offering 4$\times$ memory reduction with only a $1.6\%$ average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: https://github.com/enyac-group/Quamba.

Related papers

Scaling State-Space Models on Multiple GPUs with Tensor Parallelism [0.24148976266903474]
Selective state space models (SSMs) have rapidly become a compelling backbone for large language models.<n>But in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU.<n>This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges.
arXiv Detail & Related papers (2026-02-24T17:47:54Z)
UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs [22.43695132973238]
We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device pruning rates for edge LLMs.<n>UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models.<n>Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device pruning rates up to 35%.
arXiv Detail & Related papers (2025-12-03T02:33:39Z)
Bi-Mamba: Towards Accurate 1-Bit State Space Models [45.70829795551893]
$textttBi-Mamba$ is a scalable and powerful 1-bit Mamba architecture designed to enable more efficient large language models (LLMs)<n>$textttBi-Mamba$ models are trained from scratch on a standard LLM-scale dataset using an autoregressive distillation loss.<n>Experiments on language modeling benchmarks demonstrate that $textttBi-Mamba$ achieves performance comparable to its full-precision counterparts.
arXiv Detail & Related papers (2024-11-18T18:59:15Z)
Quamba: A Post-Training Quantization Recipe for Selective State Space Models [8.924779222965798]
State Space Models (SSMs) have emerged as an appealing alternative to Transformers for large language models.<n>We propose a static 8-bit per-tensor SSM quantization method which suppresses the maximum values of the input activations to the selective SSM.<n>Our 8-bit weight-activation quantized Mamba 2.8B SSM benefits from hardware acceleration and achieves a 1.72x lower generation latency on an Nvidia Orin Nano 8G, with only a 0.9% drop in average accuracy on zero-shot tasks.
arXiv Detail & Related papers (2024-10-17T05:32:33Z)
MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels. It shows that batchsizes up to 16-32 can be supported with close to maximum ($4times$) quantization speedup. MarLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models [9.444063879246242]
We introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU.
arXiv Detail & Related papers (2024-08-16T06:39:08Z)
GroupMamba: Efficient Group-Based Visual State Space Model [66.35608254724566]
State-space models (SSMs) have recently shown promise in capturing long-range dependencies with subquadratic computational complexity. However, purely SSM-based models face critical challenges related to stability and achieving state-of-the-art performance in computer vision tasks. Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes.
arXiv Detail & Related papers (2024-07-18T17:59:58Z)
SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs)<n>We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at the group-wise.<n> Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
EfficientState Space Model viaFast Tensor Convolutionand Block Diagonalization [5.260841516691153]
We propose a new state space layer based on multiple-input multiple-output SSM, called efficient SSM. Our eSSM is built on the convolutional representation of multi-input and multi-input (MIMO) SSM. In the model efficiency benchmark, the parameters of eSSM are only 12.89% of LSTM and 13.24% of Mamba.
arXiv Detail & Related papers (2024-02-23T12:36:31Z)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module. We identify that a key weakness of such models is their inability to perform content-based reasoning. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even blocks (Mamba) As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference. We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Statefuls maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past values. This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters. In this paper, we develop first gradients that use 8-bit statistics while maintaining the performance levels of using 32-bit gradient states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices. Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors. Given an MCU-class memory bound to 2MB for weight-only quantization, the compressed models produced by the mixed-precision engine result as accurate as the state-of-the-art solutions.
arXiv Detail & Related papers (2020-08-12T06:09:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.