Quantizing Small-Scale State-Space Models for Edge AI
- URL: http://arxiv.org/abs/2506.12480v1
- Date: Sat, 14 Jun 2025 12:43:47 GMT
- Title: Quantizing Small-Scale State-Space Models for Edge AI
- Authors: Leo Zhao, Tristan Torchet, Melika Payvand, Laura Kriener, Filippo Moro,
- Abstract summary: State-space models (SSMs) have recently gained attention in deep learning for their ability to efficiently model long-range dependencies. In this paper, we analyze the effects of quantization on small-scale SSMs with a focus on reducing memory and computational costs while maintaining task performance.
- Score: 0.4941855521192951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-space models (SSMs) have recently gained attention in deep learning for their ability to efficiently model long-range dependencies, making them promising candidates for edge-AI applications. In this paper, we analyze the effects of quantization on small-scale SSMs with a focus on reducing memory and computational costs while maintaining task performance. Using the S4D architecture, we first investigate post-training quantization (PTQ) and show that the state matrix A and internal state x are particularly sensitive to quantization. Furthermore, we analyze the impact of different quantization techniques applied to the parameters and activations in the S4D architecture. To address the observed performance drop after PTQ, we apply quantization-aware training (QAT), significantly improving performance from 40% (PTQ) to 96% on the sequential MNIST benchmark at 8-bit precision. We further demonstrate the potential of QAT in enabling sub-8-bit precisions and evaluate different parameterization schemes for QAT stability. Additionally, we propose a heterogeneous quantization strategy that assigns different precision levels to model components, reducing the overall memory footprint by a factor of 6 without sacrificing performance. Our results provide actionable insights for deploying quantized SSMs in resource-constrained environments.
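To make the QAT and heterogeneous-precision ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the paper's implementation: a toy real-valued diagonal SSM (a simplification of S4D, which uses a complex diagonal state matrix) whose parameters A, B, C and internal state x are passed through a symmetric uniform fake-quantizer with a straight-through estimator (STE). The `fake_quant` helper, the `ToyQuantSSM` module, and the per-component bit assignment {A: 8, B: 4, C: 4, x: 8} are illustrative assumptions, not the paper's reported bit widths.

```python
import torch
import torch.nn as nn

def fake_quant(x, bits):
    """Symmetric uniform fake-quantization; the STE passes gradients through the rounding."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (x_q - x).detach()  # forward: quantized values; backward: identity

class ToyQuantSSM(nn.Module):
    """Toy real-valued diagonal SSM with a heterogeneous (per-component) bit assignment."""
    def __init__(self, d_state=16, bits=None):
        super().__init__()
        # Hypothetical assignment: A and the state x kept wider, since the paper
        # reports they are the most quantization-sensitive components.
        self.bits = bits or {"A": 8, "B": 4, "C": 4, "x": 8}
        self.log_neg_A = nn.Parameter(torch.linspace(0.5, 4.0, d_state).log())  # A = -exp(.) < 0
        self.B = nn.Parameter(torch.randn(d_state) / d_state ** 0.5)
        self.C = nn.Parameter(torch.randn(d_state) / d_state ** 0.5)
        self.log_dt = nn.Parameter(torch.full((1,), -3.0))

    def forward(self, u):                                       # u: (batch, length)
        A = -self.log_neg_A.exp()
        dt = self.log_dt.exp()
        A_bar = torch.exp(dt * fake_quant(A, self.bits["A"]))   # discretized state matrix
        B_bar = dt * fake_quant(self.B, self.bits["B"])
        C = fake_quant(self.C, self.bits["C"])
        x = torch.zeros(u.shape[0], A.shape[0], device=u.device)
        ys = []
        for k in range(u.shape[1]):
            x = A_bar * x + B_bar * u[:, k, None]               # state recurrence
            x = fake_quant(x, self.bits["x"])                   # quantize the internal state
            ys.append(x @ C)
        return torch.stack(ys, dim=1)                           # (batch, length)

# Smoke test: gradients flow through the fake-quantized recurrence.
y = ToyQuantSSM()(torch.randn(2, 20))
y.sum().backward()
```

In a PTQ setting, the same rounding would simply be applied once to the trained parameters; the STE is what turns this into QAT, since the rounding is present in the forward pass while gradients bypass it during training.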
Related papers
- Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models [8.589209709453026]
Quantization, particularly Post-Training Quantization (PTQ), offers an effective way to reduce model size and inference cost without retraining. We present a benchmark of eight state-of-the-art (SOTA) PTQ methods applied to two leading edge-ASR model families, Whisper and Moonshine. Our results characterize the trade-offs between efficiency and accuracy, demonstrating that even 3-bit quantization can succeed on high-capacity models.
arXiv Detail & Related papers (2025-07-10T16:00:27Z) - QS4D: Quantization-aware training for efficient hardware deployment of structured state-space sequential models [0.8474310104568011]
Structured State Space models (SSMs) have emerged as a new class of deep learning models. QAT can significantly reduce the complexity of SSMs by up to two orders of magnitude across various performance metrics. We show that QAT enhances robustness to analog noise and enables structural pruning.
arXiv Detail & Related papers (2025-07-08T15:19:14Z) - Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information amounting to as much as 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth, which adopts post-training quantization to quantize monocular depth estimation (MDE) models with hardware acceleration for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. We design a flexible and programmable hardware accelerator that supports kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z) - QMamba: Post-Training Quantization for Vision State Space Models [45.97843526485619]
State Space Models (SSMs) have recently gained increasing attention for vision models. Given the computational cost of deploying SSMs on resource-limited edge devices, Post-Training Quantization (PTQ) is a promising technique for their efficient deployment. We propose QMamba, one of the first PTQ frameworks designed for vision SSMs, based on an analysis of the activation distributions in SSMs.
arXiv Detail & Related papers (2025-01-23T12:45:20Z) - GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for domain generalization (DG). Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization. Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z) - Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques [0.0]
Quantization can achieve up to 68% reduction in model size.
Int8 quantization delivers a 40% reduction in computational cost and power consumption.
Int4 quantization further improves these metrics by 60%.
arXiv Detail & Related papers (2024-11-09T06:30:13Z) - Q-S5: Towards Quantized State Space Models [41.94295877935867]
State Space Models (SSMs) have emerged as a potent alternative to transformers.
This paper investigates the effect of quantization on the S5 model to understand its impact on model performance.
arXiv Detail & Related papers (2024-06-13T09:53:24Z) - On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we build an On-Chip Quantization-Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z) - Modular Quantization-Aware Training for 6D Object Pose Estimation [52.9436648014338]
Edge applications demand efficient 6D object pose estimation on resource-constrained embedded platforms.
We introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy.
MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques.
arXiv Detail & Related papers (2023-03-12T21:01:54Z) - Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We can obtain an 81.29% top-1 accuracy using DeiT-B model on ImageNet dataset with about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z) - Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of both.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to lower bit widths, which further reduces the time cost and improves the quantization accuracy; a hedged sketch of this bit-inheritance idea is given after the list.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
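The bit-inheritance scheme mentioned in the last entry can be illustrated with a small hypothetical sketch (PyTorch; not the OQAT implementation). The assumption here is LSQ-style symmetric uniform quantization: a child quantizer at a lower bit width inherits the parent's learned step size, rescaled so that the parent's clipping range is preserved, and then serves as the initialization for further fine-tuning. The function names and the 4-bit-to-2-bit example are illustrative.

```python
import torch

def inherit_step_size(parent_step, parent_bits, child_bits):
    """Rescale a learned step size so the child bit width covers the parent's clipping range."""
    parent_qmax = 2 ** (parent_bits - 1) - 1
    child_qmax = 2 ** (child_bits - 1) - 1
    return parent_step * parent_qmax / child_qmax

def quantize(w, step, bits):
    """Symmetric uniform quantization (dequantized back to float for inspection)."""
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(w / step), -qmax - 1, qmax) * step

w = torch.randn(256, 256)
s4 = w.abs().max() / (2 ** 3 - 1)                       # step size of a 4-bit parent quantizer
s2 = inherit_step_size(s4, parent_bits=4, child_bits=2)
w2 = quantize(w, s2, bits=2)                            # 2-bit child initialized from the parent
```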
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.