QS4D: Quantization-aware training for efficient hardware deployment of structured state-space sequential models
- URL: http://arxiv.org/abs/2507.06079v1
- Date: Tue, 08 Jul 2025 15:19:14 GMT
- Title: QS4D: Quantization-aware training for efficient hardware deployment of structured state-space sequential models
- Authors: Sebastian Siegel, Ming-Jay Yang, Younes Bouhadjar, Maxime Fabre, Emre Neftci, John Paul Strachan
- Abstract summary: Structured State Space models (SSM) have emerged as a new class of deep learning models. QAT can significantly reduce the complexity of SSMs by up to two orders of magnitude across various performance metrics. We show that QAT enhances robustness to analog noise and enables structural pruning.
- Score: 0.8474310104568011
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Structured State Space models (SSM) have recently emerged as a new class of deep learning models, particularly well-suited for processing long sequences. Their constant memory footprint, in contrast to the linearly scaling memory demands of Transformers, makes them attractive candidates for deployment on resource-constrained edge-computing devices. While recent works have explored the effect of quantization-aware training (QAT) on SSMs, they typically do not address its implications for specialized edge hardware, for example, analog in-memory computing (AIMC) chips. In this work, we demonstrate that QAT can significantly reduce the complexity of SSMs by up to two orders of magnitude across various performance metrics. We analyze the relation between model size and numerical precision, and show that QAT enhances robustness to analog noise and enables structural pruning. Finally, we integrate these techniques to deploy SSMs on a memristive analog in-memory computing substrate and highlight the resulting benefits in terms of computational efficiency.
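The abstract describes quantization-aware training (QAT) at a high level; the mechanism behind most QAT schemes is fake quantization of the weights in the forward pass combined with a straight-through estimator so that gradients ignore the non-differentiable rounding. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the 4-bit default, the symmetric per-tensor scaling, and the `QuantLinear` wrapper (a stand-in for, say, an SSM input or output projection) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuant(nn.Module):
    """Symmetric, per-tensor uniform fake-quantizer with a straight-through estimator."""
    def __init__(self, n_bits: int = 4):
        super().__init__()
        self.n_bits = n_bits

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        qmax = 2 ** (self.n_bits - 1) - 1
        scale = w.detach().abs().max().clamp(min=1e-8) / qmax
        w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        # Straight-through estimator: use quantized weights in the forward pass,
        # but let gradients bypass the non-differentiable rounding.
        return w + (w_q - w).detach()

class QuantLinear(nn.Linear):
    """Linear layer (e.g. a hypothetical SSM input/output projection) trained with
    fake-quantized weights so the learned parameters tolerate low precision."""
    def __init__(self, in_features: int, out_features: int, n_bits: int = 4, bias: bool = True):
        super().__init__(in_features, out_features, bias=bias)
        self.quant = FakeQuant(n_bits)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.quant(self.weight), self.bias)
```

After training, such weights can be exported as integers plus a single scale factor, which is the kind of representation that maps onto low-precision digital kernels or onto the discrete conductance levels of an analog in-memory computing crossbar.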
Related papers
- Systolic Array-based Accelerator for Structured State-Space Models [1.137896937254823]
State-Space Models (SSMs) process very long data sequences more efficiently than recurrent and Transformer-based models. In this paper, we introduce a specialized hardware accelerator, EpochCore, for accelerating SSMs. EpochCore achieves, on average, a 2000x performance improvement on LRA datasets compared to a GPU.
arXiv Detail & Related papers (2025-07-29T00:01:57Z)
- Quantizing Small-Scale State-Space Models for Edge AI [0.4941855521192951]
State-space models (SSMs) have recently gained attention in deep learning for their ability to efficiently model long-range dependencies. In this paper, we analyze the effects of quantization on small-scale SSMs with a focus on reducing memory and computational costs while maintaining task performance.
arXiv Detail & Related papers (2025-06-14T12:43:47Z)
- Scaling Probabilistic Circuits via Monarch Matrices [109.65822339230853]
Probabilistic Circuits (PCs) are tractable representations of probability distributions. We propose a novel sparse and structured parameterization for the sum blocks in PCs.
arXiv Detail & Related papers (2025-06-14T07:39:15Z)
- Quantum Kernel-Based Long Short-term Memory [0.30723404270319693]
We introduce the Quantum Kernel-Based Long Short-Term Memory (QK-LSTM) network to capture complex, non-linear patterns in sequential data.
This quantum-enhanced architecture demonstrates efficient convergence, robust loss minimization, and model compactness.
Benchmark comparisons reveal that QK-LSTM achieves performance on par with classical LSTM models, yet with fewer parameters.
arXiv Detail & Related papers (2024-11-20T11:39:30Z)
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses the heavy computational demands of real-time visual inference by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- SMR: State Memory Replay for Long Sequence Modeling [19.755738298836526]
This paper proposes a novel non-recursive non-uniform sample processing strategy to overcome compatibility limitations in parallel convolutional computation.
We introduce State Memory Replay (SMR), which utilizes learnable memories to adjust the current state with multi-step information for generalization at sampling points different from those in the training data.
Experiments on long-range modeling tasks in autoregressive language modeling and Long Range Arena demonstrate the general effectiveness of the SMR mechanism for a series of SSM models.
arXiv Detail & Related papers (2024-05-27T17:53:32Z)
- HOPE for a Robust Parameterization of Long-memory State Space Models [51.66430224089725]
State-space models (SSMs) that utilize linear, time-invariant (LTI) systems are known for their effectiveness in learning long sequences.
We develop a new parameterization scheme, called HOPE, for LTI systems that utilize Markov parameters within Hankel operators.
Our new parameterization endows the SSM with non-decaying memory within a fixed time window, which is empirically corroborated by a sequential CIFAR-10 task with padded noise.
arXiv Detail & Related papers (2024-05-22T20:20:14Z)
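As background for the HOPE entry above: for a discrete LTI system (A, B, C), the Markov parameters are h_k = C A^k B, and the Hankel operator arranges them as H[i, j] = h_{i+j}. The NumPy sketch below shows only this textbook construction, which HOPE parameterizes over; it is not the HOPE training scheme, and the toy system matrices are made up for illustration.

```python
import numpy as np

def markov_parameters(A, B, C, count):
    """Impulse response (Markov parameters) h_k = C @ A^k @ B of a discrete LTI system."""
    h, Ak = [], np.eye(A.shape[0])
    for _ in range(count):
        h.append(C @ Ak @ B)
        Ak = Ak @ A
    return h

def hankel_operator(h):
    """Finite Hankel matrix H[i, j] = h[i + j] built from Markov parameters."""
    n = (len(h) + 1) // 2
    return np.block([[h[i + j] for j in range(n)] for i in range(n)])

# Toy single-input single-output system with a stable state matrix (illustrative only).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, -1.0]])

H = hankel_operator(markov_parameters(A, B, C, 7))
print(H.shape)  # (4, 4) for 7 Markov parameters
```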
- Distributed Representations Enable Robust Multi-Timescale Symbolic Computation in Neuromorphic Hardware [3.961418890143814]
We describe a single-shot weight learning scheme to embed robust multi-timescale dynamics into attractor-based RSNNs. We embed finite state machines into the RSNN dynamics by superimposing a symmetric autoassociative weight matrix. This work introduces a scalable approach to embed robust symbolic computation through recurrent dynamics into neuromorphic hardware.
arXiv Detail & Related papers (2024-05-02T14:11:50Z)
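The "symmetric autoassociative weight matrix" mentioned in the entry above is, in its simplest classical form, a Hopfield-style outer-product rule whose stored patterns become attractors of the recurrent dynamics. The snippet below shows that textbook rule with bipolar patterns purely as background; the paper's single-shot embedding into spiking RSNNs on neuromorphic hardware is more involved, and the pattern count and network size here are arbitrary.

```python
import numpy as np

def autoassociative_weights(patterns):
    """Hopfield-style outer-product rule: W = sum_p x_p x_p^T / N, symmetric, zero diagonal."""
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(W, 0.0)  # no self-connections
    return W

def recall(W, state, steps=10):
    """Iterate the sign dynamics; stored patterns act as fixed points (attractors)."""
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

rng = np.random.default_rng(0)
patterns = rng.choice([-1.0, 1.0], size=(3, 64))      # three bipolar patterns
W = autoassociative_weights(patterns)

noisy = patterns[0].copy()
noisy[:8] *= -1                                        # corrupt a few entries
print(np.array_equal(recall(W, noisy), patterns[0]))   # usually True for few stored patterns
```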
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to complement the lack of long-range dependency issues.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
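Vector quantization with a length-fixed codebook, as LongVQ uses to compress a global abstraction, amounts to replacing each feature vector with its nearest codeword. A minimal NumPy sketch of that lookup follows; the codebook size, feature dimension, and Euclidean metric are assumptions for illustration, and the codebook here is random rather than learned.

```python
import numpy as np

def vq_lookup(x, codebook):
    """Assign each row of x (T, d) to its nearest codeword in codebook (K, d)."""
    # Squared Euclidean distances between every token and every codeword: (T, K)
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)          # length-fixed code: one index per token
    return idx, codebook[idx]        # quantized sequence uses only K distinct vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 32))      # a long sequence of 32-dim features
codebook = rng.normal(size=(64, 32)) # fixed-size codebook of 64 codewords

idx, x_q = vq_lookup(x, codebook)
print(idx.shape, x_q.shape)          # (1024,) (1024, 32)
```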
- Stochastic Configuration Machines: FPGA Implementation [4.57421617811378]
Stochastic configuration networks (SCNs) are a prime choice in industrial applications due to their merits and feasibility for data modelling.
This paper aims to implement SCM models on a field programmable gate array (FPGA) and introduce binary-coded inputs to improve learning performance.
arXiv Detail & Related papers (2023-10-30T02:04:20Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Modular Quantization-Aware Training for 6D Object Pose Estimation [52.9436648014338]
Edge applications demand efficient 6D object pose estimation on resource-constrained embedded platforms.
We introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy.
MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques.
arXiv Detail & Related papers (2023-03-12T21:01:54Z)
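Module-wise mixed precision of the kind MQAT describes reduces, at inference time, to giving each sub-module its own bit width and quantizing it independently. The sketch below shows a fixed, hand-picked assignment purely as a schematic; MQAT itself determines the bit widths and the quantization order, and the module names and bit plan here are hypothetical.

```python
import numpy as np

def quantize(w, n_bits):
    """Symmetric per-tensor uniform quantization to n_bits."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

# Hypothetical per-module bit widths; a real assignment would be determined by the method.
bit_plan = {"backbone": 8, "rotation_head": 4, "translation_head": 6}

rng = np.random.default_rng(0)
weights = {name: rng.normal(size=(128, 128)) for name in bit_plan}

for name, w in weights.items():
    w_q = quantize(w, bit_plan[name])
    err = np.abs(w - w_q).mean()
    print(f"{name}: {bit_plan[name]}-bit, mean abs quantization error {err:.4f}")
```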
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences.