Related papers: FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization

FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization

URL: http://arxiv.org/abs/2505.18975v4
Date: Mon, 28 Jul 2025 13:28:06 GMT
Title: FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization
Authors: Aotao Wang, Haikuo Shao, Shaobo Ma, Zhongfeng Wang,
Abstract summary: This paper presents FastMamba, a dedicated accelerator on FPGA with hardware-algorithm co-design to promote the deployment efficiency of Mamba2.<n>Specifically, we successfully achieve 8-bit quantization for linear layers through Hadamard transformation to eliminate outliers.<n>In the output decode experiment with Mamba2-2.7B, FastMamba attains 6times higher energy efficiency than GTX 3090 GPU.
Score: 2.725187542894576
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: State Space Models (SSMs), like recent Mamba2, have achieved remarkable performance and received extensive attention. However, deploying Mamba2 on resource-constrained edge devices encounters many problems: severe outliers within the linear layer challenging the quantization, diverse and irregular element-wise tensor operations, and hardware-unfriendly nonlinear functions in the SSM block. To address these issues, this paper presents FastMamba, a dedicated accelerator on FPGA with hardware-algorithm co-design to promote the deployment efficiency of Mamba2. Specifically, we successfully achieve 8-bit quantization for linear layers through Hadamard transformation to eliminate outliers. Moreover, a hardware-friendly and fine-grained power-of-two quantization framework is presented for the SSM block and convolution layer, and a first-order linear approximation is developed to optimize the nonlinear functions. Based on the accurate algorithm quantization, we propose an accelerator that integrates parallel vector processing units, pipelined execution dataflow, and an efficient SSM Nonlinear Approximation Unit, which enhances computational efficiency and reduces hardware complexity. Finally, we evaluate FastMamba on Xilinx VC709 FPGA. For the input prefill task on Mamba2-130M, FastMamba achieves 68.80\times and 8.90\times speedup over Intel Xeon 4210R CPU and NVIDIA RTX 3090 GPU, respectively. In the output decode experiment with Mamba2-2.7B, FastMamba attains 6\times higher energy efficiency than RTX 3090 GPU.

Related papers

Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention [63.69228529380251]
Spava is a sequence-parallel framework with optimized attention for long-video inference.<n>Spava delivers speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss.
arXiv Detail & Related papers (2026-01-29T09:23:13Z)
Implementation of high-efficiency, lightweight residual spiking neural network processor based on field-programmable gate arrays [0.49806798459446283]
This work presents an efficient residual SNN accelerator that combines algorithm and hardware co-design to optimize inference energy efficiency.<n>The proposed processor achieves a classification accuracy of 87.11% on the CIFAR-10 dataset, with an inference time of 3.98 ms per image and an energy efficiency of 183.5 FPS/W.
arXiv Detail & Related papers (2025-12-09T02:08:46Z)
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the ''Three Taxes'' (Bulk Synchronous, Inter- Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework.<n>We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution.<n>We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
eMamba: Efficient Acceleration Framework for Mamba Models in Edge Computing [14.932572899503935]
State Space Model (SSM)-based machine learning architectures have recently gained significant attention for processing sequential data.<n>eMamba is a comprehensive end-to-end hardware acceleration framework specifically designed for deploying Mamba models on edge platforms.<n>We show eMamba achieves comparable accuracy to state-of-the-art techniques using 1.63-19.9$times$ fewer parameters.
arXiv Detail & Related papers (2025-08-14T06:08:05Z)
Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation [129.45368843861917]
We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers.<n>We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
arXiv Detail & Related papers (2025-07-09T07:27:00Z)
Design and Implementation of an FPGA-Based Hardware Accelerator for Transformer [0.0]
Transformer-based large language models rely heavily on matrix multiplications for attention and feed-forward layers.<n>We introduce a highly optimized tiled matrix multiplication accelerator on a resource-constrained Xilinx KV260 FPGA.<n>Our design exploits persistent on-chip storage, a robust two-level tiling strategy for maximal data reuse, and a systolic-like unrolled compute engine.
arXiv Detail & Related papers (2025-03-20T22:15:42Z)
LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design [13.678834244877232]
State space models (SSMs) like Mamba have recently attracted much attention.<n>We propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference.<n>We implement LightMamba on Xilinx Versal VCK190 FPGA and achieve 4.65x to 6.06x higher energy efficiency over the GPU baseline.
arXiv Detail & Related papers (2025-02-21T07:23:23Z)
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. We propose the MobileMamba framework, which balances efficiency and performance. MobileMamba achieves up to 83.6% on Top-1, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs [0.0]
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV) This paper proposes textitFAMOUS, a flexible hardware accelerator for dense multi-head attention computation of TNNs on field-programmable gate arrays (FPGAs) It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency.
arXiv Detail & Related papers (2024-09-21T05:25:46Z)
PackMamba: Efficient Processing of Variable-Length Sequences in Mamba training [13.926804198202582]
Mamba, emerging as a groundbreaking architecture in the field of generative AI, demonstrates remarkable proficiency in handling elongated sequences. Existing training framework of Mamba presents inefficiency with variable-length sequence inputs. We propose PackMamba, a high- throughput Mamba that efficiently handles variable-length sequences.
arXiv Detail & Related papers (2024-08-07T16:13:43Z)
Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA [10.630802853096462]
Modern Neural Network (NN) architectures heavily rely on vast numbers of multiply-accumulate arithmetic operations. This paper proposes a high- throughput, scalable and energy efficient non-element-wise matrix multiplication unit on FPGAs. Using our AMU achieves up to 9x higher throughput and 112x higher energy efficiency over the state-of-the-art solutions for the FPGA-based Quantised Neural Network (QNN) accelerators.
arXiv Detail & Related papers (2024-07-02T15:28:10Z)
Enhancing Dropout-based Bayesian Neural Networks with Multi-Exit on FPGA [20.629635991749808]
This paper proposes an algorithm and hardware co-design framework that can generate field-programmable gate array (FPGA)-based accelerators for efficient BayesNNs. At the algorithm level, we propose novel multi-exit dropout-based BayesNNs with reduced computational and memory overheads. At the hardware level, this paper introduces a transformation framework that can generate FPGA-based accelerators for the proposed efficient BayesNNs.
arXiv Detail & Related papers (2024-06-20T17:08:42Z)
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics [45.666822327616046]
This work presents a novel reconfigurable architecture for Low Graph Neural Network (LL-GNN) designs for particle detectors. The LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
arXiv Detail & Related papers (2022-09-28T12:55:35Z)
Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks. The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources. This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
GPU-Accelerated Machine Learning in Non-Orthogonal Multiple Access [71.58925117604039]
Non-orthogonal multiple access (NOMA) is an interesting technology that enables massive connectivity as required in future 5G and 6G networks. We propose a neural network architecture that combines the advantages of both linear and non-linear processing.
arXiv Detail & Related papers (2022-06-13T09:38:23Z)
Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for 5G and Beyond [70.81551587109833]
nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity. One of the main challenges comes from the real-time implementation of these algorithms. This paper explores the acceleration of APSM-based algorithms through massive parallelization.
arXiv Detail & Related papers (2022-01-13T15:20:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.