Related papers: XAMBA: Enabling Efficient State Space Models on Resource-Constrained Neural Processing Units

XAMBA: Enabling Efficient State Space Models on Resource-Constrained Neural Processing Units

URL: http://arxiv.org/abs/2502.06924v4
Date: Mon, 31 Mar 2025 03:26:29 GMT
Title: XAMBA: Enabling Efficient State Space Models on Resource-Constrained Neural Processing Units
Authors: Arghadip Das, Arnab Raha, Shamik Kundu, Soumendu Kumar Ghosh, Deepak Mathaikutty, Vijay Raghunathan,
Abstract summary: State-Space Models (SSMs) have emerged as efficient alternatives to transformers for sequential data tasks.<n>XAMBA is the first framework to enable and optimize SSMs on commercial off-the-shelf (COTS) state-of-the-art (SOTA) NPUs.<n>XAMBA mitigates key bottlenecks using CumBA and ReduBA, replacing sequential CumSum and ReduceSum operations with matrix-based computations.
Score: 0.6063137165121326
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State-Space Models (SSMs) have emerged as efficient alternatives to transformers for sequential data tasks, offering linear or near-linear scalability with sequence length, making them ideal for long-sequence applications in NLP, vision, and edge AI, including real-time transcription, translation, and contextual search. These applications require lightweight, high-performance models for deployment on resource-constrained devices like laptops and PCs. Designing specialized accelerators for every emerging neural network is costly and impractical; instead, optimizing models for existing NPUs in AI PCs provides a scalable solution. To this end, we propose XAMBA, the first framework to enable and optimize SSMs on commercial off-the-shelf (COTS) state-of-the-art (SOTA) NPUs. XAMBA follows a three-step methodology: (1) enabling SSMs on NPUs, (2) optimizing performance to meet KPI requirements, and (3) trading accuracy for additional performance gains. After enabling SSMs on NPUs, XAMBA mitigates key bottlenecks using CumBA and ReduBA, replacing sequential CumSum and ReduceSum operations with matrix-based computations, significantly improving execution speed and memory efficiency. Additionally, ActiBA enhances performance by approximating expensive activation functions (e.g., Swish, Softplus) using piecewise linear mappings, reducing latency with minimal accuracy loss. Evaluations on an Intel Core Ultra Series 2 AI PC show that XAMBA achieves up to 4.8X speed-up over the baseline. Our implementation is available at https://github.com/arghadippurdue/XAMBA.

Related papers

TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design [4.251127523213289]
TriGen is a novel NPU architecture tailored for resource-constrained environments through software-hardware co-design.<n>We show that TriGen achieves an average 2.73x performance speedup and 52% less memory transfer over the baseline NPU design with negligible accuracy loss.
arXiv Detail & Related papers (2026-02-13T14:28:31Z)
xLLM Technical Report [57.13120905321185]
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework.<n>xLLM builds a novel decoupled service-engine architecture.<n>xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources.
arXiv Detail & Related papers (2025-10-16T13:53:47Z)
PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment.<n>We propose PT$2$-LLM, a post-training ternarization framework tailored for LLMs.<n>At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
eIQ Neutron: Redefining Edge-AI Inference with Integrated NPU and Compiler Innovations [4.776283807742058]
eIQ Neutron efficient-NPU is integrated into a commercial flagship MPU.<n>Our solution achieves an average speedup of 1.8x (4x peak) at equal TOPS and memory resources across standard AI-benchmarks.
arXiv Detail & Related papers (2025-09-17T19:45:51Z)
eMamba: Efficient Acceleration Framework for Mamba Models in Edge Computing [14.932572899503935]
State Space Model (SSM)-based machine learning architectures have recently gained significant attention for processing sequential data.<n>eMamba is a comprehensive end-to-end hardware acceleration framework specifically designed for deploying Mamba models on edge platforms.<n>We show eMamba achieves comparable accuracy to state-of-the-art techniques using 1.63-19.9$times$ fewer parameters.
arXiv Detail & Related papers (2025-08-14T06:08:05Z)
Systolic Array-based Accelerator for Structured State-Space Models [1.137896937254823]
State-Space Models (SSMs) process very long data sequences more efficiently than recurrent and Transformer-based models.<n>In this paper, we introduce a specialized hardware accelerator, EpochCore, for accelerating SSMs.<n>EpochCore achieves on average 2000x improvement in performance on LRA datasets compared to a GPU.
arXiv Detail & Related papers (2025-07-29T00:01:57Z)
Orthogonal Finetuning Made Scalable [87.49040247077389]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment.<n>We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity.<n>We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic.<n>These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
arXiv Detail & Related papers (2025-06-24T17:59:49Z)
POLARON: Precision-aware On-device Learning and Adaptive Runtime-cONfigurable AI acceleration [0.0]
This work presents a SIMD-enabled, multi-precision MAC engine that performs efficient multiply-accumulate operations.<n>The architecture incorporates a layer adaptive precision strategy to align computational accuracy with workload sensitivity.<n>Results demonstrate up to 2x improvement in PDP and 3x reduction in resource usage compared to SoTA designs.
arXiv Detail & Related papers (2025-06-10T13:33:02Z)
Accelerating 3D Gaussian Splatting with Neural Sorting and Axis-Oriented Rasterization [14.87046071090259]
3D Gaussian Splatting (3DGS) has recently gained significant attention for high-quality and efficient view synthesis.<n>Despite its impressive algorithmic performance, real-time rendering on resource-constrained devices remains a major challenge due to tight power and area budgets.
arXiv Detail & Related papers (2025-06-08T10:14:54Z)
Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels [12.77187564450236]
We introduce XY-Serve, a versatile, Ascend native, end-to-end production large language model (LLM) serving system. The core idea is an abstraction mechanism that smooths out the workload variability by decomposing computations into fine-grained meta primitives. For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes.
arXiv Detail & Related papers (2024-12-24T02:27:44Z)
Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology [2.968768532937366]
Spiking Neural Networks (SNNs) have emerged as a promising approach to improve the energy efficiency of machine learning models. We develop a hardware-software co-optimisation strategy to port software-trained deep neural networks (DNN) to reduced-precision spiking models.
arXiv Detail & Related papers (2024-10-07T05:04:13Z)
Enhancing MOTION2NX for Efficient, Scalable and Secure Image Inference using Convolutional Neural Networks [4.407841002228536]
We use the ABY2.0 SMPC protocol implemented on the C++ based MOTION2NX framework for secure convolutional neural network (CNN) inference application with semi-honest security. We also present a novel splitting algorithm that divides the computations at each CNN layer into multiple chunks.
arXiv Detail & Related papers (2024-08-29T09:50:21Z)
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models. We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
Stochastic Configuration Machines: FPGA Implementation [4.57421617811378]
configuration networks (SCNs) are a prime choice in industrial applications due to their merits and feasibility for data modelling. This paper aims to implement SCM models on a field programmable gate array (FPGA) and introduce binary-coded inputs to improve learning performance.
arXiv Detail & Related papers (2023-10-30T02:04:20Z)
Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks. It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping. It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100,3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference. We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
RAMAN: A Re-configurable and Sparse tinyML Accelerator for Inference on Edge [1.8293684411977293]
Deep Neural Network (DNN) based inference at the edge is challenging as these compute and data-intensive algorithms need to be implemented at low cost and low power. We present RAMAN, a Re-configurable and spArse tinyML Accelerator for infereNce on edge, architected to exploit the sparsity to reduce area (storage), power as well as latency.
arXiv Detail & Related papers (2023-06-10T17:25:58Z)
Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks. We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the emphexit point, emphexit point, and emphcompressing bits by soft policy iterations. Based on the latency and accuracy aware reward design, such an computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function Combinational Logic [4.119948826527649]
Field-programmable gate array (FPGA)-based accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms. This paper presents NullaNet Tiny, a framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
arXiv Detail & Related papers (2021-04-07T00:16:39Z)
A Privacy-Preserving-Oriented DNN Pruning and Mobile Acceleration Framework [56.57225686288006]
Weight pruning of deep neural networks (DNNs) has been proposed to satisfy the limited storage and computing capability of mobile edge devices. Previous pruning methods mainly focus on reducing the model size and/or improving performance without considering the privacy of user data. We propose a privacy-preserving-oriented pruning and mobile acceleration framework that does not require the private training dataset.
arXiv Detail & Related papers (2020-03-13T23:52:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.