Scaling State-Space Models on Multiple GPUs with Tensor Parallelism
- URL: http://arxiv.org/abs/2602.21144v1
- Date: Tue, 24 Feb 2026 17:47:54 GMT
- Title: Scaling State-Space Models on Multiple GPUs with Tensor Parallelism
- Authors: Anurag Dutt, Nimit Shah, Hazem Masarani, Anshul Gandhi
- Abstract summary: Selective state space models (SSMs) have rapidly become a compelling backbone for large language models. But in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU, making multi-GPU execution increasingly necessary. Although tensor parallelism (TP) is widely used to scale Transformer inference, applying it to selective SSM blocks is non-trivial because the SSM mixer couples large projections with a sequence-wise recurrent state update and local mixing whose efficiency depends on preserving locality and avoiding synchronization in the critical path. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges: enabling TTFT improvements via an SSM state cache across prefill and decode, partitioning the mixer's packed parameter tensor so that recurrent updates remain local while minimizing communication, and reducing TP aggregation overhead with quantized AllReduce. We evaluate on three representative SSM-based LLMs spanning pure-SSM and hybrid architectures - Mamba, Falcon-Mamba, and Zamba - on NVIDIA A6000 and A100 clusters. Our experiments show substantial throughput gains from tensor-parallel SSM inference, improving batch-request throughput by ~1.6-2.1x on 2 GPUs and ~2.6-4.0x on 4 GPUs for Mamba, with the largest benefits at long context lengths, and achieving a further ~10-18% throughput improvement from quantized all-reduce by lowering synchronization bandwidth overhead.
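The quantized AllReduce idea in the abstract can be made concrete with a short sketch. The following is a minimal illustration, not the paper's implementation: each rank symmetrically quantizes its partial result to int8, payloads and scales are exchanged with `all_gather`, and every rank dequantizes and accumulates locally. The function name and the per-tensor scaling choice are assumptions for illustration.
```python
# Minimal sketch of a quantized all-reduce (illustrative, not the
# paper's kernel): int8 payloads plus fp32 scales replace fp16 traffic.
import torch
import torch.distributed as dist

def quantized_all_reduce(x: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    # Per-tensor symmetric scale; per-channel scales would reduce error.
    scale = (x.abs().max().clamp(min=1e-8) / 127.0).reshape(1)
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)

    q_list = [torch.empty_like(q) for _ in range(world_size)]
    s_list = [torch.empty_like(scale) for _ in range(world_size)]
    dist.all_gather(q_list, q)
    dist.all_gather(s_list, scale)

    # Dequantize each rank's contribution and accumulate in fp32.
    out = torch.zeros_like(x, dtype=torch.float32)
    for qi, si in zip(q_list, s_list):
        out += qi.to(torch.float32) * si
    return out.to(x.dtype)
```
Assuming an fp16 baseline, the int8 payload roughly halves the bytes on the wire per element, at the cost of quantization error and a local reduction; a production implementation would typically use a reduce-scatter/all-gather decomposition rather than a full all-gather.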
Related papers
- Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
Parallel Track (PT) Transformer is a novel architectural paradigm that restructures the model to minimize cross-device dependencies. We report consistent improvements in serving efficiency, including up to 15-30% reduced time to first token, 2-12% reduced time per output token, and up to 31.90% increased throughput in both settings.
arXiv Detail & Related papers (2026-02-07T01:42:20Z)
- Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
- DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing
Fine-tuning large language models (LLMs) remains resource-intensive due to their sheer scale. We present DistZO2, a memory-efficient framework for distributed zeroth-order fine-tuning of LLMs.
arXiv Detail & Related papers (2025-07-03T22:53:34Z)
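For readers unfamiliar with zeroth-order fine-tuning, the core primitive is a two-point gradient estimate computed from forward passes alone. A minimal SPSA-style sketch follows; DistZO2's distributed partitioning is not reproduced here, and the function name is illustrative.
```python
# Two-point zeroth-order gradient estimate: perturb all parameters along
# a shared random direction, difference the two losses, and project.
import torch

@torch.no_grad()
def zo_grad_estimate(loss_fn, params, eps=1e-3):
    z = [torch.randn_like(p) for p in params]   # shared random direction
    for p, zi in zip(params, z):
        p.add_(eps * zi)
    loss_plus = loss_fn()                        # f(theta + eps * z)
    for p, zi in zip(params, z):
        p.sub_(2 * eps * zi)
    loss_minus = loss_fn()                       # f(theta - eps * z)
    for p, zi in zip(params, z):
        p.add_(eps * zi)                         # restore theta
    coeff = (loss_plus - loss_minus) / (2 * eps)
    return [coeff * zi for zi in z]              # projected gradient
```
Because no activations need to be stored for backpropagation, memory stays close to inference-level, which is what makes the zeroth-order approach attractive at LLM scale.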
- MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models.
arXiv Detail & Related papers (2025-04-03T04:20:44Z)
- Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. We present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones.
arXiv Detail & Related papers (2025-03-28T21:10:39Z)
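As a rough picture of what W8A8 means in practice, here is a toy scheme with per-channel int8 weight scales and a per-tensor activation scale. It is illustrative only: Quamba2's calibration and fused kernels are not shown, and all function names are made up.
```python
# Toy W8A8 linear layer: int8 weights (per-output-channel scales),
# int8 activations (per-tensor scale), int32 accumulation.
import torch

def quant_per_channel(w):
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    return torch.clamp((w / scale).round(), -127, 127).to(torch.int8), scale

def quant_per_tensor(x):
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    return torch.clamp((x / scale).round(), -127, 127).to(torch.int8), scale

def w8a8_linear(x, w):
    x_q, x_s = quant_per_tensor(x)     # x: (batch, in_features)
    w_q, w_s = quant_per_channel(w)    # w: (out_features, in_features)
    # int32 accumulation stands in for a fused int8 GEMM kernel.
    acc = x_q.to(torch.int32) @ w_q.t().to(torch.int32)
    return acc.float() * x_s * w_s.squeeze(1)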
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
This paper describes the design of Mixed-precision Auto-Regressive LINear kernels (MARLIN). It shows that batch sizes up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
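The memory saving behind 4-bit kernels like MARLIN starts with packing two 4-bit codes into each byte. A toy pack/unpack pair is sketched below; the real kernel fuses unpacking into the GEMM, so these standalone helpers are purely illustrative.
```python
# Toy int4 packing: values in [-8, 7] are offset to [0, 15] and two
# codes share one byte (even index -> low nibble, odd -> high nibble).
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    u = (q + 8).to(torch.uint8)              # last dim must be even
    return u[..., 0::2] | (u[..., 1::2] << 4)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    lo = (packed & 0x0F).to(torch.int8) - 8
    hi = (packed >> 4).to(torch.int8) - 8
    return torch.stack((lo, hi), dim=-1).flatten(-2)
```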
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs
FLUTE is a flexible lookup table engine for LUT-quantized LLMs. At batch sizes of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
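The core trick in LUT quantization is that dequantization becomes a table gather rather than arithmetic. A toy version follows; FLUTE's fused GPU kernel and memory layout are not reproduced, and the function name is an assumption.
```python
# Toy lookup-table dequantization: 4-bit codes index a 16-entry table
# of centroids, then a standard matmul runs on reconstructed weights.
# A real kernel fuses the gather into the GEMM to stay bandwidth-bound.
import torch

def lut_matmul(x, codes, table):
    # x: (batch, in), codes: (out, in) uint8 indices, table: (16,) values
    w = table[codes.long()]    # gather maps codes -> dequantized weights
    return x @ w.t()
```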
- Efficient State Space Model via Fast Tensor Convolution and Block Diagonalization
We propose a new state space layer based on multiple-input multiple-output SSMs, called efficient SSM (eSSM). eSSM is built on the convolutional representation of multi-input multi-output (MIMO) SSMs. In the model efficiency benchmark, the parameters of eSSM are only 12.89% of those of LSTM and 13.24% of those of Mamba.
arXiv Detail & Related papers (2024-02-23T12:36:31Z)
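A convolutional SSM representation lets the whole sequence be evaluated with an FFT instead of a step-by-step recurrence. Below is a minimal single-channel sketch; eSSM's MIMO and block-diagonal structure is not reproduced, and the kernel `k` is assumed to have been precomputed from the SSM parameters.
```python
# Causal convolution of an input sequence with an SSM kernel via FFT,
# in O(L log L) instead of the O(L) sequential recurrence steps.
import torch

def fft_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # u: (batch, length) input; k: (length,) precomputed SSM kernel.
    L = u.shape[-1]
    u_f = torch.fft.rfft(u, n=2 * L)   # zero-pad to avoid wrap-around
    k_f = torch.fft.rfft(k, n=2 * L)
    return torch.fft.irfft(u_f * k_f, n=2 * L)[..., :L]  # causal part
```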
- SqueezeLLM: Dense-and-Sparse Quantization
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference. We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition, which stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
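The Dense-and-Sparse idea can be pictured in a few lines: pull the largest-magnitude weights out into a sparse full-precision matrix and quantize the well-behaved remainder aggressively. The sketch below is a cartoon of that split, using uniform 4-bit codes instead of SqueezeLLM's sensitivity-based non-uniform ones; the outlier fraction is an assumption.
```python
# Cartoon of a dense-and-sparse decomposition: outliers stay in a
# sparse fp tensor, the dense remainder gets coarse 4-bit quantization.
import torch

def dense_sparse_split(w: torch.Tensor, outlier_frac: float = 0.005):
    k = max(1, int(w.numel() * outlier_frac))
    thresh = w.abs().flatten().topk(k).values.min()
    mask = w.abs() >= thresh
    sparse = (w * mask).to_sparse()     # outliers kept at full precision
    dense = w * ~mask                   # remainder, quantized below
    scale = dense.abs().max().clamp(min=1e-8) / 7.0   # signed 4-bit
    q = torch.clamp((dense / scale).round(), -8, 7).to(torch.int8)
    return q, scale, sparse

def reconstruct(q, scale, sparse):
    return q.float() * scale + sparse.to_dense()
```
Removing even a fraction of a percent of outliers shrinks the dynamic range the dense codes must cover, which is what makes the aggressive low-bit quantization of the remainder tolerable.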
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE). MoE achieves better accuracy and an over 80% reduction in computation, but leaves challenges for efficient deployment on FPGA. Our work, dubbed Edge-MoE, solves these challenges by introducing the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
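Since several entries above revolve around MoE, a minimal top-1 gating layer makes the compute-sparsity argument concrete. This is generic MoE routing, not the Edge-MoE FPGA datapath; dimensions and names are arbitrary.
```python
# Minimal top-1 MoE layer: a router picks one expert per token, so only
# a fraction of the expert FFN compute runs for any given input.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim: int = 64, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x):                    # x: (tokens, dim)
        idx = self.router(x).argmax(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = idx == e
            if sel.any():
                out[sel] = expert(x[sel])    # run only selected tokens
        return out
```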