Related papers: PackMamba: Efficient Processing of Variable-Length Sequences in Mamba training

PackMamba: Efficient Processing of Variable-Length Sequences in Mamba training

URL: http://arxiv.org/abs/2408.03865v2
Date: Wed, 21 Aug 2024 12:08:00 GMT
Title: PackMamba: Efficient Processing of Variable-Length Sequences in Mamba training
Authors: Haoran Xu, Ziqian Liu, Rong Fu, Zhongling Su, Zerui Wang, Zheng Cai, Zhilin Pei, Xingcheng Zhang,
Abstract summary: Mamba, emerging as a groundbreaking architecture in the field of generative AI, demonstrates remarkable proficiency in handling elongated sequences. Existing training framework of Mamba presents inefficiency with variable-length sequence inputs. We propose PackMamba, a high- throughput Mamba that efficiently handles variable-length sequences.
Score: 13.926804198202582
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the evolution of large language models, traditional Transformer models become computationally demanding for lengthy sequences due to the quadratic growth in computation with respect to the sequence length. Mamba, emerging as a groundbreaking architecture in the field of generative AI, demonstrates remarkable proficiency in handling elongated sequences with reduced computational and memory complexity. Nevertheless, the existing training framework of Mamba presents inefficiency with variable-length sequence inputs. Either single-sequence training results in low GPU utilization, or batched processing of variable-length sequences to a maximum length incurs considerable memory and computational overhead. To address this problem, we analyze the performance of bottleneck operators in Mamba under diverse tensor shapes and proposed PackMamba, a high-throughput Mamba that efficiently handles variable-length sequences. Diving deep into state-space models (SSMs), we modify the parallel operators to avoid passing information between individual sequences while maintaining high performance. Experimental results on an NVIDIA A100 GPU demonstrate throughput exceeding the baseline single-sequence processing scheme: 3.06x speedup on the 1.4B model and 2.62x on the 2.8B model.

Related papers

MemMamba: Rethinking Memory Patterns in State Space Model [6.537535831000493]
We show that selective state-space models such as Mamba have high efficiency with O(n) time and O(1) recurrent inference, yet their long-range memory decays exponentially.<n>Inspired by how humans distill and retain salient information when reading long documents, we propose MemMamba.<n>MemMamba achieves significant improvements over existing Mamba variants and Transformers on long-sequence benchmarks.
arXiv Detail & Related papers (2025-09-28T14:40:58Z)
Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.47928738482719]
Linear State Space Models (SSMs) offer remarkable performance gains in sequence modeling.<n>Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations.<n>We introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts.
arXiv Detail & Related papers (2025-06-22T19:26:55Z)
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models [36.0400717590138]
We present OmniMamba, the first linear-architecture-based multimodal generation model. It generates both text and images through a unified next-token prediction paradigm. It achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks.
arXiv Detail & Related papers (2025-03-11T17:59:46Z)
Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale [68.6602625868888]
We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations. Operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression. We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids.
arXiv Detail & Related papers (2025-02-25T19:47:20Z)
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. We propose the MobileMamba framework, which balances efficiency and performance. MobileMamba achieves up to 83.6% on Top-1, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
Bi-Mamba: Towards Accurate 1-Bit State Space Models [28.478762133816726]
Bi-Mamba is a scalable and powerful 1-bit Mamba architecture designed for more efficient large language models. Bi-Mamba achieves performance comparable to its full-precision counterparts (e.g., FP16 or BF16) and much better accuracy than post-training-binarization (PTB) Mamba baselines.
arXiv Detail & Related papers (2024-11-18T18:59:15Z)
Taipan: Efficient and Expressive State Space Language Models with Selective Attention [100.16383527459429]
Long-context language modeling is a significant challenge in Natural Language Processing (NLP) Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs) Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
arXiv Detail & Related papers (2024-10-24T09:25:37Z)
Bidirectional Gated Mamba for Sequential Recommendation [56.85338055215429]
Mamba, a recent advancement, has exhibited exceptional performance in time series prediction. We introduce a new framework named Selective Gated Mamba ( SIGMA) for Sequential Recommendation. Our results indicate that SIGMA outperforms current models on five real-world datasets.
arXiv Detail & Related papers (2024-08-21T09:12:59Z)
DeciMamba: Exploring the Length Extrapolation Potential of Mamba [89.07242846058023]
We introduce DeciMamba, a context-extension method specifically designed for Mamba. We show that DeciMamba can extrapolate context lengths 25x longer than the ones seen during training, and does so without utilizing additional computational resources.
arXiv Detail & Related papers (2024-06-20T17:40:18Z)
MambaTS: Improved Selective State Space Models for Long-term Time Series Forecasting [12.08746904573603]
Mamba, based on selective state space models (SSMs), has emerged as a competitive alternative to Transformer. We propose four targeted improvements, leading to MambaTS. Experiments conducted on eight public datasets demonstrate that MambaTS achieves new state-of-the-art performance.
arXiv Detail & Related papers (2024-05-26T05:50:17Z)
LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences. We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook. LongVQ effectively maintains dynamic global and local patterns, which helps to complement the lack of long-range dependency issues.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module. We identify that a key weakness of such models is their inability to perform content-based reasoning. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even blocks (Mamba) As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer [36.75562615596186]
We propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans. MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers. Experiments show that a decoder-only MASFormer model of 1.3B parameters can achieve competitive performance to vanilla transformers with full attention.
arXiv Detail & Related papers (2023-10-19T03:32:05Z)
Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) is a blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.