Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
- URL: http://arxiv.org/abs/2503.11579v1
- Date: Fri, 14 Mar 2025 16:45:23 GMT
- Title: Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
- Authors: Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen
- Abstract summary: State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs. We build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. VAMBA achieves at least a 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640$\times$360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.
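For intuition, here is a minimal, self-contained sketch (not the authors' implementation) of the hybrid idea described above: a stack in which most layers are linear-time recurrent blocks over the video token sequence, with an occasional self-attention layer interleaved. The `SimpleSSMBlock` below is a simplified gated linear recurrence standing in for a real Mamba-2 block, and all layer counts and dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class SimpleSSMBlock(nn.Module):
    """Stand-in for a Mamba-2 block: a gated linear recurrence, O(L) in sequence length."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * dim)               # produces value and gate
        self.decay = nn.Parameter(torch.full((dim,), -2.0))  # per-channel decay logits
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, L, D)
        residual = x
        v, g = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)                         # decay in (0, 1)
        state = torch.zeros_like(v[:, 0])
        outs = []
        for t in range(v.shape[1]):                           # one linear pass over the sequence
            state = a * state + (1 - a) * v[:, t]
            outs.append(state)
        h = torch.stack(outs, dim=1) * torch.sigmoid(g)       # gated output
        return residual + self.out_proj(h)


class AttentionBlock(nn.Module):
    """Standard pre-norm self-attention block, O(L^2) in sequence length."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class HybridStack(nn.Module):
    """Mostly linear-time blocks, with an attention block interleaved every few layers."""

    def __init__(self, dim: int = 256, ssm_layers: int = 6, attn_every: int = 3):
        super().__init__()
        layers = []
        for i in range(ssm_layers):
            layers.append(SimpleSSMBlock(dim))
            if (i + 1) % attn_every == 0:
                layers.append(AttentionBlock(dim))
        self.layers = nn.ModuleList(layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            tokens = layer(tokens)
        return tokens


if __name__ == "__main__":
    x = torch.randn(1, 512, 256)   # toy sequence of 512 video tokens
    print(HybridStack()(x).shape)  # torch.Size([1, 512, 256])
```

Because the recurrent blocks cost O(L) per layer while self-attention costs O(L^2), pushing most of the depth into the recurrent blocks is what yields the memory and speed savings the abstract reports for very long video token sequences.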
Related papers
- An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of non-uniform density into varying cubes using Gumbel Softmax.
We train the model from a language backbone through three progressive stages, each incorporating long videos (on average 420 s at 1 fps) thanks to this perceiving efficiency.
With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by up to 8.72 points in accuracy.
arXiv Detail & Related papers (2025-04-21T17:57:21Z) - VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining
VideoMAP is a Hybrid Mamba-Transformer framework featuring a novel pre-training approach.
We show that VideoMAP exhibits impressive sample efficiency, significantly outperforming existing methods with less training data.
We also demonstrate the potential of VideoMAP as a visual encoder for multimodal large language models.
arXiv Detail & Related papers (2025-03-16T03:01:07Z) - Token-Efficient Long Video Understanding for Multimodal LLMs
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
mmMamba is a framework for developing linear-complexity native multimodal state space models. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures.
arXiv Detail & Related papers (2025-02-18T18:59:57Z) - Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing
State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models.
Fast Vision Mamba (FastVim) reduces the number of recurrent steps in Vision Mamba models while still retaining model performance.
Our experiments demonstrate state-of-the-art performance with dramatically improved throughput in a range of tasks.
arXiv Detail & Related papers (2025-02-01T23:35:20Z) - ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
ReTaKe enables VideoLLMs to process 8 times more frames (up to 2048), outperforming similar-sized models by 3-5% and even rivaling much larger ones on VideoMME, MLVU, LongVideoBench, and LVBench.
Our code is available at https://github.com/SCZwangxiao/video-ReTaKe.
arXiv Detail & Related papers (2024-12-29T15:42:24Z) - Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z) - Snakes and Ladders: Two Steps Up for VideoMamba
In this paper, we theoretically analyze the differences between self-attention and Mamba.
We propose VideoMambaPro models that surpass VideoMamba by 1.6-2.8% and 1.1-1.9% in top-1 accuracy.
Our two solutions are orthogonal to recent advances in Vision Mamba models and are likely to provide further improvements in future models.
arXiv Detail & Related papers (2024-06-27T08:45:31Z) - EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens
We present EVEREST, a surprisingly efficient MVA approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z)