VideoMamba: Spatio-Temporal Selective State Space Model
- URL: http://arxiv.org/abs/2407.08476v1
- Date: Thu, 11 Jul 2024 13:11:21 GMT
- Title: VideoMamba: Spatio-Temporal Selective State Space Model
- Authors: Jinyoung Park, Hee-Seon Kim, Kangwook Ko, Minbeom Kim, Changick Kim
- Abstract summary: VideoMamba is a novel adaptation of the pure Mamba architecture, specifically designed for video recognition.
VideoMamba is not only resource-efficient but also effective in capturing long-range dependencies in videos.
Our work highlights the potential of VideoMamba as a powerful tool for video understanding, offering a simple yet effective baseline for future research in video analysis.
- Score: 18.310796559944347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce VideoMamba, a novel adaptation of the pure Mamba architecture, specifically designed for video recognition. Unlike transformers, which rely on self-attention mechanisms that incur high computational costs due to their quadratic complexity, VideoMamba leverages Mamba's linear complexity and selective SSM mechanism for more efficient processing. The proposed Spatio-Temporal Forward and Backward SSM allows the model to effectively capture the complex relationship between non-sequential spatial and sequential temporal information in video. Consequently, VideoMamba is not only resource-efficient but also effective in capturing long-range dependencies in videos, as demonstrated by competitive performance and outstanding efficiency on a variety of video understanding benchmarks. Our work highlights the potential of VideoMamba as a powerful tool for video understanding, offering a simple yet effective baseline for future research in video analysis.
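The core mechanism described in the abstract, a selective SSM scanned forward and backward over the flattened spatio-temporal patch sequence, can be illustrated with a minimal sketch. This is not the authors' code: the names `selective_scan` and `SpatioTemporalBiSSM` are hypothetical, the scan is a naive Python loop rather than the optimized Mamba kernel, and the projection/gating is reduced to the bare recurrence.

```python
# Minimal sketch (assumed, not the paper's implementation) of a forward/backward
# selective SSM over flattened video patch tokens.
import torch
import torch.nn as nn


def selective_scan(x, dt, A, B, C):
    """Sequential (O(L)) scan of a diagonal state-space model.

    x:  (batch, length, dim)    input tokens
    dt: (batch, length, dim)    input-dependent step sizes (the "selective" part)
    A:  (dim, state)            diagonal state transition parameters
    B:  (batch, length, state)  input-dependent input matrix
    C:  (batch, length, state)  input-dependent output matrix
    """
    b, l, d = x.shape
    n = A.shape[-1]
    h = x.new_zeros(b, d, n)                                  # hidden state
    ys = []
    for t in range(l):
        dA = torch.exp(dt[:, t].unsqueeze(-1) * A)            # (b, d, n) discretised A
        dB = dt[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)    # (b, d, n) discretised B
        h = dA * h + dB * x[:, t].unsqueeze(-1)               # state update
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))         # read-out, (b, d)
    return torch.stack(ys, dim=1)                             # (b, l, d)


class SpatioTemporalBiSSM(nn.Module):
    """Runs the selective scan forward and backward over the flattened
    (time x space) patch sequence and sums the two directions."""

    def __init__(self, dim, state=16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(dim, state))        # negative for stability
        self.proj = nn.Linear(dim, dim + 2 * state)           # produces dt, B, C
        self.state = state

    def forward(self, tokens):                                 # (b, T*P, dim)
        dt, B, C = self.proj(tokens).split(
            [tokens.shape[-1], self.state, self.state], dim=-1)
        dt = nn.functional.softplus(dt)
        fwd = selective_scan(tokens, dt, self.A, B, C)
        bwd = selective_scan(tokens.flip(1), dt.flip(1), self.A,
                             B.flip(1), C.flip(1)).flip(1)
        return fwd + bwd


# Example: 2 clips, 8 frames x 196 patches, 192-dim tokens.
out = SpatioTemporalBiSSM(192)(torch.randn(2, 8 * 196, 192))
print(out.shape)  # torch.Size([2, 1568, 192])
```

Scanning the sequence in both directions gives every patch token access to context on either side of it, which is how the linear-time recurrence substitutes for quadratic self-attention in this sketch.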
Related papers
- VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining [31.44538839153902]
VideoMAP is a Hybrid Mamba-Transformer framework featuring a novel pre-training approach.
We show that VideoMAP exhibits impressive sample efficiency, significantly outperforming existing methods with less training data.
We also demonstrate the potential of VideoMAP as a visual encoder for multimodal large language models.
arXiv Detail & Related papers (2025-03-16T03:01:07Z) - A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts [29.05750068740863]
Video-LLaMA is an effective tool for generating video summaries, but it cannot effectively unify and optimize the modeling of temporal and spatial features.
We propose MiLoRA-ViSum to more efficiently capture complex temporal dynamics and spatial relationships inherent in video data.
MiLoRA-ViSum achieves the best summarization performance compared to state-of-the-art models, while maintaining significantly lower computational costs.
arXiv Detail & Related papers (2025-03-08T05:20:52Z) - H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving [26.536195829285855]
We propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos.
C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context at different temporal resolutions.
Q-Mamba flexibly transforms the current frame into a learnable query and attentively selects multi-granularity video context into the query.
arXiv Detail & Related papers (2025-01-08T06:26:16Z) - STNMamba: Mamba-based Spatial-Temporal Normality Learning for Video Anomaly Detection [48.997518615379995]
Video anomaly detection (VAD) has been extensively researched due to its potential for intelligent video systems.
Most existing methods based on CNNs and transformers still suffer from substantial computational burdens.
We propose a lightweight and effective Mamba-based network named STNMamba to enhance the learning of spatial-temporal normality.
arXiv Detail & Related papers (2024-12-28T08:49:23Z) - Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement [54.427965535613886]
Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and computer vision.
In this work, we introduce Mamba-SEUNet, an innovative architecture that integrates Mamba with U-Net for SE tasks.
arXiv Detail & Related papers (2024-12-21T13:43:51Z) - Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework.
Our approach significantly reduces the memory footprint compared to standard gradient checkpointing.
By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z) - MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs.
We propose the MobileMamba framework, which balances efficiency and performance.
MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z) - MambaSCI: Efficient Mamba-UNet for Quad-Bayer Patterned Video Snapshot Compressive Imaging [23.69262715870974]
Existing color video SCI reconstruction algorithms are designed based on the traditional Bayer pattern.
MambaSCI surpasses state-of-the-art methods with lower computational and memory costs.
arXiv Detail & Related papers (2024-10-18T07:02:57Z) - DeciMamba: Exploring the Length Extrapolation Potential of Mamba [89.07242846058023]
We introduce DeciMamba, a context-extension method specifically designed for Mamba.
We show that DeciMamba can extrapolate context lengths 25x longer than the ones seen during training, and does so without utilizing additional computational resources.
arXiv Detail & Related papers (2024-06-20T17:40:18Z) - VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding [15.959757105308238]
Video LMMs rely on either image or video encoders to process visual inputs, each of which has its own limitations.
We introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling).
Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering.
arXiv Detail & Related papers (2024-06-13T17:59:59Z) - Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation [41.54814517077309]
We introduce a novel diffusion architecture, Diffusion Mamba (DiM), which foregoes traditional attention mechanisms in favor of a scalable alternative.
DiM achieves rapid inference times and reduced computational load, maintaining linear complexity with respect to sequence length.
Results affirm the scalability and efficiency of DiM, establishing a new benchmark for image and video generation techniques.
arXiv Detail & Related papers (2024-05-24T18:50:27Z) - Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding [49.88140766026886]
The state space model Mamba shows promising traits for extending its success in long-sequence modeling to video modeling.
We conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority.
Our experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs.
arXiv Detail & Related papers (2024-03-14T17:57:07Z) - VideoMamba: State Space Model for Efficient Video Understanding [46.17083617091239]
VideoMamba overcomes the limitations of existing 3D convolutional neural networks and video transformers.
Its linear-complexity operator enables efficient long-term modeling.
VideoMamba sets a new benchmark for video understanding.
arXiv Detail & Related papers (2024-03-11T17:59:34Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporally-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z) - Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.