VideoMamba: Spatio-Temporal Selective State Space Model
- URL: http://arxiv.org/abs/2407.08476v1
- Date: Thu, 11 Jul 2024 13:11:21 GMT
- Title: VideoMamba: Spatio-Temporal Selective State Space Model
- Authors: Jinyoung Park, Hee-Seon Kim, Kangwook Ko, Minbeom Kim, Changick Kim,
- Abstract summary: VideoMamba is a novel adaptation of the pure Mamba architecture, specifically designed for video recognition.
VideoMamba is not only resource-efficient but also effective in capturing long-range dependency in videos.
Our work highlights the potential of VideoMamba as a powerful tool for video understanding, offering a simple yet effective baseline for future research in video analysis.
- Score: 18.310796559944347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce VideoMamba, a novel adaptation of the pure Mamba architecture, specifically designed for video recognition. Unlike transformers that rely on self-attention mechanisms leading to high computational costs by quadratic complexity, VideoMamba leverages Mamba's linear complexity and selective SSM mechanism for more efficient processing. The proposed Spatio-Temporal Forward and Backward SSM allows the model to effectively capture the complex relationship between non-sequential spatial and sequential temporal information in video. Consequently, VideoMamba is not only resource-efficient but also effective in capturing long-range dependency in videos, demonstrated by competitive performance and outstanding efficiency on a variety of video understanding benchmarks. Our work highlights the potential of VideoMamba as a powerful tool for video understanding, offering a simple yet effective baseline for future research in video analysis.
Related papers
- MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs.
We propose the MobileMamba framework, which balances efficiency and performance.
MobileMamba achieves up to 83.6% on Top-1, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z) - MambaSCI: Efficient Mamba-UNet for Quad-Bayer Patterned Video Snapshot Compressive Imaging [23.69262715870974]
Existing color video SCI reconstruction algorithms are designed based on the traditional Bayer pattern.
MambaSCI surpasses state-of-the-art methods with lower computational and memory costs.
arXiv Detail & Related papers (2024-10-18T07:02:57Z) - DeciMamba: Exploring the Length Extrapolation Potential of Mamba [89.07242846058023]
We introduce DeciMamba, a context-extension method specifically designed for Mamba.
We show that DeciMamba can extrapolate context lengths 25x longer than the ones seen during training, and does so without utilizing additional computational resources.
arXiv Detail & Related papers (2024-06-20T17:40:18Z) - VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding [15.959757105308238]
Video LMMs rely on either image or video encoders to process visual inputs, each of which has its own limitations.
We introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling)
Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering.
arXiv Detail & Related papers (2024-06-13T17:59:59Z) - Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation [41.54814517077309]
We introduce a novel diffusion architecture, Diffusion Mamba (DiM), which foregoes traditional attention mechanisms in favor of a scalable alternative.
DiM achieves rapid inference times and reduced computational load, maintaining linear complexity with respect to sequence length.
Results affirm the scalability and efficiency of DiM, establishing a new benchmark for image and video generation techniques.
arXiv Detail & Related papers (2024-05-24T18:50:27Z) - Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding [49.88140766026886]
State space model, Mamba, shows promising traits to extend its success in long sequence modeling to video modeling.
We conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority.
Our experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs.
arXiv Detail & Related papers (2024-03-14T17:57:07Z) - VideoMamba: State Space Model for Efficient Video Understanding [46.17083617091239]
VideoMamba overcomes the limitations of existing 3D convolution neural networks and video transformers.
Its linear-complexity operator enables efficient long-term modeling.
VideoMamba sets a new benchmark for video understanding.
arXiv Detail & Related papers (2024-03-11T17:59:34Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z) - Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local global contexts.
Video-FocalNet is based on a-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably well against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.