Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
- URL: http://arxiv.org/abs/2403.09626v1
- Date: Thu, 14 Mar 2024 17:57:07 GMT
- Title: Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
- Authors: Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, Limin Wang
- Abstract summary: The state space model Mamba shows promising traits for extending its success in long-sequence modeling to video modeling.
We conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority.
Our experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs.
- Score: 49.88140766026886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNN, 3D CNN, and Transformers. The newly proposed architecture of state space model, e.g., Mamba, shows promising traits to extend its success in long sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, in this work, we conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs. We hope this work could provide valuable data points and insights for future research on video understanding. Code is public: https://github.com/OpenGVLab/video-mamba-suite.
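To make the "Mamba as an alternative to attention" idea concrete, here is a minimal, illustrative sketch (not one of the suite's 14 actual models/modules) of a Mamba block used as a temporal token mixer over per-frame video features. It assumes the open-source `mamba_ssm` package from the official Mamba repository; the `TemporalMambaMixer` class, its default sizes, and the residual/pre-norm layout are hypothetical choices for illustration.

```python
# Illustrative sketch only: a Mamba block mixing information across the time
# axis of clip features, standing in where a Transformer encoder layer would go.
# Assumes the open-source `mamba_ssm` package (pip install mamba-ssm); the class
# below is hypothetical and not taken from the Video Mamba Suite codebase.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # selective SSM block from the official Mamba repo


class TemporalMambaMixer(nn.Module):
    """Mixes per-frame features along the temporal axis with a Mamba block."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # d_state / d_conv / expand follow the defaults suggested by the Mamba authors.
        self.mamba = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim), e.g. one pooled token per frame.
        return x + self.mamba(self.norm(x))  # pre-norm + residual connection


if __name__ == "__main__":
    feats = torch.randn(2, 64, 768, device="cuda")  # two 64-frame clips
    mixer = TemporalMambaMixer(dim=768).cuda()      # mamba_ssm kernels require CUDA
    print(mixer(feats).shape)                       # torch.Size([2, 64, 768])
```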
Related papers
- MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications.
Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features.
We conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
arXiv Detail & Related papers (2024-07-10T23:02:45Z)
- VideoMambaPro: A Leap Forward for Mamba in Video Understanding [10.954210339694841]
Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention.
In NLP, Mamba has surfaced as an efficient alternative for transformer models.
VideoMambaPro shows state-of-the-art video action recognition performance compared to transformer models.
arXiv Detail & Related papers (2024-06-27T08:45:31Z)
- MambaOut: Do We Really Need Mamba for Vision? [70.60495392198686]
Mamba, an architecture whose RNN-like token mixer is a state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism.
This paper conceptually concludes that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics.
We construct a series of models named MambaOut by stacking Mamba blocks while removing their core token mixer, the SSM.
arXiv Detail & Related papers (2024-05-13T17:59:56Z)
- Visual Mamba: A Survey and New Outlooks [33.90213491829634]
Mamba, a recent selective structured state space model, excels in long sequence modeling.
Since January 2024, Mamba has been actively applied to diverse computer vision tasks.
This paper reviews visual Mamba approaches, analyzing over 200 papers.
arXiv Detail & Related papers (2024-04-29T16:51:30Z)
- A Survey on Visual Mamba [16.873917203618365]
State space models (SSMs) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently demonstrated significant promise in long-sequence modeling.
Since the self-attention mechanism in transformers has quadratic complexity with respect to image size and imposes increasing computational demands, researchers are now exploring how to adapt Mamba for computer vision tasks.
This paper is the first comprehensive survey aiming to provide an in-depth analysis of Mamba models in the field of computer vision.
arXiv Detail & Related papers (2024-04-24T16:23:34Z)
- ReMamber: Referring Image Segmentation with Mamba Twister [51.291487576255435]
ReMamber is a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block.
The Mamba Twister explicitly models image-text interaction, and fuses textual and visual features through its unique channel and spatial twisting mechanism.
arXiv Detail & Related papers (2024-03-26T16:27:37Z)
- VideoMamba: State Space Model for Efficient Video Understanding [46.17083617091239]
VideoMamba overcomes the limitations of existing 3D convolutional neural networks and video transformers.
Its linear-complexity operator enables efficient long-term modeling (a minimal scan sketch illustrating this linear cost appears after this list).
VideoMamba sets a new benchmark for video understanding.
arXiv Detail & Related papers (2024-03-11T17:59:34Z)
- Probabilistic Adaptation of Text-to-Video Models [181.84311524681536]
Video Adapter is capable of incorporating the broad knowledge and preserving the high fidelity of a large pretrained video model in a task-specific small video model.
Video Adapter is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data.
arXiv Detail & Related papers (2023-06-02T19:00:17Z)
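Several entries above motivate Mamba by the linear-time recurrence of its SSM token mixer versus the quadratic cost of self-attention. The sketch below contrasts the two in plain PyTorch using a simplified, time-invariant diagonal state-space recurrence; it is for intuition only and is not the selective, hardware-aware scan that Mamba actually uses.

```python
# Intuition only: why an SSM-style token mixer scales linearly in sequence
# length while full self-attention scales quadratically. This is a simplified
# diagonal, time-invariant recurrence, NOT Mamba's selective scan.
import torch


def attention_mix(x: torch.Tensor) -> torch.Tensor:
    # x: (L, D). Full self-attention scores every token pair: O(L^2 * D).
    scores = x @ x.T / x.shape[-1] ** 0.5        # (L, L) pairwise scores
    return torch.softmax(scores, dim=-1) @ x     # (L, D)


def ssm_mix(x: torch.Tensor, a: torch.Tensor, b: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    # x: (L, D); a, b, c: (D,) per-channel SSM parameters (fixed here).
    # Recurrence h_t = a * h_{t-1} + b * x_t, y_t = c * h_t runs in O(L * D).
    L, D = x.shape
    h = torch.zeros(D)
    ys = []
    for t in range(L):                           # one sequential pass over the clip
        h = a * h + b * x[t]
        ys.append(c * h)
    return torch.stack(ys)                       # (L, D)


if __name__ == "__main__":
    L, D = 1024, 64                              # e.g. 1024 frame tokens
    x = torch.randn(L, D)
    a = torch.full((D,), 0.9)                    # decay near 1 keeps long-range context
    b, c = torch.ones(D), torch.ones(D)
    print(attention_mix(x).shape, ssm_mix(x, a, b, c).shape)
```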