Related papers: VideoMambaPro: A Leap Forward for Mamba in Video Understanding

VideoMambaPro: A Leap Forward for Mamba in Video Understanding

URL: http://arxiv.org/abs/2406.19006v1
Date: Thu, 27 Jun 2024 08:45:31 GMT
Title: VideoMambaPro: A Leap Forward for Mamba in Video Understanding
Authors: Hui Lu, Albert Ali Salah, Ronald Poppe,
Abstract summary: Video understanding requires the extraction of rich-temporal representations, which transformer models achieve through self-attention. In NLP, Mamba has surfaced as an efficient alternative for transformer models. VideoMambaPro shows state-of-the-art video action recognition performance compared to transformer models.
Score: 10.954210339694841
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention. Unfortunately, self-attention poses a computational burden. In NLP, Mamba has surfaced as an efficient alternative for transformers. However, Mamba's successes do not trivially extend to computer vision tasks, including those in video analysis. In this paper, we theoretically analyze the differences between self-attention and Mamba. We identify two limitations in Mamba's token processing: historical decay and element contradiction. We propose VideoMambaPro (VMP) that solves the identified limitations by adding masked backward computation and elemental residual connections to a VideoMamba backbone. VideoMambaPro shows state-of-the-art video action recognition performance compared to transformer models, and surpasses VideoMamba by clear margins: 7.9% and 8.1% top-1 on Kinetics-400 and Something-Something V2, respectively. Our VideoMambaPro-M model achieves 91.9% top-1 on Kinetics-400, only 0.2% below InternVideo2-6B but with only 1.2% of its parameters. The combination of high performance and efficiency makes VideoMambaPro an interesting alternative for transformer models.

Related papers

MatIR: A Hybrid Mamba-Transformer Image Restoration Model [95.17418386046054]
We propose a Mamba-Transformer hybrid image restoration model called MatIR. MatIR cross-cycles the blocks of the Transformer layer and the Mamba layer to extract features. In the Mamba module, we introduce the Image Inpainting State Space (IRSS) module, which traverses along four scan paths.
arXiv Detail & Related papers (2025-01-30T14:55:40Z)
MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods [10.08926186041001]
Mamba is an efficient sequence model that rivals Transformers. Existing quantization methods, which have been effective for CNN and Transformer models, appear inadequate for Mamba. We propose MambaQuant, a post-training quantization framework consisting of: 1) Karhunen-Loeve Transformation (KLT) enhanced rotation, rendering the rotation matrix adaptable to diverse channel distributions, and 2) Smooth-Fused rotation, which equalizes channel variances and can merge additional parameters into model weights.
arXiv Detail & Related papers (2025-01-23T08:57:33Z)
MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation [63.73137438677585]
MaskMamba is a novel hybrid model that combines Mamba and Transformer architectures. It achieves a remarkable $54.44%$ improvement in inference speed at a resolution of $2048times 2048$ over Transformer.
arXiv Detail & Related papers (2024-09-30T04:28:55Z)
Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis [18.68317727349427]
It is too early to conclude that Mamba is a better alternative to transformers for speech. We evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis.
arXiv Detail & Related papers (2024-07-13T00:35:21Z)
MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. We show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput.
arXiv Detail & Related papers (2024-07-10T23:02:45Z)
An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers. We present a direct comparison between 8B-context Mamba, Mamba-2, and Transformer models trained on the same datasets. We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z)
Demystify Mamba in Vision: A Linear Attention Perspective [72.93213667713493]
Mamba is an effective state space model with linear computation complexity. We show that Mamba shares surprising similarities with linear attention Transformer. We propose a Mamba-Like Linear Attention (MLLA) model by incorporating the merits of these two key designs into linear attention.
arXiv Detail & Related papers (2024-05-26T15:31:09Z)
MambaOut: Do We Really Need Mamba for Vision? [70.60495392198686]
Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism. This paper conceptually concludes that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. We construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM.
arXiv Detail & Related papers (2024-05-13T17:59:56Z)
Visual Mamba: A Survey and New Outlooks [33.90213491829634]
Mamba, a recent selective structured state space model, excels in long sequence modeling. Since January 2024, Mamba has been actively applied to diverse computer vision tasks. This paper reviews visual Mamba approaches, analyzing over 200 papers.
arXiv Detail & Related papers (2024-04-29T16:51:30Z)
ReMamber: Referring Image Segmentation with Mamba Twister [51.291487576255435]
ReMamber is a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block. The Mamba Twister explicitly models image-text interaction, and fuses textual and visual features through its unique channel and spatial twisting mechanism.
arXiv Detail & Related papers (2024-03-26T16:27:37Z)
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding [49.88140766026886]
State space model, Mamba, shows promising traits to extend its success in long sequence modeling to video modeling. We conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority. Our experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs.
arXiv Detail & Related papers (2024-03-14T17:57:07Z)
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts [4.293771840782942]
State Space Models (SSMs) have become serious contenders in the field of sequential modeling. MoE has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE.
arXiv Detail & Related papers (2024-01-08T18:35:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.