H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
- URL: http://arxiv.org/abs/2501.04302v1
- Date: Wed, 08 Jan 2025 06:26:16 GMT
- Title: H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
- Authors: Siran Chen, Yuxiao Luo, Yue Ma, Yu Qiao, Yali Wang,
- Abstract summary: We propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos.
C-Mamba contains various types of structure state space models, which can effectively capture multi-granularity video context for different temporal resolutions.
Q-Mamba flexibly transforms the current frame as the learnable query, and attentively selects multi-granularity video context into query.
- Score: 26.536195829285855
- License:
- Abstract: With the prevalence of Multimodal Large Language Models(MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical to interactively analyze what will happen in the procedure of autonomous driving. However, videos in such a dynamical scene that often contains complex spatial-temporal movements, which restricts the generalization capacity of the existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules, including Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structure state space models, which can effectively capture multi-granularity video context for different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame as the learnable query, and attentively selects multi-granularity video context into query. Consequently, it can adaptively integrate all the video contexts of multi-scale temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows the remarkable performance on multi-modal video tasks in autonomous driving, e.g., for risk object detection, it outperforms the previous SOTA method with 5.5% mIoU improvement.
Related papers
- UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving [18.189392365510848]
UniMLVG is a unified framework designed to generate extended street multi-perspective videos under precise control.
By integrating single- and multi-view driving videos into the training data, our approach updates cross-frame and cross-view modules across three stages.
Our framework achieves improvements of 21.4% in FID and 36.5% in FVD.
arXiv Detail & Related papers (2024-12-06T08:27:53Z) - Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework.
Our approach significantly reduces the memory footprint compared to standard gradient checkpointing.
By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z) - VideoMamba: Spatio-Temporal Selective State Space Model [18.310796559944347]
VideoMamba is a novel adaptation of the pure Mamba architecture, specifically designed for video recognition.
VideoMamba is not only resource-efficient but also effective in capturing long-range dependency in videos.
Our work highlights the potential of VideoMamba as a powerful tool for video understanding, offering a simple yet effective baseline for future research in video analysis.
arXiv Detail & Related papers (2024-07-11T13:11:21Z) - VideoMamba: State Space Model for Efficient Video Understanding [46.17083617091239]
VideoMamba overcomes the limitations of existing 3D convolution neural networks and video transformers.
Its linear-complexity operator enables efficient long-term modeling.
VideoMamba sets a new benchmark for video understanding.
arXiv Detail & Related papers (2024-03-11T17:59:34Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected
Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.