MambaVF: State Space Model for Efficient Video Fusion
- URL: http://arxiv.org/abs/2602.06017v1
- Date: Thu, 05 Feb 2026 18:53:47 GMT
- Title: MambaVF: State Space Model for Efficient Video Fusion
- Authors: Zixiang Zhao, Yukun Cui, Lilun Deng, Haowen Bai, Haotong Qin, Tao Feng, Konrad Schindler,
- Abstract summary: MambaVF is an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. We highlight that MambaVF is highly efficient, reducing parameters by up to 92.25% and computational FLOPs by 88.79%, with a 2.1x speedup over existing methods.
- Score: 44.038619918204496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF introduces a lightweight SSM-based fusion module that replaces conventional flow-guided alignment with a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF is highly efficient, reducing parameters by up to 92.25% and computational FLOPs by 88.79% while achieving a 2.1x speedup over existing methods. Project page: https://mambavf.github.io
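To make the abstract's core idea concrete, below is a minimal, self-contained sketch (in NumPy) of what a sequential state update with bidirectional temporal scanning can look like when fusing two aligned video streams. It is an illustrative toy under our own assumptions (a diagonal state matrix, a naive per-frame merge of the two modalities, hypothetical function names), not the MambaVF implementation.

```python
# Minimal sketch of SSM-style temporal fusion with bidirectional scanning.
# NOTE: illustrative toy only, not the MambaVF code; the diagonal SSM
# parameters and the simple averaging merge are assumptions.
import numpy as np


def ssm_scan(x, a, b, c):
    """Linear-time recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    x: (T, D) per-frame features; a, b, c: (D,) diagonal SSM parameters.
    """
    T, D = x.shape
    h = np.zeros(D)
    y = np.empty_like(x)
    for t in range(T):              # single pass over the sequence: O(T) time
        h = a * h + b * x[t]        # state update carries temporal context
        y[t] = c * h                # read-out of the hidden state
    return y


def bidirectional_fuse(frames_mod1, frames_mod2, a, b, c):
    """Fuse two aligned video streams without explicit motion estimation.

    frames_mod1/2: (T, D) features of the two inputs (e.g. infrared/visible).
    Forward and backward scans are averaged so every output frame sees both
    past and future context.
    """
    x = 0.5 * (frames_mod1 + frames_mod2)      # naive per-frame merge (assumption)
    fwd = ssm_scan(x, a, b, c)                 # past -> future context
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]     # future -> past context
    return 0.5 * (fwd + bwd)


if __name__ == "__main__":
    T, D = 8, 16                               # 8 frames, 16-dim features
    rng = np.random.default_rng(0)
    ir, vis = rng.normal(size=(T, D)), rng.normal(size=(T, D))
    a, b, c = np.full(D, 0.9), np.full(D, 0.1), np.ones(D)
    fused = bidirectional_fuse(ir, vis, a, b, c)
    print(fused.shape)                         # (8, 16)
```

The point of the sketch is the complexity argument: each scan touches every frame once, so temporal context is aggregated in O(T) time with a fixed-size state, and no optical flow or feature warping is needed.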
Related papers
- Trajectory-aware Shifted State Space Models for Online Video Super-Resolution [57.87099307245989]
This paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba). TS-Mamba first constructs the trajectories within a video to select the most similar tokens from the previous frames. Our TS-Mamba achieves state-of-the-art performance in most cases, with over a 22.7% reduction in complexity (MACs).
arXiv Detail & Related papers (2025-08-14T08:42:15Z) - MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition [35.69956488221345]
MoMa is an efficient adapter framework that achieves full spatial-temporal modeling. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs.
arXiv Detail & Related papers (2025-06-29T15:14:55Z) - A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking [46.829949073521284]
We propose Unified Video Fusion (UniVF), a novel and unified framework for video fusion. UniVF leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. We also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks.
arXiv Detail & Related papers (2025-05-26T11:45:10Z) - VADMamba: Exploring State Space Models for Fast Video Anomaly Detection [4.874215132369157]
The VQ-Mamba Unet (VQ-MaU) framework incorporates a Vector Quantization (VQ) layer and a Mamba-based Non-negative Visual State Space (NVSS) block. Results validate the efficacy of the proposed VADMamba across three benchmark datasets.
arXiv Detail & Related papers (2025-03-27T05:38:12Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - MambaDFuse: A Mamba-based Dual-phase Model for Multi-modality Image Fusion [4.2474907126377115]
Multi-modality image fusion (MMIF) aims to integrate complementary information from different modalities into a single fused image.
We propose a Mamba-based Dual-phase Fusion model (MambaDFuse) to extract modality-specific and modality-fused features.
Our approach achieves promising fusion results in infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2024-04-12T11:33:26Z) - FLAVR: Flow-Agnostic Video Representations for Fast Frame Interpolation [97.99012124785177]
FLAVR is a flexible and efficient architecture that uses 3D space-time convolutions to enable end-to-end learning and inference for video frame interpolation.
We demonstrate that FLAVR can serve as a useful self-supervised pretext task for action recognition, optical flow estimation, and motion magnification.
arXiv Detail & Related papers (2020-12-15T18:59:30Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z) - All at Once: Temporally Adaptive Multi-Frame Interpolation with Advanced Motion Modeling [52.425236515695914]
State-of-the-art methods are iterative solutions interpolating one frame at a time.
This work introduces a true multi-frame interpolator.
It utilizes a pyramidal style network in the temporal domain to complete the multi-frame task in one-shot.
arXiv Detail & Related papers (2020-07-23T02:34:39Z)