A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking
- URL: http://arxiv.org/abs/2505.19858v1
- Date: Mon, 26 May 2025 11:45:10 GMT
- Title: A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking
- Authors: Zixiang Zhao, Haowen Bai, Bingxin Ke, Yukun Cui, Lilun Deng, Yulun Zhang, Kai Zhang, Konrad Schindler
- Abstract summary: We propose Unified Video Fusion (UniVF), a novel framework for temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks.
- Score: 47.312955861553995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and leading to flickering and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a novel framework that leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, together with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of video fusion. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: https://vfbench.github.io.
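The abstract mentions optical flow-based feature warping and an evaluation protocol that jointly scores spatial quality and temporal consistency, but gives no implementation details. As a rough illustration only (a minimal PyTorch sketch with made-up names, not the UniVF or VF-Bench code), warping a neighboring frame's features into the current frame and measuring a flow-compensated warping error might look like this:

```python
# Hypothetical sketch, not the authors' implementation: flow-based feature
# warping plus a warping-error proxy for temporal consistency.
import torch
import torch.nn.functional as F


def warp_with_flow(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp features (B, C, H, W) with a backward optical flow (B, 2, H, W).

    flow[:, 0] holds horizontal and flow[:, 1] vertical displacements in pixels.
    """
    _, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates.
    yy, xx = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid_x = xx.unsqueeze(0) + flow[:, 0]
    grid_y = yy.unsqueeze(0) + flow[:, 1]
    # Normalize to [-1, 1] as expected by grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(
        feat, grid, mode="bilinear", padding_mode="border", align_corners=True
    )


def temporal_warping_error(fused_prev, fused_curr, flow_curr_to_prev):
    """Mean absolute difference between the current fused frame and the
    previous fused frame warped into its coordinates; lower values suggest
    better temporal consistency."""
    warped_prev = warp_with_flow(fused_prev, flow_curr_to_prev)
    return (fused_curr - warped_prev).abs().mean()
```

In a multi-frame fusion pipeline, such warped neighbor features would typically be combined with the current frame's features before fusion; the actual UniVF architecture and VF-Bench metric definitions may differ from this sketch.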
Related papers
- MamFusion: Multi-Mamba with Temporal Fusion for Partially Relevant Video Retrieval [4.747939043785552]
Partially Relevant Video Retrieval (PRVR) is a challenging task in the domain of multimedia retrieval. In this work, we investigate long-sequence video content understanding to address information redundancy issues. We introduce a multi-Mamba module with a temporal fusion framework (MamFusion) tailored for the PRVR task.
arXiv Detail & Related papers (2025-06-04T01:08:03Z) - VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration [26.59510171451438]
Existing multi-sensor fusion research predominantly integrates complementary information from multiple images rather than videos. VideoFusion exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos. Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequential scenarios.
arXiv Detail & Related papers (2025-03-30T08:27:18Z) - CFSum: A Transformer-Based Multi-Modal Video Summarization Framework With Coarse-Fine Fusion [22.58710742780161]
CFSum is a transformer-based multi-modal video summarization framework with coarse-fine fusion. CFSum exploits video, text, and audio modal features as input, and incorporates a two-stage transformer-based feature fusion framework.
arXiv Detail & Related papers (2025-03-01T06:13:13Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder to maintain consistency within short sequences; globally, a flow-guided recurrent latent propagation module enhances stability across the whole video.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing [90.06041718086317]
We propose a unified Multi-alignment Diffusion, dubbed MagDiff, for both high-fidelity video generation and editing.
The proposed MagDiff introduces three types of alignments, including subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment.
arXiv Detail & Related papers (2023-11-29T03:36:07Z) - Three-Stage Cascade Framework for Blurry Video Frame Interpolation [23.38547327916875]
Blurry video frame interpolation (BVFI) aims to generate high-frame-rate clear videos from low-frame-rate blurry videos.
BVFI methods usually fail to fully leverage all valuable information, which ultimately hinders their performance.
We propose a simple end-to-end three-stage framework to fully explore useful information from blurry videos.
arXiv Detail & Related papers (2023-10-09T03:37:30Z) - Aggregating Nearest Sharp Features via Hybrid Transformers for Video Deblurring [70.06559269075352]
We propose a video deblurring method that leverages both neighboring frames and existing sharp frames using hybrid Transformers for feature aggregation. To aggregate nearest sharp features from detected sharp frames, we utilize a global Transformer with multi-scale matching capability. Our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality.
arXiv Detail & Related papers (2023-09-13T16:12:11Z) - UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z) - Video Dehazing via a Multi-Range Temporal Alignment Network with
Physical Prior [117.6741444489174]
Video dehazing aims to recover haze-free frames with high visibility and contrast.
This paper presents a novel framework to explore the physical haze priors and aggregate temporal information.
We construct the first large-scale outdoor video dehazing benchmark dataset.
arXiv Detail & Related papers (2023-03-17T03:44:17Z) - Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence into a single informative "frame".
A valid question is how to define "useful information" and then distill it from a sequence down to one synthetic frame.
The proposed IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z) - Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA uses the estimated optical flow as guidance to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
Our proposed FGST outperforms state-of-the-art methods on both the DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z) - Lightweight Attentional Feature Fusion for Video Retrieval by Text [7.042239213092635]
We aim for feature fusion for both ends within a unified framework.
We propose Lightweight Attentional Feature Fusion (LAFF).
LAFF performs feature fusion at both early and late stages and at both video and text ends.
arXiv Detail & Related papers (2021-12-03T10:41:12Z)