SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction
- URL: http://arxiv.org/abs/2503.18933v1
- Date: Mon, 24 Mar 2025 17:53:44 GMT
- Title: SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction
- Authors: Enrico Pallotta, Sina Mokhtarzadeh Azar, Shuai Li, Olga Zatsarynna, Juergen Gall
- Abstract summary: We propose a multi-modal framework for Synchronous Video Prediction (SyncVP). SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present.
- Score: 17.3162098606835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality. We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications.
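The abstract names two architectural ingredients: pre-trained modality-specific diffusion models and an efficient spatio-temporal cross-attention module that shares information between them. Below is a minimal PyTorch sketch of what such a cross-modal block could look like, assuming latent video tokens of shape (batch, frames, spatial tokens, channels); the module name, tensor layout, and the factorization into a spatial and a temporal pass are illustrative assumptions, not the actual SyncVP implementation.

```python
# Hypothetical sketch of a spatio-temporal cross-attention block that lets one
# modality's latent video attend to another's (e.g. RGB <-> depth).
# Shapes and names are illustrative, not taken from the SyncVP code.
import torch
import torch.nn as nn


class SpatioTemporalCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # Spatial pass: tokens within each frame attend across modalities.
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Temporal pass: each spatial location attends across frames.
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # x, ctx: (B, T, N, C) latent tokens of the target and context modality,
        # where T = number of frames and N = spatial tokens per frame.
        B, T, N, C = x.shape
        q, kv = self.norm_q(x), self.norm_kv(ctx)

        # Spatial cross-attention: fold time into the batch, attend over N tokens.
        out, _ = self.spatial_attn(q.reshape(B * T, N, C),
                                   kv.reshape(B * T, N, C),
                                   kv.reshape(B * T, N, C))
        x = x + out.reshape(B, T, N, C)

        # Temporal cross-attention: fold space into the batch, attend over T frames.
        qt = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        kvt = kv.permute(0, 2, 1, 3).reshape(B * N, T, C)
        out, _ = self.temporal_attn(qt, kvt, kvt)
        x = x + out.reshape(B, N, T, C).permute(0, 2, 1, 3)
        return x
```

In a joint denoising setup, two such blocks could be used symmetrically, with the RGB stream querying the depth stream and vice versa at each shared layer; how SyncVP places and shares these blocks is described in the paper itself.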
Related papers
- Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization [1.7820202405704466]
Video synchronization is crucial for applications such as reality TV show production, sports analysis, surveillance, and autonomous systems. Prior work has heavily relied on audio cues or specific visual events, limiting applicability in diverse settings. We introduce VideoSync, a video synchronization framework that operates independently of specific feature extraction methods.
arXiv Detail & Related papers (2025-06-19T00:41:21Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration [26.59510171451438]
Existing multi-sensor fusion research predominantly integrates complementary information from multiple images rather than videos.
VideoFusion exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos.
Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequential scenarios.
arXiv Detail & Related papers (2025-03-30T08:27:18Z) - EGVD: Event-Guided Video Diffusion Model for Physically Realistic Large-Motion Frame Interpolation [16.22243283808375]
Event-Guided Video Diffusion Model (EGVD) is a novel framework that leverages the powerful priors of pre-trained stable video diffusion models.
Our approach features a Multi-modal Motion Condition Generator (MMCG) that effectively integrates RGB frames and event signals to guide the diffusion process.
Experiments on both real and simulated datasets demonstrate that EGVD significantly outperforms existing methods in handling large motion.
arXiv Detail & Related papers (2025-03-26T06:33:32Z) - AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion [19.98565541640125]
We introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible video generation.
Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames.
This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence.
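As a purely illustrative aside (an assumption about what this constraint could look like in practice, not the authors' code), one way to realize non-decreasing per-frame corruption timesteps is to sort independently drawn timesteps along the frame axis, so that earlier frames are never noisier than later ones:

```python
# Illustrative sketch of a non-decreasing per-frame timestep assignment;
# the paper's actual sampling scheme may differ.
import torch


def sample_nondecreasing_timesteps(batch: int, frames: int, T: int = 1000) -> torch.Tensor:
    # Draw independent timesteps per frame, then sort along the frame axis so
    # t_1 <= t_2 <= ... <= t_F holds for every video in the batch.
    t = torch.randint(0, T, (batch, frames))
    return t.sort(dim=1).values


# Example: 2 videos of 6 frames each; each row is non-decreasing.
print(sample_nondecreasing_timesteps(2, 6))
```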
arXiv Detail & Related papers (2025-03-10T15:05:59Z) - Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation [47.75348821902489]
Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time.
Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios.
This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality.
arXiv Detail & Related papers (2025-01-01T13:40:09Z) - Motion-aware Latent Diffusion Models for Video Frame Interpolation [51.78737270917301]
Motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity.
We propose a novel diffusion framework, motion-aware latent diffusion models (MADiff).
Our method achieves state-of-the-art performance, significantly outperforming existing approaches.
arXiv Detail & Related papers (2024-04-21T05:09:56Z) - Spatial Decomposition and Temporal Fusion based Inter Prediction for Learned Video Compression [59.632286735304156]
We propose a spatial decomposition and temporal fusion based inter prediction for learned video compression.
With the SDD-based motion model and long short-term temporal fusion, our proposed learned video compression scheme can obtain more accurate inter prediction contexts.
arXiv Detail & Related papers (2024-01-29T03:30:21Z) - A Unified Framework for Event-based Frame Interpolation with Ad-hoc Deblurring in the Wild [72.0226493284814]
We propose a unified framework for event-based frame interpolation that performs deblurring ad-hoc. Our network consistently outperforms previous state-of-the-art methods on frame interpolation, single image deblurring, and the joint task of both.
arXiv Detail & Related papers (2023-01-12T18:19:00Z) - MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z) - Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate road layout understanding as a top-view road attributes prediction problem, and our goal is to predict these attributes for each frame both accurately and consistently.
We exploit the following three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z)