SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
- URL: http://arxiv.org/abs/2412.07760v1
- Date: Tue, 10 Dec 2024 18:55:17 GMT
- Title: SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
- Authors: Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, Di Zhang
- Abstract summary: We propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation.
We introduce a multi-view synchronization module to maintain appearance and geometry consistency across different viewpoints.
Our method enables intriguing extensions, such as re-rendering a video from novel viewpoints.
- Score: 43.14498014617223
- Abstract: Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency. This progress inspires us to investigate the potential of these models to ensure dynamic consistency across various viewpoints, a highly desirable feature for applications such as virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating 6 DoF camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we design a hybrid training scheme that leverages multi-camera images and monocular videos to supplement Unreal Engine-rendered multi-camera videos. Furthermore, our method enables intriguing extensions, such as re-rendering a video from novel viewpoints. We also release a multi-view synchronized video dataset, named SynCamVideo-Dataset. Project page: https://jianhongbai.github.io/SynCamMaster/.
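To make the described design concrete, below is a minimal sketch of what such a plug-and-play multi-view synchronization layer could look like. It is written under stated assumptions, not the released SynCamMaster code: the class name, tensor layout, and the choice of a flattened 3x4 extrinsic matrix as the camera-pose signal are all illustrative.

```python
import torch
import torch.nn as nn


class MultiViewSyncAttention(nn.Module):
    """Cross-view attention over the view axis, inserted as a residual layer
    inside a frozen pre-trained text-to-video block (an assumed integration
    point, not the paper's exact architecture)."""

    def __init__(self, dim: int, num_heads: int = 8, pose_dim: int = 12):
        super().__init__()
        # Hypothetical pose encoding: a flattened 3x4 extrinsic matrix
        # (12 values) projected into the token dimension, one per view.
        self.pose_proj = nn.Linear(pose_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)
        # Zero-init the output projection so the pre-trained model's behavior
        # is untouched before fine-tuning (a common plug-and-play trick).
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, x: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        # x:    (B, V, N, C) latent tokens for V synchronized views
        # pose: (B, V, pose_dim) camera extrinsics per view
        b, v, n, c = x.shape
        h = x + self.pose_proj(pose).unsqueeze(2)       # inject 6-DoF pose per view
        h = h.permute(0, 2, 1, 3).reshape(b * n, v, c)  # group same-position tokens across views
        h, _ = self.attn(h, h, h)                       # views exchange appearance/geometry cues
        h = h.reshape(b, n, v, c).permute(0, 2, 1, 3)
        return x + self.out(h)                          # residual connection


# Toy usage: 2 samples, 4 views, 256 tokens of width 320 (divisible by 8 heads).
layer = MultiViewSyncAttention(dim=320)
x = torch.randn(2, 4, 256, 320)
pose = torch.randn(2, 4, 12)
assert layer(x, pose).shape == x.shape
```

The zero-initialized residual output is a design choice worth noting: at the start of fine-tuning the layer is an identity mapping, so the frozen text-to-video backbone generates exactly as before and the synchronization behavior is learned gradually.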
Related papers
- VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation [70.61101071902596]
Current video generation models excel at generating short clips but still struggle with creating multi-shot, movie-like videos.
We propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation.
Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.
arXiv Detail & Related papers (2024-12-03T08:33:50Z) - Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention [62.2447324481159]
Cavia is a novel framework for camera-controllable, multi-view video generation.
Our framework extends the spatial and temporal attention modules, improving both viewpoint and temporal consistency.
Cavia is the first framework of its kind that allows the user to specify distinct camera motions while obtaining consistent object motion.
arXiv Detail & Related papers (2024-10-14T17:46:32Z) - Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control [70.17137528953953]
Collaborative video diffusion (CVD) is trained on top of a state-of-the-art camera-control module for video generation.
CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines.
arXiv Detail & Related papers (2024-05-27T17:58:01Z) - Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis [43.02778060969546]
We propose a controllable monocular dynamic view synthesis pipeline.
Our model does not require depth as input, and does not explicitly model 3D scene geometry.
We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
arXiv Detail & Related papers (2024-05-23T17:59:52Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model [19.288610627281102]
We propose DrivingDiffusion to generate realistic multi-view videos controlled by 3D layout.
Our model can generate large-scale realistic multi-camera driving videos in complex urban scenes.
arXiv Detail & Related papers (2023-10-11T18:00:08Z) - InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images [83.37640073416749]
We present a method for learning to generate flythrough videos of natural scenes starting from a single view.
This capability is learned from a collection of single photographs, without requiring camera poses or even multiple views of each scene.
arXiv Detail & Related papers (2022-07-22T15:41:06Z) - Learning to Deblur and Rotate Motion-Blurred Faces [43.673660541417995]
We train a neural network to reconstruct a 3D video representation from a single image and the corresponding face gaze.
We then feed a camera viewpoint, specified relative to the estimated gaze, together with the blurry image into an encoder-decoder network to generate a video of sharp frames from that novel camera viewpoint.
arXiv Detail & Related papers (2021-12-14T17:51:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.