4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation
- URL: http://arxiv.org/abs/2508.04467v1
- Date: Wed, 06 Aug 2025 14:08:36 GMT
- Title: 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation
- Authors: Shuzhou Yang, Xiaodong Cun, Xiaoyu Li, Yaowei Li, Jian Zhang
- Abstract summary: We present 4DVD, a cascaded video diffusion model that generates 4D content in a decoupled manner. To train 4DVD, we collect a dynamic 3D dataset called D-Objaverse from the Objaverse benchmark. Experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation.
- Score: 23.361360623083943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the high complexity of directly generating high-dimensional data such as 4D, we present 4DVD, a cascaded video diffusion model that generates 4D content in a decoupled manner. Unlike previous multi-view video methods that directly model 3D space and temporal features simultaneously with stacked cross-view/temporal attention modules, 4DVD decouples this into two subtasks: coarse multi-view layout generation and structure-aware conditional generation, and effectively unifies them. Specifically, given a monocular video, 4DVD first predicts the dense-view content of its layout with superior cross-view and temporal consistency. Based on the produced layout priors, a structure-aware spatio-temporal generation branch is developed, combining these coarse structural priors with the exquisite appearance content of the input monocular video to generate final high-quality dense-view videos. Benefiting from this, an explicit 4D representation (such as 4D Gaussians) can be optimized accurately, enabling wider practical applications. To train 4DVD, we collect a dynamic 3D object dataset, called D-Objaverse, from the Objaverse benchmark and render 16 videos with 21 frames for each object. Extensive experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation. Our project page is https://4dvd.github.io/
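The decoupled, cascaded pipeline described in the abstract can be summarized schematically. The sketch below is a minimal, hypothetical illustration only: class and function names such as `CoarseLayoutDiffusion`, `StructureAwareRefiner`, and `generate_dense_view_videos` are placeholders rather than the authors' API, and the real 4DVD stages are latent video diffusion networks that are stubbed here with simple tensor operations just to show the data flow (monocular video → coarse dense-view layout → refined dense-view videos, ready for 4D Gaussian fitting).

```python
# Hypothetical sketch of 4DVD's decoupled cascade; not the authors' code.
import torch
import torch.nn as nn

NUM_VIEWS, NUM_FRAMES = 16, 21  # D-Objaverse rendering setup: 16 videos x 21 frames per object


class CoarseLayoutDiffusion(nn.Module):
    """Stage 1 (stub): predict a coarse dense-view layout from a monocular video."""

    def forward(self, mono_video: torch.Tensor) -> torch.Tensor:
        # mono_video: (T, C, H, W) -> coarse layout: (V, T, C, H, W)
        return mono_video.unsqueeze(0).expand(NUM_VIEWS, -1, -1, -1, -1).clone()


class StructureAwareRefiner(nn.Module):
    """Stage 2 (stub): fuse coarse structural priors with the appearance of the input video."""

    def forward(self, layout: torch.Tensor, mono_video: torch.Tensor) -> torch.Tensor:
        # In the paper this is a structure-aware spatio-temporal diffusion branch;
        # here the two inputs are simply blended to keep the sketch runnable.
        return 0.5 * layout + 0.5 * mono_video.unsqueeze(0)


def generate_dense_view_videos(mono_video: torch.Tensor) -> torch.Tensor:
    """Decoupled cascade: coarse multi-view layout, then structure-aware refinement."""
    layout = CoarseLayoutDiffusion()(mono_video)                # subtask 1: layout generation
    dense_views = StructureAwareRefiner()(layout, mono_video)  # subtask 2: conditional generation
    return dense_views                                          # used to optimize a 4D Gaussian


if __name__ == "__main__":
    video = torch.rand(NUM_FRAMES, 3, 64, 64)  # toy monocular input
    out = generate_dense_view_videos(video)
    print(out.shape)  # torch.Size([16, 21, 3, 64, 64])
```

In the paper, the dense-view videos produced by the second stage are then used to fit an explicit 4D representation such as 4D Gaussians.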
Related papers
- 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation [66.20991603309054]
We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training.
arXiv Detail & Related papers (2025-06-18T23:44:59Z) - Can Video Diffusion Model Reconstruct 4D Geometry? [66.5454886982702]
Sora3R is a novel framework that taps into the rich spatio-temporal priors of large dynamic video diffusion models to infer 4D pointmaps from casual videos. Experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction.
arXiv Detail & Related papers (2025-03-27T01:44:46Z) - Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image. Our key insight is to distill pre-trained foundation models for consistent 4D scene representation. The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z) - CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models [98.03734318657848]
We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. We leverage a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks.
arXiv Detail & Related papers (2024-11-27T18:57:16Z) - SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency [37.96042037188354]
We present Stable Video 4D (SV4D), a latent video diffusion model for multi-frame and multi-view consistent dynamic 3D content generation.
arXiv Detail & Related papers (2024-07-24T17:59:43Z) - Controlling Space and Time with Diffusion Models [34.7002868116714]
We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS). We enable training on a mixture of 3D (with camera pose), 4D (pose+time) and video (time but no pose) data. 4DiM is the first-ever NVS method with intuitive metric-scale camera pose control.
arXiv Detail & Related papers (2024-07-10T17:23:33Z) - Comp4D: LLM-Guided Compositional 4D Scene Generation [65.5810466788355]
We present Comp4D, a novel framework for Compositional 4D Generation.
Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately.
Our method employs a compositional score distillation technique guided by the pre-defined trajectories.
arXiv Detail & Related papers (2024-03-25T17:55:52Z) - Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video [42.10482273572879]
We propose an efficient video-to-4D object generation framework called Efficient4D. It generates high-quality spacetime-consistent images under different camera views, and then uses them as labeled data. Experiments on both synthetic and real videos show that Efficient4D offers a remarkable 10-fold increase in speed.
arXiv Detail & Related papers (2024-01-16T18:58:36Z) - 4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency [118.15258850780417]
We present 4DGen, a novel framework for grounded 4D content creation. Our pipeline facilitates controllable 4D generation, enabling users to specify the motion via monocular video or adopt image-to-video generations. Compared to existing video-to-4D baselines, our approach yields superior results in faithfully reconstructing input signals.
arXiv Detail & Related papers (2023-12-28T18:53:39Z)