4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation
- URL: http://arxiv.org/abs/2506.18839v1
- Date: Wed, 18 Jun 2025 23:44:59 GMT
- Title: 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation
- Authors: Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Avalon Vinella, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Peter Wonka
- Abstract summary: We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training.
- Score: 66.20991603309054
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.
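The sparse attention pattern described in the abstract admits a compact mask formulation: a token attends to others that share its viewpoint or its timestamp, which also covers tokens in the same frame. Below is a minimal PyTorch sketch of such a mask, assuming a flattened token layout ordered by view, then timestamp, then spatial position; the function name and the dense-mask construction are illustrative assumptions, not the authors' implementation, which would presumably use an efficient block-sparse attention kernel.

```python
import torch

def fused_view_time_mask(num_views, num_frames, tokens_per_frame, device="cpu"):
    """Build a boolean attention mask for the sparse view-time pattern described
    in the abstract: a token may attend to tokens in the same frame, tokens that
    share its timestamp (across views), or tokens that share its viewpoint
    (across time). Returns an [N, N] mask with True = attention allowed, where
    N = num_views * num_frames * tokens_per_frame.

    Illustrative sketch of the attention pattern only; layout and names are
    assumptions, not the paper's code.
    """
    # Assign every token its (view, time) indices in a flattened layout:
    # tokens ordered by view, then frame/timestamp, then spatial position.
    view_ids = torch.arange(num_views, device=device).repeat_interleave(num_frames * tokens_per_frame)
    time_ids = torch.arange(num_frames, device=device).repeat_interleave(tokens_per_frame).repeat(num_views)

    same_view = view_ids[:, None] == view_ids[None, :]   # same viewpoint, any time
    same_time = time_ids[:, None] == time_ids[None, :]   # same timestamp, any view
    # "Same frame" (same view AND same time) is already covered by either term.
    return same_view | same_time


if __name__ == "__main__":
    mask = fused_view_time_mask(num_views=4, num_frames=6, tokens_per_frame=16)
    # The mask can be passed to a standard attention call, e.g.
    # torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    print(mask.shape, mask.float().mean().item())  # fraction of allowed token pairs
```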
Related papers
- 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation [23.361360623083943]
We present 4DVD, a video diffusion model that generates 4D content in a decoupled manner. To train 4DVD, we collect a dynamic 3D dataset called D-averse from a benchmark. Experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation.
arXiv Detail & Related papers (2025-08-06T14:08:36Z) - 4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time [74.07107064085409]
4D-LRM is the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary view-time combinations. It learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time. It reconstructs 24-frame sequences in a single forward pass in less than 1.5 seconds on a single A100 GPU.
arXiv Detail & Related papers (2025-06-23T17:57:47Z) - Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion [52.0192865857058]
We propose the first training-free 4D video generation method that leverages off-the-shelf video diffusion models to generate multi-view videos from a single input video. Our method requires no training and fully utilizes an off-the-shelf video diffusion model, offering a practical and effective solution for multi-view video generation.
arXiv Detail & Related papers (2025-03-28T17:14:48Z) - Can Video Diffusion Model Reconstruct 4D Geometry? [66.5454886982702]
Sora3R is a novel framework that taps into the rich spatio-temporal priors of large dynamic video diffusion models to infer 4D pointmaps from casual videos. Experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction.
arXiv Detail & Related papers (2025-03-27T01:44:46Z) - Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image. Our key insight is to distill pre-trained foundation models for consistent 4D scene representation. The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z) - DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion [22.11178016375823]
DimensionX is a framework designed to generate 3D and 4D scenes from just a single image with video diffusion.
Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames.
arXiv Detail & Related papers (2024-11-07T18:07:31Z) - Controlling Space and Time with Diffusion Models [34.7002868116714]
We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS). We enable training on a mixture of 3D (with camera pose), 4D (pose + time), and video (time but no pose) data. 4DiM is the first-ever NVS method with intuitive metric-scale camera pose control.
arXiv Detail & Related papers (2024-07-10T17:23:33Z) - NeRFPlayer: A Streamable Dynamic Scene Representation with Decomposed Neural Radiance Fields [99.57774680640581]
We present an efficient framework capable of fast reconstruction, compact modeling, and streamable rendering.
We propose to decompose the 4D space according to temporal characteristics. Points in the 4D space are associated with probabilities of belonging to three categories: static, deforming, and new areas (a toy sketch of this decomposition appears after this list).
arXiv Detail & Related papers (2022-10-28T07:11:05Z)
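The NeRFPlayer entry above describes decomposing 4D space by temporal characteristics, with each space-time point assigned probabilities over three categories (static, deforming, new). The sketch below illustrates one way such a decomposition field could be expressed; the module name, MLP shape, and class ordering are assumptions for illustration, not NeRFPlayer's actual architecture.

```python
import torch
import torch.nn as nn

class TemporalDecompositionField(nn.Module):
    """Toy field mapping a 4D point (x, y, z, t) to a probability distribution
    over the three temporal categories named in the NeRFPlayer summary:
    static, deforming, and new areas. Illustrative only; NeRFPlayer uses the
    decomposition to route points to differently parameterized branches."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),          # logits for [static, deforming, new]
        )

    def forward(self, xyzt: torch.Tensor) -> torch.Tensor:
        # xyzt: [N, 4] space-time samples -> [N, 3] category probabilities
        return torch.softmax(self.mlp(xyzt), dim=-1)


# Usage: query the field at sampled space-time points.
field = TemporalDecompositionField()
points = torch.rand(8, 4)                 # random (x, y, z, t) samples
probs = field(points)                     # each row sums to 1 over the 3 categories
```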