Predicting 3D representations for Dynamic Scenes
- URL: http://arxiv.org/abs/2501.16617v1
- Date: Tue, 28 Jan 2025 01:31:15 GMT
- Title: Predicting 3D representations for Dynamic Scenes
- Authors: Di Qi, Tong Yang, Beining Wang, Xiangyu Zhang, Wenqiang Zhang
- Abstract summary: We present a novel framework for dynamic radiance field prediction given monocular video streams. Our method goes a step further by generating explicit 3D representations of the dynamic scene. We find that our approach emerges capabilities for geometry and semantic learning.
- Score: 29.630985082164383
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a novel framework for dynamic radiance field prediction given monocular video streams. Unlike previous methods that primarily focus on predicting future frames, our method goes a step further by generating explicit 3D representations of the dynamic scene. The framework builds on two core designs. First, we adopt an ego-centric unbounded triplane to explicitly represent the dynamic physical world. Second, we develop a 4D-aware transformer to aggregate features from monocular videos to update the triplane. Coupling these two designs enables us to train the proposed model with large-scale monocular videos in a self-supervised manner. Our model achieves top results in dynamic radiance field prediction on NVIDIA dynamic scenes, demonstrating its strong performance on 4D physical world modeling. Besides, our model shows a superior generalizability to unseen scenarios. Notably, we find that our approach emerges capabilities for geometry and semantic learning.
Related papers
- StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation [6.0744834626758495]
StemVLA is a novel framework that explicitly incorporates both future-oriented 3D spatial knowledge and historical 4D representations into action prediction. We show that StemVLA significantly improves long-horizon task success and state-of-the-art performance on the CALVIN ABC-D benchmark [46], achieving an average sequence length of XXX.
arXiv Detail & Related papers (2026-02-27T06:43:37Z) - RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space [51.441415833480505]
RAYNOVA is a multiview world model for driving scenarios that employs a dual-causal autoregressive framework. It constructs an isotropic-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding.
arXiv Detail & Related papers (2026-02-24T08:41:40Z) - VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control [83.92729346325163]
VerseCrafter is a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos.
arXiv Detail & Related papers (2026-01-08T17:28:52Z) - 4D3R: Motion-Aware Neural Reconstruction and Rendering of Dynamic Scenes from Monocular Videos [52.89084603734664]
We present 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. Our approach achieves up to 1.8dB PSNR improvement over state-of-the-art methods.
arXiv Detail & Related papers (2025-11-07T13:25:50Z) - AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes [63.055387623861094]
Recent Text-to-Video (T2V) models have demonstrated powerful capability in visual simulation of real-world geometry and physical laws. We propose a two-stage paradigm to adapt pre-trained T2V models for viewpoint prediction.
arXiv Detail & Related papers (2025-10-12T15:55:44Z) - MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second [29.926373004694728]
MoVieS represents dynamic 3D scenes using pixel-aligned grids of Gaussian primitives. MoVieS enables view synthesis, reconstruction and 3D point tracking within a single learning-based framework.
arXiv Detail & Related papers (2025-07-14T08:49:57Z) - TesserAct: Learning 4D Embodied World Models [66.8519958275311]
We learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos.
This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent.
arXiv Detail & Related papers (2025-04-29T17:59:30Z) - PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model [23.768571323272152]
PartRM is a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object.
We introduce the PartDrag-4D dataset, providing multi-view observations of part-level dynamics across over 20,000 states.
Experimental results show that PartRM establishes a new state-of-the-art in part-level motion learning and can be applied in manipulation tasks in robotics.
arXiv Detail & Related papers (2025-03-25T17:59:58Z) - Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation [54.60804602905519]
We learn an entangled representation, aiming to model layered scene geometry, motion forecasting and novel view synthesis together.
Our approach chooses to disentangle scene geometry from scene motion, via lifting the 2D scene to 3D point clouds.
To model future 3D scene motion, we propose a disentangled two-stage approach that initially forecasts ego-motion and subsequently the residual motion of dynamic objects.
arXiv Detail & Related papers (2024-07-31T08:54:50Z) - SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency [37.96042037188354]
We present Stable Video 4D (SV4D), a latent video diffusion model for multi-frame and multi-view consistent dynamic 3D content generation.
arXiv Detail & Related papers (2024-07-24T17:59:43Z) - Shape of Motion: 4D Reconstruction from a Single Video [51.04575075620677]
We introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion.
We exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE3 motion bases.
Our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
arXiv Detail & Related papers (2024-07-18T17:59:08Z) - NVFi: Neural Velocity Fields for 3D Physics Learning from Dynamic Videos [8.559809421797784]
We propose to simultaneously learn the geometry, appearance, and physical velocity of 3D scenes only from video frames.
We conduct extensive experiments on multiple datasets, demonstrating the superior performance of our method over all baselines.
arXiv Detail & Related papers (2023-12-11T14:07:31Z) - EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via
Self-Supervision [85.17951804790515]
EmerNeRF is a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes.
It simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping.
Our method achieves state-of-the-art performance in sensor simulation.
arXiv Detail & Related papers (2023-11-03T17:59:55Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.