Seeing World Dynamics in a Nutshell
- URL: http://arxiv.org/abs/2502.03465v1
- Date: Wed, 05 Feb 2025 18:59:52 GMT
- Title: Seeing World Dynamics in a Nutshell
- Authors: Qiuhong Shen, Xuanyu Yi, Mingbao Lin, Hanwang Zhang, Shuicheng Yan, Xinchao Wang
- Abstract summary: NutWorld is a framework that transforms monocular videos into dynamic 3D representations in a single forward pass.
We demonstrate that NutWorld achieves high-fidelity video reconstruction quality while enabling downstream applications in real time.
- Score: 132.79736435144403
- Abstract: We consider the problem of efficiently representing casually captured monocular videos in a spatially and temporally coherent manner. While existing approaches predominantly rely on 2D/2.5D techniques that treat videos as collections of spatiotemporal pixels, they struggle with complex motions, occlusions, and geometric consistency due to the absence of temporal coherence and explicit 3D structure. Drawing inspiration from monocular video as a projection of the dynamic 3D world, we explore representing videos in their intrinsic 3D form through continuous flows of Gaussian primitives in space-time. In this paper, we propose NutWorld, a novel framework that efficiently transforms monocular videos into dynamic 3D Gaussian representations in a single forward pass. At its core, NutWorld introduces a structured spatial-temporal aligned Gaussian (STAG) representation, enabling optimization-free scene modeling with effective depth and flow regularization. Through comprehensive experiments, we demonstrate that NutWorld achieves high-fidelity video reconstruction quality while enabling various downstream applications in real time. Demos and code will be available at https://github.com/Nut-World/NutWorld.
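The abstract does not pin down the exact layout of the STAG representation, so the following is only a minimal sketch of what a spatial-temporal aligned Gaussian container could look like: static Gaussian attributes plus per-frame center offsets, so that a single forward pass yields the whole clip. All field and class names here are hypothetical, not the paper's.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class STAG:
    """Hypothetical spatial-temporal aligned Gaussian set over T frames.

    Each Gaussian carries static appearance attributes plus per-frame
    offsets of its center, so one forward pass can emit the whole sequence.
    """
    mean: np.ndarray      # (N, 3) canonical 3D centers
    scale: np.ndarray     # (N, 3) per-axis scales
    rotation: np.ndarray  # (N, 4) unit quaternions
    opacity: np.ndarray   # (N,)  in [0, 1]
    color: np.ndarray     # (N, 3) RGB
    flow: np.ndarray      # (T, N, 3) per-frame offsets of the centers

    def at_frame(self, t: int) -> np.ndarray:
        """Return the deformed Gaussian centers at frame t."""
        return self.mean + self.flow[t]

# Toy instance: 1000 Gaussians over 16 frames.
rng = np.random.default_rng(0)
g = STAG(
    mean=rng.normal(size=(1000, 3)),
    scale=np.full((1000, 3), 0.01),
    rotation=np.tile([1.0, 0.0, 0.0, 0.0], (1000, 1)),
    opacity=np.full(1000, 0.5),
    color=rng.uniform(size=(1000, 3)),
    flow=np.zeros((16, 1000, 3)),
)
print(g.at_frame(3).shape)  # (1000, 3)
```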
Related papers
- SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video [26.468480933928458]
We propose SplineGS, a COLMAP-free dynamic 3D Gaussian Splatting framework for high-quality reconstruction and fast rendering from monocular videos.
At its core is a novel Motion-Adaptive Spline (MAS) method, which represents continuous dynamic 3D Gaussian trajectories.
We present a joint optimization strategy for camera parameter estimation and 3D Gaussian attributes, leveraging photometric and geometric consistency.
arXiv Detail & Related papers (2024-12-13T09:09:14Z)
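SplineGS's Motion-Adaptive Spline is described only at the level of "continuous dynamic 3D Gaussian trajectories"; as a stand-in, the sketch below evaluates a plain uniform Catmull-Rom spline through per-Gaussian control points. The real MAS additionally adapts its knots to the motion, which this sketch does not attempt.

```python
import numpy as np

def catmull_rom(ctrl: np.ndarray, t: float) -> np.ndarray:
    """Evaluate a uniform Catmull-Rom spline through control points.

    ctrl: (K, 3) control points of one Gaussian trajectory, K >= 4.
    t:    normalized time in [0, 1] over the interior segments.
    Returns the interpolated 3D position.
    """
    K = len(ctrl)
    seg = min(int(t * (K - 3)), K - 4)   # which segment t falls in
    u = t * (K - 3) - seg                # local parameter in [0, 1]
    p0, p1, p2, p3 = ctrl[seg:seg + 4]
    return 0.5 * ((2 * p1)
                  + (-p0 + p2) * u
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * u**2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * u**3)

# A Gaussian center moving along 6 control points, sampled at 30 times.
ctrl = np.array([[0, 0, 0], [1, 0, 0], [2, 1, 0],
                 [3, 1, 1], [4, 0, 1], [5, 0, 0]], dtype=float)
trajectory = np.stack([catmull_rom(ctrl, t) for t in np.linspace(0, 1, 30)])
print(trajectory.shape)  # (30, 3)
```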
- Shape of Motion: 4D Reconstruction from a Single Video [51.04575075620677]
We introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion.
We exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE3 motion bases.
Our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
arXiv Detail & Related papers (2024-07-18T17:59:08Z)
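One plausible reading of "a compact set of SE3 motion bases" is to blend shared twists in the Lie algebra with per-point weights and then exponentiate; the sketch below implements that reading with a standard se(3) exponential map. The blending scheme is an assumption for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def se3_exp(xi: np.ndarray) -> np.ndarray:
    """Exponential map from a twist xi = (omega, v) in R^6 to a 4x4 SE(3) matrix."""
    omega, v = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    W = np.array([[0, -omega[2], omega[1]],
                  [omega[2], 0, -omega[0]],
                  [-omega[1], omega[0], 0]])
    if theta < 1e-8:
        R, V = np.eye(3) + W, np.eye(3)  # first-order approximation
    else:
        A = np.sin(theta) / theta
        B = (1 - np.cos(theta)) / theta**2
        C = (1 - A) / theta**2
        R = np.eye(3) + A * W + B * (W @ W)
        V = np.eye(3) + B * W + C * (W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

# K shared motion bases (twists) and per-point blending weights.
K, N = 8, 1000
rng = np.random.default_rng(0)
bases = 0.1 * rng.normal(size=(K, 6))        # shared se(3) twists
weights = rng.dirichlet(np.ones(K), size=N)  # (N, K), rows sum to 1
points = np.c_[rng.normal(size=(N, 3)), np.ones(N)]  # homogeneous coords

# Each point's motion is the exponential of its blended twist.
moved = np.stack([se3_exp(weights[i] @ bases) @ points[i] for i in range(N)])
print(moved.shape)  # (1000, 4)
```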
- SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix [60.48666051245761]
We propose a pose-free and training-free approach for generating 3D stereoscopic videos.
Our method warps a generated monocular video into camera views on a stereoscopic baseline using estimated video depth.
We develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting.
arXiv Detail & Related papers (2024-06-29T08:33:55Z)
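A minimal version of the depth-based warp: shift each pixel by the disparity focal * baseline / depth to synthesize the second eye, leaving holes at disocclusions (which an inpainting step such as the paper's re-injection scheme would then fill). The camera parameters below are placeholders, and the simple scatter ignores occlusion ordering.

```python
import numpy as np

def warp_to_right_eye(frame: np.ndarray, depth: np.ndarray,
                      baseline: float = 0.06, focal: float = 500.0) -> np.ndarray:
    """Forward-warp a monocular frame to a right-eye view via per-pixel disparity.

    frame: (H, W, 3) image; depth: (H, W) metric depth in meters.
    disparity (pixels) = focal * baseline / depth. Zeros left behind are
    disocclusion holes for a later inpainting step.
    """
    H, W, _ = frame.shape
    right = np.zeros_like(frame)
    disparity = (focal * baseline / np.maximum(depth, 1e-6)).astype(int)
    xs = np.arange(W)[None, :].repeat(H, axis=0)
    ys = np.arange(H)[:, None].repeat(W, axis=1)
    new_xs = xs - disparity                    # content shifts left in the right eye
    valid = (new_xs >= 0) & (new_xs < W)
    right[ys[valid], new_xs[valid]] = frame[ys[valid], xs[valid]]
    return right

frame = np.random.rand(240, 320, 3)
depth = np.full((240, 320), 2.0)               # flat scene 2 m away
print(warp_to_right_eye(frame, depth).shape)   # (240, 320, 3)
```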
- Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses [9.529416246409355]
We present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input.
As a key idea, we represent both the world and multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation.
arXiv Detail & Related papers (2024-04-22T17:59:50Z)
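A toy sketch of the composition idea: treat the scene as the union of a static "world" Gaussian set and per-human Gaussian sets posed independently at each frame. The rigid per-human pose used here is a simplification for illustration; the paper fits articulated humans.

```python
import numpy as np

# Static background Gaussians plus two humans (names are illustrative).
rng = np.random.default_rng(0)
world_means = rng.normal(size=(5000, 3))
human_means = [rng.normal(size=(800, 3)) for _ in range(2)]

def compose_frame(world, humans, human_poses):
    """Concatenate world Gaussians with each human's Gaussians after
    applying that human's per-frame rigid pose (a 4x4 matrix)."""
    posed = []
    for means, T in zip(humans, human_poses):
        homo = np.c_[means, np.ones(len(means))]
        posed.append((homo @ T.T)[:, :3])
    return np.vstack([world] + posed)

poses = [np.eye(4), np.eye(4)]  # identity poses for the toy example
print(compose_frame(world_means, human_means, poses).shape)  # (6600, 3)
```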
- Dynamic 3D Point Cloud Sequences as 2D Videos [81.46246338686478]
3D point cloud sequences serve as one of the most common and practical representation modalities of real-world environments.
We propose a novel generic representation called Structured Point Cloud Videos (SPCVs).
SPCVs re-organize a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points.
arXiv Detail & Related papers (2024-03-02T08:18:57Z)
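The core data structure is easy to picture: a point cloud sequence becomes a (T, H, W, 3) tensor whose "pixels" store 3D coordinates. The sketch below uses a simple spherical projection as a stand-in for the learned, smoothness-preserving re-organization in the paper.

```python
import numpy as np

def to_spcv_frame(points: np.ndarray, H: int = 64, W: int = 256) -> np.ndarray:
    """Rasterize one point cloud into an (H, W, 3) grid of 3D coordinates.

    A fixed spherical projection stands in for the learned re-organization;
    each 'pixel' stores the xyz of a point falling in that direction bin.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-9
    azimuth = np.arctan2(y, x)                       # [-pi, pi]
    elevation = np.arcsin(z / r)                     # [-pi/2, pi/2]
    u = ((azimuth + np.pi) / (2 * np.pi) * (W - 1)).astype(int)
    v = ((elevation + np.pi / 2) / np.pi * (H - 1)).astype(int)
    frame = np.zeros((H, W, 3))
    frame[v, u] = points                             # last point per bin wins
    return frame

# A sequence of T clouds becomes a (T, H, W, 3) "video" of coordinates.
clouds = [np.random.randn(4096, 3) for _ in range(8)]
video = np.stack([to_spcv_frame(c) for c in clouds])
print(video.shape)  # (8, 64, 256, 3)
```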
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
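An autodecoder has no encoder: each training asset owns a learnable latent code that is optimized jointly with a shared decoder. Here is a minimal PyTorch sketch of that pattern, with a toy MLP emitting a density grid as a stand-in for the paper's volumetric latent space and normalization operations.

```python
import torch
import torch.nn as nn

class AutoDecoder(nn.Module):
    """Minimal 3D autodecoder sketch: a learnable code per asset plus a
    shared decoder, trained jointly (no encoder network)."""
    def __init__(self, num_assets: int, latent_dim: int = 128, grid: int = 16):
        super().__init__()
        self.codes = nn.Embedding(num_assets, latent_dim)  # one code per asset
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, grid ** 3),                     # toy density volume
        )
        self.grid = grid

    def forward(self, asset_ids: torch.Tensor) -> torch.Tensor:
        z = self.codes(asset_ids)                          # look up latent codes
        return self.decoder(z).view(-1, self.grid, self.grid, self.grid)

model = AutoDecoder(num_assets=100)
vol = model(torch.tensor([0, 7]))
print(vol.shape)  # torch.Size([2, 16, 16, 16])
```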
- Decoupling Dynamic Monocular Videos for Dynamic View Synthesis [50.93409250217699]
We tackle the challenge of dynamic view synthesis from dynamic monocular videos in an unsupervised fashion.
Specifically, we decouple the observed motion into object motion and camera motion, regularized respectively by the proposed unsupervised surface consistency and patch-based multi-view constraints.
arXiv Detail & Related papers (2023-04-04T11:25:44Z)
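The decoupling can be sketched as flow arithmetic: the flow a static scene would produce under the camera motion follows from depth and relative pose, and subtracting it from the observed flow leaves the object motion. The intrinsics, depth, and pose below are placeholders, and the sign convention of the relative pose is left loose.

```python
import numpy as np

def camera_induced_flow(depth: np.ndarray, K: np.ndarray,
                        T_rel: np.ndarray) -> np.ndarray:
    """Flow each pixel would undergo if the scene were static, given depth,
    intrinsics K (3x3), and a relative camera pose T_rel (4x4)."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(float)
    # Back-project to 3D, apply the relative camera motion, re-project.
    pts = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    pts = np.c_[pts, np.ones(len(pts))] @ T_rel.T
    proj = (K @ pts[:, :3].T).T
    proj = proj[:, :2] / proj[:, 2:3]
    return (proj - pix[:, :2]).reshape(H, W, 2)

depth = np.full((48, 64), 3.0)
K = np.array([[60.0, 0, 32], [0, 60.0, 24], [0, 0, 1]])
T_rel = np.eye(4); T_rel[0, 3] = 0.1          # a small relative translation in x
observed_flow = np.zeros((48, 64, 2))         # stand-in for an estimated flow field
object_flow = observed_flow - camera_induced_flow(depth, K, T_rel)
print(object_flow.shape)  # (48, 64, 2)
```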
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.