SUDS: Scalable Urban Dynamic Scenes
- URL: http://arxiv.org/abs/2303.14536v1
- Date: Sat, 25 Mar 2023 18:55:09 GMT
- Title: SUDS: Scalable Urban Dynamic Scenes
- Authors: Haithem Turki, Jason Y. Zhang, Francesco Ferroni, Deva Ramanan
- Abstract summary: We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes.
We factorize the scene into three separate hash table data structures to efficiently encode static, dynamic, and far-field radiance fields.
Our reconstructions can be scaled to tens of thousands of objects across 1.2 million frames from 1700 videos spanning geospatial footprints of hundreds of kilometers.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes.
Prior work tends to reconstruct single video clips of short durations (up to 10
seconds). Two reasons are that such methods (a) tend to scale linearly with the
number of moving objects and input videos because a separate model is built for
each and (b) tend to require supervision via 3D bounding boxes and panoptic
labels, obtained manually or via category-specific models. As a step towards
truly open-world reconstructions of dynamic cities, we introduce two key
innovations: (a) we factorize the scene into three separate hash table data
structures to efficiently encode static, dynamic, and far-field radiance
fields, and (b) we make use of unlabeled target signals consisting of RGB
images, sparse LiDAR, off-the-shelf self-supervised 2D descriptors, and most
importantly, 2D optical flow.
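To make the factorization concrete, below is a minimal sketch of a three-branch hashed radiance field in PyTorch. The single-level toy spatial hash, feature widths, shared decoder head, and the density-weighted compositing rule are illustrative assumptions for exposition, not the released SUDS implementation.

```python
# A minimal sketch (not the released SUDS code) of a three-branch hashed field:
# a static branch over (x, y, z), a dynamic branch over (x, y, z, t), and a
# far-field branch over view direction. Hash scheme, widths, and the blending
# rule are illustrative assumptions.
import torch
import torch.nn as nn

class ToyHashGrid(nn.Module):
    """Single-level hashed feature table (real systems use multi-resolution grids)."""
    def __init__(self, in_dim, table_size=2**16, feat_dim=8, resolution=128):
        super().__init__()
        self.table = nn.Embedding(table_size, feat_dim)
        self.table_size = table_size
        self.resolution = resolution
        primes = torch.tensor([1, 2654435761, 805459861, 3674653429][:in_dim])
        self.register_buffer("primes", primes)

    def forward(self, x):
        cell = (x * self.resolution).long()                     # quantize coordinates to grid cells
        slot = (cell * self.primes).sum(-1) % self.table_size   # spatial hash -> table slot
        return self.table(slot)

class ThreeBranchField(nn.Module):
    def __init__(self, feat_dim=8):
        super().__init__()
        self.static_branch  = ToyHashGrid(in_dim=3, feat_dim=feat_dim)  # f(x, y, z)
        self.dynamic_branch = ToyHashGrid(in_dim=4, feat_dim=feat_dim)  # f(x, y, z, t)
        self.far_branch     = ToyHashGrid(in_dim=3, feat_dim=feat_dim)  # f(view direction)
        self.head = nn.Linear(feat_dim, 4)  # tiny shared decoder -> (r, g, b, density)

    def forward(self, xyz, t, viewdir):
        # xyz: (N, 3), t: (N, 1), viewdir: (N, 3)
        static_out  = self.head(self.static_branch(xyz))
        dynamic_out = self.head(self.dynamic_branch(torch.cat([xyz, t], dim=-1)))
        far_rgb     = self.head(self.far_branch(viewdir))[..., :3]
        # Blend static and dynamic colors by their relative densities; the far-field
        # color would only be used where rays leave the near-field volume.
        w = torch.softmax(torch.stack([static_out[..., 3], dynamic_out[..., 3]]), dim=0)
        rgb     = w[0][..., None] * static_out[..., :3] + w[1][..., None] * dynamic_out[..., :3]
        density = static_out[..., 3] + dynamic_out[..., 3]
        return rgb, density, far_rgb
```

The point of keeping three separate tables is that static geometry is stored once without a time dimension, only the dynamic table pays for time, and distant content is handled by a cheap direction-indexed branch.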
Operationalizing such inputs via photometric, geometric, and feature-metric
reconstruction losses enables SUDS to decompose dynamic scenes into the static
background, individual objects, and their motions. When combined with our
multi-branch table representation, such reconstructions can be scaled to tens
of thousands of objects across 1.2 million frames from 1700 videos spanning
geospatial footprints of hundreds of kilometers, (to our knowledge) the largest
dynamic NeRF built to date.
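As a rough illustration of how those unlabeled signals become supervision, the sketch below combines photometric, geometric (sparse LiDAR), feature-metric, and flow terms. The loss weights, distance functions, and dictionary keys are assumptions for exposition, not SUDS's exact objective.

```python
# Hypothetical combination of the self-supervised losses named above; weights,
# distances, and key names are illustrative assumptions, not the paper's exact terms.
import torch
import torch.nn.functional as F

def reconstruction_loss(pred, target, w_rgb=1.0, w_depth=0.1, w_feat=0.5, w_flow=0.5):
    """pred/target: dicts of per-ray tensors with keys 'rgb', 'depth', 'feat', 'flow'."""
    l_rgb = F.mse_loss(pred["rgb"], target["rgb"])               # photometric, vs. input RGB
    hits = target["depth"] > 0                                   # LiDAR is sparse: supervise returns only
    l_depth = (F.l1_loss(pred["depth"][hits], target["depth"][hits])
               if hits.any() else pred["depth"].sum() * 0.0)     # geometric
    l_feat = F.mse_loss(pred["feat"], target["feat"])            # feature-metric, vs. 2D descriptors
    l_flow = F.l1_loss(pred["flow"], target["flow"])             # rendered motion vs. 2D optical flow
    return w_rgb * l_rgb + w_depth * l_depth + w_feat * l_feat + w_flow * l_flow
```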
We present qualitative initial results on a variety of tasks enabled by our
representations, including novel-view synthesis of dynamic urban scenes,
unsupervised 3D instance segmentation, and unsupervised 3D cuboid detection. To
compare to prior work, we also evaluate on KITTI and Virtual KITTI 2,
surpassing state-of-the-art methods that rely on ground truth 3D bounding box
annotations while being 10x quicker to train.
Related papers
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion (arXiv, 2024-10-04)
  We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
  By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously used only for static scenes, to dynamic scenes.
  We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
- Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering (arXiv, 2024-07-30)
  We present DynaVol-S, a 3D generative model for dynamic scenes that enables object-centric learning.
  The object-centric voxelization infers per-object occupancy probabilities at individual spatial locations.
  Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
- Shape of Motion: 4D Reconstruction from a Single Video (arXiv, 2024-07-18)
  We introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion.
  We exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases (see the sketch after this list).
  Our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
- Dynamic 3D Gaussian Fields for Urban Areas (arXiv, 2024-06-05)
  We present an efficient neural 3D scene representation for novel-view synthesis (NVS) in large-scale, dynamic urban areas.
  We propose 4DGF, a neural scene representation that scales to large-scale dynamic urban areas.
- HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting (arXiv, 2024-03-19)
  Holistic understanding of urban scenes based on RGB images is a challenging yet important problem.
  Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians.
  Our approach offers the ability to render new viewpoints in real time, yielding 2D and 3D semantic information with high accuracy.
- NVFi: Neural Velocity Fields for 3D Physics Learning from Dynamic Videos (arXiv, 2023-12-11)
  We propose to simultaneously learn the geometry, appearance, and physical velocity of 3D scenes only from video frames.
  We conduct extensive experiments on multiple datasets, demonstrating the superior performance of our method over all baselines.
- STaR: Self-supervised Tracking and Reconstruction of Rigid Objects in Motion with Neural Rendering (arXiv, 2020-12-22)
  We present STaR, a novel method that performs self-supervised tracking and reconstruction of dynamic scenes with rigid motion from multi-view RGB videos without any manual annotation.
  We show that our method can render photorealistic novel views, where novelty is measured on both spatial and temporal axes.