Memorize What Matters: Emergent Scene Decomposition from Multitraverse
- URL: http://arxiv.org/abs/2405.17187v2
- Date: Wed, 29 May 2024 23:32:23 GMT
- Title: Memorize What Matters: Emergent Scene Decomposition from Multitraverse
- Authors: Yiming Li, Zehong Wang, Yue Wang, Zhiding Yu, Zan Gojcic, Marco Pavone, Chen Feng, Jose M. Alvarez
- Abstract summary: We introduce 3D Gaussian Mapping, a camera-only offline mapping framework grounded in 3D Gaussian Splatting.
3DGM converts multitraverse RGB videos from the same region into a Gaussian-based environmental map while concurrently performing 2D ephemeral object segmentation.
We build the Mapverse benchmark, sourced from the Ithaca365 and nuPlan datasets, to evaluate our method in unsupervised 2D segmentation, 3D reconstruction, and neural rendering.
- Score: 54.487589469432706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans naturally retain memories of permanent elements, while ephemeral moments often slip through the cracks of memory. This selective retention is crucial for robotic perception, localization, and mapping. To endow robots with this capability, we introduce 3D Gaussian Mapping (3DGM), a self-supervised, camera-only offline mapping framework grounded in 3D Gaussian Splatting. 3DGM converts multitraverse RGB videos from the same region into a Gaussian-based environmental map while concurrently performing 2D ephemeral object segmentation. Our key observation is that the environment remains consistent across traversals, while objects frequently change. This allows us to exploit self-supervision from repeated traversals to achieve environment-object decomposition. More specifically, 3DGM formulates multitraverse environmental mapping as a robust differentiable rendering problem, treating pixels of the environment and objects as inliers and outliers, respectively. Using robust feature distillation, feature residuals mining, and robust optimization, 3DGM jointly performs 2D segmentation and 3D mapping without human intervention. We build the Mapverse benchmark, sourced from the Ithaca365 and nuPlan datasets, to evaluate our method in unsupervised 2D segmentation, 3D reconstruction, and neural rendering. Extensive results verify the effectiveness and potential of our method for self-driving and robotics.
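The inlier/outlier view in the abstract can be illustrated with a small sketch. It is not the 3DGM implementation: the per-pixel residual values, the function names, and the median-plus-MAD threshold are all assumptions chosen for illustration. Pixels whose residual is unusually large are marked as outliers (candidate ephemeral objects) and excluded from the loss, so only inlier (environment) pixels drive the optimization.

```python
from statistics import median

def split_inliers_outliers(residuals, k=3.0):
    """Return a mask where True marks an outlier pixel.

    Uses a median + k * scaled-MAD threshold, a generic robust-statistics
    rule assumed here for illustration; it is not taken from the paper.
    """
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    threshold = med + k * 1.4826 * mad  # 1.4826 scales MAD to a Gaussian sigma
    return [r > threshold for r in residuals]

def robust_loss(residuals, outlier_mask):
    """Mean residual over inlier pixels only; outliers contribute nothing."""
    inliers = [r for r, is_out in zip(residuals, outlier_mask) if not is_out]
    return sum(inliers) / len(inliers) if inliers else 0.0

# Hypothetical per-pixel residuals between rendered and observed features.
residuals = [0.10, 0.12, 0.09, 0.11, 0.95, 0.10, 0.88, 0.13]
mask = split_inliers_outliers(residuals)
loss = robust_loss(residuals, mask)
```

In this toy input, the two large residuals are flagged as outliers, and the resulting mask doubles as a crude 2D segmentation of the changing content.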
Related papers
- Volumetric Mapping with Panoptic Refinement via Kernel Density Estimation for Mobile Robots [2.8668675011182967]
Mobile robots usually use lightweight networks to segment objects on RGB images and then localize them via depth maps.
We address the problem of panoptic segmentation quality in 3D scene reconstruction by refining segmentation errors using non-parametric statistical methods.
We map the predicted masks into a depth frame to estimate their distribution via kernel densities.
The outliers in depth perception are then rejected without the need for additional parameters.
arXiv Detail & Related papers (2024-12-15T16:46:23Z)
- RIGI: Rectifying Image-to-3D Generation Inconsistency via Uncertainty-aware Learning [27.4552892119823]
Inconsistencies in multi-view snapshots frequently introduce noise and artifacts along object boundaries, undermining the 3D reconstruction process.
We leverage 3D Gaussian Splatting (3DGS) for 3D reconstruction, and explicitly integrate uncertainty-aware learning into the reconstruction process.
We apply adaptive pixel-wise loss weighting to regularize the models, reducing reconstruction intensity in high-uncertainty regions.
arXiv Detail & Related papers (2024-11-28T02:19:28Z)
- Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z)
- Homography Loss for Monocular 3D Object Detection [54.04870007473932]
A differentiable loss function, termed Homography Loss, is proposed, which exploits both 2D and 3D information.
Our method outperforms other state-of-the-art methods by a large margin on the KITTI 3D datasets.
arXiv Detail & Related papers (2022-04-02T03:48:03Z)
- Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation [75.44912541912252]
We propose a three-stage framework named Multi-Initialization Optimization Network (MION).
In the first stage, we strategically select different coarse 3D reconstruction candidates that are compatible with the 2D keypoints of the input sample.
In the second stage, we design a mesh refinement transformer (MRT) to refine each coarse reconstruction result via a self-attention mechanism.
Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction.
arXiv Detail & Related papers (2021-12-24T02:43:58Z)
- DensePose 3D: Lifting Canonical Surface Maps of Articulated Objects to the Third Dimension [71.71234436165255]
We contribute DensePose 3D, a method that can learn such reconstructions in a weakly supervised fashion from 2D image annotations only.
Because it does not require 3D scans, DensePose 3D can be used for learning a wide range of articulated categories such as different animal species.
We show significant improvements compared to state-of-the-art non-rigid structure-from-motion baselines on both synthetic and real data on categories of humans and animals.
arXiv Detail & Related papers (2021-08-31T18:33:55Z)
- Indoor Semantic Scene Understanding using Multi-modality Fusion [0.0]
We present a semantic scene understanding pipeline that fuses 2D and 3D detection branches to generate a semantic map of the environment.
Unlike previous works that were evaluated on collected datasets, we test our pipeline on an active photo-realistic robotic environment.
Our novelty includes rectification of 3D proposals using projected 2D detections and modality fusion based on object size.
arXiv Detail & Related papers (2021-08-17T13:30:02Z)
- 3D Human Mesh Regression with Dense Correspondence [95.92326689172877]
Estimating the 3D mesh of the human body from a single 2D image is an important task with many applications, such as augmented reality and human-robot interaction.
Prior works reconstructed the 3D mesh from a global image feature extracted by a convolutional neural network (CNN), where the dense correspondences between the mesh surface and the image pixels are missing.
This paper proposes a model-free 3D human mesh estimation framework, named DecoMR, which explicitly establishes the dense correspondence between the mesh and the local image features in the UV space.
arXiv Detail & Related papers (2020-06-10T08:50:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.