3D Semantic MapNet: Building Maps for Multi-Object Re-Identification in 3D
- URL: http://arxiv.org/abs/2403.13190v1
- Date: Tue, 19 Mar 2024 23:01:14 GMT
- Title: 3D Semantic MapNet: Building Maps for Multi-Object Re-Identification in 3D
- Authors: Vincent Cartillier, Neha Jain, Irfan Essa,
- Abstract summary: We study the task of 3D multi-object re-identification from embodied tours.
We present 3D Semantic MapNet - a two-stage re-identification model consisting of a 3D object detector that operates on RGB-D videos with known pose, and a differentiable object matching module.
Overall, 3D-SMNet builds object-based maps of each layout and then uses a differentiable matcher to re-identify objects across the tours.
- Score: 16.436661725188962
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We study the task of 3D multi-object re-identification from embodied tours. Specifically, an agent is given two tours of an environment (e.g. an apartment) under two different layouts (e.g. arrangements of furniture). Its task is to detect and re-identify objects in 3D - e.g. a "sofa" moved from location A to B, a new "chair" in the second layout at location C, or a "lamp" from location D in the first layout missing in the second. To support this task, we create an automated infrastructure to generate paired egocentric tours of initial/modified layouts in the Habitat simulator using Matterport3D scenes, YCB and Google-scanned objects. We present 3D Semantic MapNet (3D-SMNet) - a two-stage re-identification model consisting of (1) a 3D object detector that operates on RGB-D videos with known pose, and (2) a differentiable object matching module that solves correspondence estimation between two sets of 3D bounding boxes. Overall, 3D-SMNet builds object-based maps of each layout and then uses a differentiable matcher to re-identify objects across the tours. After training 3D-SMNet on our generated episodes, we demonstrate zero-shot transfer to real-world rearrangement scenarios by instantiating our task in Replica, Active Vision, and RIO environments depicting rearrangements. On all datasets, we find 3D-SMNet outperforms competitive baselines. Further, we show jointly training on real and generated episodes can lead to significant improvements over training on real data alone.
Related papers
- ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic
Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z) - Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene which is referred by a description in natural language.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z) - BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D
Scene Reconstruction From A Single Image [33.126045619754365]
BUOL is a framework with Occupancy-aware Lifting to address the two issues for panoptic 3D scene reconstruction from a single image.
Our method shows a tremendous performance advantage over state-of-the-art methods on synthetic dataset 3D-Front and real-world dataset Matterport3D.
arXiv Detail & Related papers (2023-06-01T17:56:49Z) - SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves the state-of-the-art performance of semantic scene completion on two large-scale benchmark datasets MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z) - Single-view 3D Mesh Reconstruction for Seen and Unseen Categories [69.29406107513621]
Single-view 3D Mesh Reconstruction is a fundamental computer vision task that aims at recovering 3D shapes from single-view RGB images.
This paper tackles Single-view 3D Mesh Reconstruction, to study the model generalization on unseen categories.
We propose an end-to-end two-stage network, GenMesh, to break the category boundaries in reconstruction.
arXiv Detail & Related papers (2022-08-04T14:13:35Z) - Roominoes: Generating Novel 3D Floor Plans From Existing 3D Rooms [22.188206636953794]
We propose the task of generating novel 3D floor plans from existing 3D rooms.
One uses available 2D floor plans to guide selection and deformation of 3D rooms; the other learns to retrieve a set of compatible 3D rooms and combine them into novel layouts.
arXiv Detail & Related papers (2021-12-10T16:17:01Z) - DensePose 3D: Lifting Canonical Surface Maps of Articulated Objects to
the Third Dimension [71.71234436165255]
We contribute DensePose 3D, a method that can learn such reconstructions in a weakly supervised fashion from 2D image annotations only.
Because it does not require 3D scans, DensePose 3D can be used for learning a wide range of articulated categories such as different animal species.
We show significant improvements compared to state-of-the-art non-rigid structure-from-motion baselines on both synthetic and real data on categories of humans and animals.
arXiv Detail & Related papers (2021-08-31T18:33:55Z) - 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z) - Disentangling 3D Prototypical Networks For Few-Shot Concept Learning [29.02523358573336]
We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene.
Our networks incorporate architectural biases that reflect the image formation process, 3D geometry of the world scene, and shape-style interplay.
arXiv Detail & Related papers (2020-11-06T14:08:27Z) - 3D Object Recognition By Corresponding and Quantizing Neural 3D Scene
Representations [29.61554189447989]
We propose a system that learns to detect objects and infer their 3D poses in RGB-D images.
Many existing systems can identify objects and infer 3D poses, but they heavily rely on human labels and 3D annotations.
arXiv Detail & Related papers (2020-10-30T13:56:09Z) - Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping [23.456046776979903]
We propose to leverage multiview data of textitstatic points in arbitrary scenes (static or dynamic) to learn a neural 3D mapping module.
The neural 3D mapper consumes RGB-D data as input, and produces a 3D voxel grid of deep features as output.
We show that our unsupervised 3D object trackers outperform prior unsupervised 2D and 2.5D trackers, and approach the accuracy of supervised trackers.
arXiv Detail & Related papers (2020-08-04T02:59:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.