Semantic MapNet: Building Allocentric Semantic Maps and Representations
from Egocentric Views
- URL: http://arxiv.org/abs/2010.01191v3
- Date: Thu, 11 Mar 2021 00:26:51 GMT
- Title: Semantic MapNet: Building Allocentric Semantic Maps and Representations
from Egocentric Views
- Authors: Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa,
Dhruv Batra
- Abstract summary: We study the task of semantic mapping - specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment.
We build an allocentric top-down semantic map ("what is where?") from egocentric observations of an RGB-D camera with known pose.
We present SemanticMapNet (SMNet), which combines the strengths of projective camera geometry and neural representation learning.
- Score: 50.844459908504476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the task of semantic mapping - specifically, an embodied agent (a
robot or an egocentric AI assistant) is given a tour of a new environment and
asked to build an allocentric top-down semantic map ("what is where?") from
egocentric observations of an RGB-D camera with known pose (via localization
sensors). Towards this goal, we present SemanticMapNet (SMNet), which consists
of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame,
(2) a Feature Projector that projects egocentric features to appropriate
locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan
length x width x feature-dims that learns to accumulate projected egocentric
features, and (4) a Map Decoder that uses the memory tensor to produce semantic
top-down maps. SMNet combines the strengths of (known) projective camera
geometry and neural representation learning. On the task of semantic mapping in
the Matterport3D dataset, SMNet significantly outperforms competitive baselines
by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1
metrics. Moreover, we show how to use the neural episodic memories and
spatio-semantic allocentric representations built by SMNet for subsequent tasks
in the same space - navigating to objects seen during the tour ("Find chair") or
answering questions about the space ("How many chairs did you see in the
house?"). Project page: https://vincentcartillier.github.io/smnet.html.
Related papers
- 3D Semantic MapNet: Building Maps for Multi-Object Re-Identification in 3D [16.436661725188962]
We study the task of 3D multi-object re-identification from embodied tours.
We present 3D Semantic MapNet - a two-stage re-identification model consisting of a 3D object detector that operates on RGB-D videos with known pose, and a differentiable object matching module.
Overall, 3D-SMNet builds object-based maps of each layout and then uses a differentiable matcher to re-identify objects across the tours.
arXiv Detail & Related papers (2024-03-19T23:01:14Z)
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.
It is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association.
We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Neural Implicit Dense Semantic SLAM [83.04331351572277]
We propose a novel RGBD vSLAM algorithm that learns a memory-efficient, dense 3D geometry, and semantic segmentation of an indoor scene in an online manner.
Our pipeline combines classical 3D vision-based tracking and loop closing with neural fields-based mapping.
Our proposed algorithm can greatly enhance scene perception and assist with a range of robot control problems.
arXiv Detail & Related papers (2023-04-27T23:03:52Z)
- Object-level 3D Semantic Mapping using a Network of Smart Edge Sensors [25.393382192511716]
We extend a multi-view 3D semantic mapping system consisting of a network of distributed edge sensors with object-level information.
Our method is evaluated on the public Behave dataset, where it shows pose estimation within a few centimeters, and in real-world experiments with the sensor network in a challenging lab environment.
arXiv Detail & Related papers (2022-11-21T11:13:08Z)
- Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers [34.6312362205904]
We propose an end-to-end one-stage Transformer-based framework for Mapping, termed Trans4Map.
Trans4Map achieves state-of-the-art results, reducing parameters by 67.2% while gaining +3.25% mIoU and +4.09% mBF1 on the Matterport3D dataset.
arXiv Detail & Related papers (2022-07-13T14:01:00Z)
- Episodic Memory Question Answering [55.83870351196461]
We envision a scenario wherein the human communicates with an AI agent powering an augmented reality device by asking questions.
In order to succeed, the ego AI assistant must construct semantically rich and efficient scene memories.
We introduce a new task - Episodic Memory Question Answering (EMQA)
We show that our choice of episodic scene memory outperforms naive, off-the-shelf solutions for the task, as well as a host of very competitive baselines.
arXiv Detail & Related papers (2022-05-03T17:28:43Z)
- HDNet: Human Depth Estimation for Multi-Person Camera-Space Localization [83.57863764231655]
We propose the Human Depth Estimation Network (HDNet), an end-to-end framework for absolute root joint localization.
A skeleton-based Graph Neural Network (GNN) is utilized to propagate features among joints.
We evaluate our HDNet on the root joint localization and root-relative 3D pose estimation tasks with two benchmark datasets.
arXiv Detail & Related papers (2020-07-17T12:44:23Z)
- Visual Semantic SLAM with Landmarks for Large-Scale Outdoor Environment [47.96314050446863]
We build a system to create a semantic 3D map by combining the 3D point cloud from ORB-SLAM with semantic segmentation information from PSPNet-101 for large-scale environments.
We find a way to associate real-world landmarks with the point cloud map and build a topological map based on the semantic map (a minimal illustrative sketch of this label-fusion step follows this list).
arXiv Detail & Related papers (2020-01-04T03:34:23Z)
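As a small illustration of the label-fusion idea from the Visual Semantic SLAM entry above, the sketch below assigns each SLAM map point the class it is most often segmented as across keyframes. It is not that paper's code: the map points and world-to-camera poses are assumed to come from a SLAM system such as ORB-SLAM, the per-pixel class ids from any 2D segmenter (e.g., a PSPNet-style model), and the majority-vote rule and all names are illustrative.

```python
# Minimal sketch of labelling SLAM map points with per-frame 2D semantic
# segmentation via a per-point majority vote (not the paper's code).
# Assumptions: `points_w` and `poses_w2c` come from a SLAM system such as
# ORB-SLAM; `segs` holds per-pixel class ids from any 2D segmenter
# (e.g., a PSPNet-style model). Names and the voting rule are illustrative.
import numpy as np


def label_map_points(points_w, poses_w2c, K, segs, num_classes):
    """Assign each 3D map point the class it is most often segmented as."""
    votes = np.zeros((points_w.shape[0], num_classes), dtype=np.int64)
    H, W = segs[0].shape
    for T_w2c, seg in zip(poses_w2c, segs):
        # World -> camera -> pixel coordinates for every map point.
        pts_c = (T_w2c[:3, :3] @ points_w.T + T_w2c[:3, 3:4]).T        # (N, 3)
        in_front = pts_c[:, 2] > 0.1
        uvw = (K @ pts_c.T).T
        uv = uvw[:, :2] / np.maximum(uvw[:, 2:3], 1e-6)
        u = uv[:, 0].round().astype(int)
        v = uv[:, 1].round().astype(int)
        visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        ids = np.where(visible)[0]
        votes[ids, seg[v[ids], u[ids]]] += 1            # one vote per keyframe
    return votes.argmax(axis=1)                         # (N,) per-point class ids


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(-2, 2, size=(500, 3)) + np.array([0.0, 0.0, 4.0])
    K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
    poses = [np.eye(4)]                                 # single identity keyframe
    segs = [rng.integers(0, 5, size=(240, 320))]
    print(label_map_points(pts, poses, K, segs, num_classes=5)[:10])
```

A full system would add occlusion and depth-consistency checks before voting; the plain majority vote above keeps the sketch minimal.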
This list is automatically generated from the titles and abstracts of the papers on this site.