Indoor Semantic Scene Understanding using Multi-modality Fusion
- URL: http://arxiv.org/abs/2108.07616v1
- Date: Tue, 17 Aug 2021 13:30:02 GMT
- Title: Indoor Semantic Scene Understanding using Multi-modality Fusion
- Authors: Muraleekrishna Gopinathan and Giang Truong and Jumana Abu-Khalaf
- Abstract summary: We present a semantic scene understanding pipeline that fuses 2D and 3D detection branches to generate a semantic map of the environment.
Unlike previous works that were evaluated on collected datasets, we test our pipeline on an active photo-realistic robotic environment.
Our novelty includes rectification of 3D proposals using projected 2D detections and modality fusion based on object size.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Seamless Human-Robot Interaction is the ultimate goal of developing service
robotic systems. For this, the robotic agents have to understand their
surroundings to better complete a given task. Semantic scene understanding
allows a robotic agent to extract semantic knowledge about the objects in the
environment. In this work, we present a semantic scene understanding pipeline
that fuses 2D and 3D detection branches to generate a semantic map of the
environment. The 2D mask proposals from state-of-the-art 2D detectors are
inverse-projected to the 3D space and combined with 3D detections from point
segmentation networks. Unlike previous works that were evaluated on collected
datasets, we test our pipeline on an active photo-realistic robotic environment
- BenchBot. Our novelty includes rectification of 3D proposals using projected
2D detections and modality fusion based on object size. This work is done as
part of the Robotic Vision Scene Understanding Challenge (RVSU). The
performance evaluation demonstrates that our pipeline improves on baseline
methods without introducing a significant computational bottleneck.
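To make the fusion step concrete, the sketch below illustrates the two ideas described in the abstract: inverse-projecting a 2D mask proposal into 3D using a depth image and pinhole intrinsics, and choosing between the 2D-derived and 3D-derived boxes with a size-based rule. It is an illustration under assumed interfaces (the `mask_to_points` and `fuse_by_size` helpers and the class size prior are hypothetical), not the authors' implementation.

```python
import numpy as np

def mask_to_points(mask, depth, fx, fy, cx, cy):
    """Inverse-project a 2D instance mask into camera-frame 3D points
    using a depth image and pinhole intrinsics (sketch, not the paper's code)."""
    v, u = np.nonzero(mask)              # pixel coordinates covered by the mask
    z = depth[v, u]
    valid = z > 0                        # drop pixels with missing depth
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)   # (N, 3) points

def fuse_by_size(box_from_2d, box_from_3d, expected_size):
    """Keep whichever axis-aligned box (min_xyz, max_xyz) better matches a
    class size prior -- a hypothetical stand-in for the size-based fusion rule."""
    def size_error(box):
        extent = box[1] - box[0]
        return np.linalg.norm(extent - expected_size)
    return box_from_2d if size_error(box_from_2d) < size_error(box_from_3d) else box_from_3d

# Usage sketch: build a box from the projected mask and fuse it with a
# 3D-detector proposal for the same object (intrinsics and size prior are placeholders).
# pts = mask_to_points(mask, depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
# box_from_2d = (pts.min(axis=0), pts.max(axis=0))
# fused = fuse_by_size(box_from_2d, box_from_3d, expected_size=np.array([0.6, 0.6, 1.2]))
```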
Related papers
- Memorize What Matters: Emergent Scene Decomposition from Multitraverse [54.487589469432706]
We introduce 3D Gaussian Mapping (3DGM), a camera-only offline mapping framework grounded in 3D Gaussian Splatting.
3DGM converts multitraverse RGB videos from the same region into a Gaussian-based environmental map while concurrently performing 2D ephemeral object segmentation.
We build the Mapverse benchmark, sourced from the Ithaca365 and nuPlan datasets, to evaluate our method in unsupervised 2D segmentation, 3D reconstruction, and neural rendering.
arXiv Detail & Related papers (2024-05-27T14:11:17Z)
- Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds [45.87961177297602]
This work aims to integrate recent methods into a comprehensive framework for robotic interaction and manipulation in human-centric environments.
Specifically, we leverage 3D reconstructions from a commodity 3D scanner for open-vocabulary instance segmentation.
We show the performance and robustness of our model in two sets of real-world experiments including dynamic object retrieval and drawer opening.
arXiv Detail & Related papers (2024-04-18T18:01:15Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection [19.75965521357068]
We propose a novel approach called SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection) to improve the accuracy of 3D object detection.
Our results show that SOGDet consistently enhances the performance of three baseline methods in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP).
This indicates that combining 3D object detection with 3D semantic occupancy leads to a more comprehensive perception of the 3D environment, thereby helping to build more robust autonomous driving systems.
arXiv Detail & Related papers (2023-08-26T07:38:21Z)
- ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding [67.21613160846299]
Embodied Reference Understanding (ERU) is first proposed to address this concern.
A new dataset, ScanERU, is constructed to evaluate the effectiveness of this idea.
arXiv Detail & Related papers (2023-03-23T11:36:14Z)
- Semi-Perspective Decoupled Heatmaps for 3D Robot Pose Estimation from Depth Maps [66.24554680709417]
Knowing the exact 3D location of workers and robots in a collaborative environment enables several real applications.
We propose a non-invasive framework based on depth devices and deep neural networks to estimate the 3D pose of robots from an external camera.
arXiv Detail & Related papers (2022-07-06T08:52:12Z)
- Ground-aware Monocular 3D Object Detection for Autonomous Driving [6.5702792909006735]
Estimating the 3D position and orientation of objects in the environment with a single RGB camera is a challenging task for low-cost urban autonomous driving and mobile robots.
Most existing algorithms are based on geometric constraints from 2D-3D correspondences, which stem from generic 6D object pose estimation.
We introduce a novel neural network module to fully utilize such application-specific priors in the framework of deep learning.
arXiv Detail & Related papers (2021-02-01T08:18:24Z)
- PLUME: Efficient 3D Object Detection from Stereo Images [95.31278688164646]
Existing methods tackle the problem in two steps: first, depth estimation is performed and a pseudo-LiDAR point cloud representation is computed from the depth estimates; then, object detection is performed in 3D space (the depth-to-point-cloud step is sketched after this list).
We propose a model that unifies these two tasks in the same metric space.
Our approach achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.
arXiv Detail & Related papers (2021-01-17T05:11:38Z)
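The pseudo-LiDAR representation referenced in the PLUME entry can be pictured in a few lines: a metric depth map is lifted into a camera-frame point cloud using pinhole intrinsics, and detection then runs on those points as if they came from LiDAR. This is a generic sketch of the conversion, not PLUME's code; the intrinsics in the usage comment are assumed placeholders.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Lift an (H, W) metric depth map into an (N, 3) camera-frame point cloud.
    Generic pseudo-LiDAR conversion sketch; not the PLUME implementation."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grid (cols, rows)
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # keep pixels with valid depth

# Usage (placeholder intrinsics):
# points = depth_to_pseudo_lidar(depth, fx=720.0, fy=720.0, cx=640.0, cy=360.0)
```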