A System for Generalized 3D Multi-Object Search
- URL: http://arxiv.org/abs/2303.03178v2
- Date: Tue, 18 Apr 2023 03:48:11 GMT
- Title: A System for Generalized 3D Multi-Object Search
- Authors: Kaiyu Zheng, Anirudha Paul, Stefanie Tellex
- Abstract summary: GenMOS is a general-purpose system for multi-object search in a 3D region that is robot-independent and environment-agnostic.
Our system enables, for example, a Boston Dynamics Spot robot to find a toy cat hidden underneath a couch in under one minute.
- Score: 10.40566214112389
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Searching for objects is a fundamental skill for robots. As such, we expect object search to eventually become an off-the-shelf capability for robots, similar to, e.g., object detection and SLAM. In contrast, however, no system for 3D object search exists that generalizes across real robots and environments. In this paper, building upon a recent theoretical framework that exploited the octree structure for representing belief in 3D, we present GenMOS (Generalized Multi-Object Search), the first general-purpose system for multi-object search (MOS) in a 3D region that is robot-independent and environment-agnostic. GenMOS takes as input point cloud observations of the local region, object detection results, and localization of the robot's view pose, and outputs a 6D viewpoint to move to through online planning. In particular, GenMOS uses point cloud observations in three ways: (1) to simulate occlusion; (2) to inform occupancy and initialize octree belief; and (3) to sample a belief-dependent graph of view positions that avoid obstacles. We evaluate our system both in simulation and on two real robot platforms. Our system enables, for example, a Boston Dynamics Spot robot to find a toy cat hidden underneath a couch in under one minute. We further integrate 3D local search with 2D global search to handle larger areas, demonstrating the resulting system in a 25m$^2$ lobby area.
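The abstract describes a concrete input/output contract: a local point cloud, object detections, and the robot's view pose go in, and a 6D viewpoint to move to comes out of online planning over an octree belief. The Python sketch below only illustrates that contract under stated assumptions: the names (SimpleOctreeBelief, sample_view_positions, plan_next_viewpoint) are invented rather than GenMOS's actual API, a dense voxel grid stands in for the paper's sparse octree belief, a greedy rule stands in for the paper's online POMDP planning, and occlusion simulation (point-cloud use (1)) is omitted.

```python
# Illustrative sketch only: invented names, a dense grid instead of a sparse
# octree belief, and a greedy rule instead of GenMOS's online POMDP planning.
import numpy as np


class SimpleOctreeBelief:
    """Dense-grid stand-in for an octree belief over a target's position."""

    def __init__(self, region_size=(32, 32, 16), voxel_res=0.1):
        self.res = voxel_res
        self.probs = np.full(region_size, 1.0 / np.prod(region_size))

    def to_voxel(self, xyz):
        idx = np.clip((np.asarray(xyz) / self.res).astype(int),
                      0, np.array(self.probs.shape) - 1)
        return tuple(idx)

    def update(self, point_cloud, detections):
        """Point cloud informs occupancy (a target cannot sit inside obstacle
        surfaces); detections concentrate belief near detected positions."""
        for p in point_cloud:
            self.probs[self.to_voxel(p)] = 0.0
        for d in detections:
            self.probs[self.to_voxel(d)] += 1.0
        total = self.probs.sum()
        if total > 0:
            self.probs /= total


def sample_view_positions(point_cloud, num=20, min_clearance=0.3, seed=0):
    """Sample candidate view positions that keep a clearance from obstacle
    points -- a crude stand-in for the belief-dependent, obstacle-avoiding
    graph of view positions."""
    rng = np.random.default_rng(seed)
    lo, hi = point_cloud.min(axis=0), point_cloud.max(axis=0)
    candidates = rng.uniform(lo, hi, size=(10 * num, 3))
    dists = np.linalg.norm(candidates[:, None, :] - point_cloud[None, :, :], axis=-1)
    free = candidates[dists.min(axis=1) > min_clearance]
    return free[:num] if len(free) > 0 else candidates[:num]


def plan_next_viewpoint(belief, point_cloud):
    """Greedy illustration of 'outputs a 6D viewpoint to move to': choose the
    obstacle-free position closest to the belief mode and look at the mode."""
    mode = np.array(np.unravel_index(belief.probs.argmax(),
                                     belief.probs.shape)) * belief.res
    candidates = sample_view_positions(point_cloud)
    pos = candidates[np.linalg.norm(candidates - mode, axis=1).argmin()]
    direction = mode - pos
    yaw = np.arctan2(direction[1], direction[0])
    pitch = np.arctan2(direction[2], np.linalg.norm(direction[:2]))
    return np.concatenate([pos, [0.0, pitch, yaw]])  # x, y, z, roll, pitch, yaw


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    cloud = rng.uniform(0.0, 3.0, size=(200, 3))   # fake local point cloud (meters)
    detections = [(1.2, 0.8, 0.3)]                 # fake 3D detection of a target
    belief = SimpleOctreeBelief()
    belief.update(cloud, detections)
    print("next 6D viewpoint:", plan_next_viewpoint(belief, cloud))
```

The point of the sketch is the data flow the abstract names, i.e. the point cloud used for occupancy and for sampling obstacle-free view positions, belief updated from detections, and a 6D pose returned, not the specific update or planning rules, which in the paper are POMDP-based.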
Related papers
- 3D Feature Distillation with Object-Centric Priors [9.626027459292926]
2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images.
Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific or focus on indoor room scan data.
We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency.
arXiv Detail & Related papers (2024-06-26T20:16:49Z)
- Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention [86.39271731460927]
3D intention grounding is a new task for 3D object detection in RGB-D scans based on human intention, such as "I want something to support my back."
We introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet dataset.
We also propose IntentNet, our unique approach, designed to tackle this intention-based detection problem.
arXiv Detail & Related papers (2024-05-28T15:48:39Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI [88.03089807278188]
EmbodiedScan is a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding.
It encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories.
Building upon this database, we introduce a baseline framework named Embodied Perceptron.
It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities.
arXiv Detail & Related papers (2023-12-26T18:59:11Z)
- ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding [67.21613160846299]
The task of Embodied Reference Understanding (ERU) is designed to address interactive 3D visual grounding.
A new dataset, ScanERU, is constructed to evaluate the effectiveness of this idea.
arXiv Detail & Related papers (2023-03-23T11:36:14Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc, a method to predict 3D occupancy from multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
- Generalized Object Search [0.9137554315375919]
This thesis develops methods and systems for (multi-)object search in 3D environments under uncertainty.
I implement a robot-independent, environment-agnostic system for generalized object search in 3D.
I deploy it on the Boston Dynamics Spot robot, the Kinova MOVO robot, and the Universal Robots UR5e robotic arm.
arXiv Detail & Related papers (2023-01-24T16:41:36Z)
- Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding [25.270772036342688]
We introduce a novel method for leveraging common sense embedded within large language models for labelling rooms.
The proposed algorithm operates on 3D scene graphs produced by modern spatial perception systems.
arXiv Detail & Related papers (2022-06-09T16:05:35Z)
- Gait Recognition in the Wild with Dense 3D Representations and A Benchmark [86.68648536257588]
Existing studies for gait recognition are dominated by 2D representations like the silhouette or skeleton of the human body in constrained scenes.
This paper aims to explore dense 3D representations for gait recognition in the wild.
We build the first large-scale 3D representation-based gait recognition dataset, named Gait3D.
arXiv Detail & Related papers (2022-04-06T03:54:06Z)
- Indoor Semantic Scene Understanding using Multi-modality Fusion [0.0]
We present a semantic scene understanding pipeline that fuses 2D and 3D detection branches to generate a semantic map of the environment.
Unlike previous works that were evaluated on collected datasets, we test our pipeline on an active photo-realistic robotic environment.
Our novel contributions include the rectification of 3D proposals using projected 2D detections and modality fusion based on object size.
arXiv Detail & Related papers (2021-08-17T13:30:02Z)