ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding
- URL: http://arxiv.org/abs/2303.13186v1
- Date: Thu, 23 Mar 2023 11:36:14 GMT
- Title: ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding
- Authors: Ziyang Lu, Yunqiang Pei, Guoqing Wang, Yang Yang, Zheng Wang, Heng Tao Shen
- Abstract summary: A new task, Embodied Reference Understanding (ERU), is first designed to address this concern.
A new dataset called ScanERU is constructed to evaluate the effectiveness of this idea.
- Score: 67.21613160846299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aiming to link natural language descriptions to specific regions of a 3D
scene represented as point clouds, 3D visual grounding is a fundamental task for
human-robot interaction. Recognition errors can significantly reduce overall accuracy
and thereby degrade the operation of AI systems. Despite their effectiveness, existing
methods suffer from low recognition accuracy when multiple adjacent objects have
similar appearances. To address this issue, this work introduces human-robot
interaction as a cue to facilitate the development of 3D visual grounding.
Specifically, a new task termed Embodied Reference Understanding (ERU) is first
designed for this purpose. A new dataset called ScanERU is then constructed to
evaluate the effectiveness of this idea. Different from existing datasets, ScanERU is
the first to combine semi-synthetic scene integration with textual, real-world visual,
and synthetic gestural information. Additionally, this paper formulates a heuristic
framework based on attention mechanisms and human body movements to inform research on
ERU. Experimental results demonstrate the superiority of the proposed method,
especially in recognizing multiple identical objects. Our code and dataset will be
made publicly available.
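To make the abstract's high-level description more concrete, below is a minimal,
illustrative sketch (not the authors' released code) of how gestural (body-movement)
features could be fused with language features and 3D object-proposal features through
attention for grounding. All module names, feature dimensions, and the scoring head are
assumptions introduced for illustration; only the general idea of attention-based fusion
of textual, point-cloud, and gesture cues comes from the abstract.

```python
# Minimal sketch, assuming pre-extracted features for proposals, text, and gestures.
# Requires PyTorch >= 1.9 for batch_first in nn.MultiheadAttention.
import torch
import torch.nn as nn

class GestureAwareGroundingSketch(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Cross-attention: object proposals attend to the language tokens.
        self.lang_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention: object proposals attend to body-movement (gesture) cues.
        self.gesture_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Per-proposal confidence that it is the referred object.
        self.score_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, proposal_feats, lang_feats, gesture_feats):
        # proposal_feats: (B, P, d) features of candidate 3D objects
        # lang_feats:     (B, T, d) encoded description tokens
        # gesture_feats:  (B, J, d) encoded body/arm pose cues
        x, _ = self.lang_attn(proposal_feats, lang_feats, lang_feats)
        x, _ = self.gesture_attn(x, gesture_feats, gesture_feats)
        return self.score_head(x).squeeze(-1)  # (B, P) grounding logits

# Toy usage with random features
model = GestureAwareGroundingSketch()
logits = model(torch.randn(2, 32, 256), torch.randn(2, 20, 256), torch.randn(2, 17, 256))
print(logits.shape)  # torch.Size([2, 32])
```

In this sketch the proposals first attend to the description and then to the gesture
cues; scoring each proposal independently keeps the example simple, whereas an actual
grounding head could be trained with a cross-entropy or contrastive objective over the
candidate proposals.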
Related papers
- Grounding 3D Scene Affordance From Egocentric Interactions [52.5827242925951]
Grounding 3D scene affordance aims to locate interactive regions in 3D environments.
We introduce a novel task: grounding 3D scene affordance from egocentric interactions.
arXiv Detail & Related papers (2024-09-29T10:46:19Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region feature of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z)
- RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of synthetic datasets, which consist of CAD object models, to boost learning on real datasets.
Recent work on 3D pre-training fails when transferring features learned on synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
- Indoor Semantic Scene Understanding using Multi-modality Fusion [0.0]
We present a semantic scene understanding pipeline that fuses 2D and 3D detection branches to generate a semantic map of the environment.
Unlike previous works that were evaluated on collected datasets, we test our pipeline on an active photo-realistic robotic environment.
Our novelty includes rectification of 3D proposals using projected 2D detections and modality fusion based on object size.
arXiv Detail & Related papers (2021-08-17T13:30:02Z)
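The entry above on indoor semantic scene understanding mentions rectifying 3D proposals
using projected 2D detections. As a generic illustration only (not that paper's actual
pipeline), the sketch below projects a 3D proposal's center into the image with assumed
pinhole intrinsics and overwrites its label when a more confident 2D detection covers
the projection; all function names, camera parameters, and the score-based rule are
assumptions.

```python
# Illustrative sketch, assuming camera-frame proposal centers and pinhole intrinsics.
import numpy as np

def project_point(point_cam, fx=525.0, fy=525.0, cx=320.0, cy=240.0):
    """Project a 3D point in camera coordinates (meters) to pixel coordinates."""
    x, y, z = point_cam
    return np.array([fx * x / z + cx, fy * y / z + cy])

def rectify_proposal(proposal, detections_2d):
    """proposal: {'center_cam': (3,), 'label': str, 'score': float}
    detections_2d: list of {'box': (x1, y1, x2, y2), 'label': str, 'score': float}"""
    u, v = project_point(np.asarray(proposal["center_cam"], dtype=float))
    for det in detections_2d:
        x1, y1, x2, y2 = det["box"]
        inside = x1 <= u <= x2 and y1 <= v <= y2
        if inside and det["score"] > proposal["score"]:
            # Trust the stronger 2D evidence and overwrite the 3D label.
            proposal = {**proposal, "label": det["label"]}
    return proposal

# Toy usage
prop = {"center_cam": (0.2, 0.0, 2.0), "label": "table", "score": 0.4}
dets = [{"box": (300, 180, 420, 300), "label": "chair", "score": 0.9}]
print(rectify_proposal(prop, dets))  # label becomes "chair"
```

A real system would project the full 3D box rather than only its center and could
weight the two modalities, for example by object size as the entry describes.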