ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding
- URL: http://arxiv.org/abs/2303.13186v1
- Date: Thu, 23 Mar 2023 11:36:14 GMT
- Title: ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding
- Authors: Ziyang Lu, Yunqiang Pei, Guoqing Wang, Yang Yang, Zheng Wang, Heng Tao Shen
- Abstract summary: Embodied Reference Understanding (ERU) is first designed for this concern.
New dataset called ScanERU is constructed to evaluate the effectiveness of this idea.
- Score: 67.21613160846299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D visual grounding, which links natural language descriptions to specific regions of a 3D scene represented as point clouds, is a fundamental task for human-robot interaction. Recognition errors can significantly reduce overall accuracy and thereby degrade the operation of AI systems. Despite their effectiveness, existing methods suffer from low recognition accuracy when multiple adjacent objects have similar appearances. To address this issue, this work introduces human-robot interaction as a cue to facilitate 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this concern. A new dataset called ScanERU is then constructed to evaluate the effectiveness of this idea. Unlike existing datasets, ScanERU is the first to cover semi-synthetic scene integration with textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to stimulate research on ERU. Experimental results demonstrate the superiority of the proposed method, especially in recognizing multiple identical objects. Our code and dataset will be made publicly available.
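The paper's framework itself is not reproduced here, but its core idea — using an embodied pointing gesture as an additional cue to disambiguate multiple identical objects that language alone cannot separate — can be sketched as follows. This is a minimal illustrative sketch, not the authors' method: the function name, the cosine-based gesture score, and the mixing weight `alpha` are all assumptions.

```python
import numpy as np

def grounding_scores(centers, gesture_origin, gesture_dir, lang_sim, alpha=0.5):
    """Score candidate objects by mixing a pointing-gesture cue with
    language-to-object similarity (illustrative sketch, not ScanERU itself).

    centers: (N, 3) candidate object centers in the scene
    gesture_origin, gesture_dir: the pointing ray (dir need not be unit length)
    lang_sim: (N,) language-object similarity in [0, 1]
    alpha: weight on the gestural cue (assumed hyperparameter)
    """
    d = gesture_dir / np.linalg.norm(gesture_dir)
    v = centers - gesture_origin
    # Cosine of the angle between the pointing ray and each object:
    # objects the gesture points at score close to 1.
    cos = (v @ d) / (np.linalg.norm(v, axis=1) + 1e-8)
    gesture = (cos + 1.0) / 2.0          # map [-1, 1] into [0, 1]
    logits = alpha * gesture + (1.0 - alpha) * lang_sim
    e = np.exp(logits - logits.max())    # softmax over candidates
    return e / e.sum()

# Two identical chairs: language similarity is tied, the gesture breaks the tie.
centers = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
probs = grounding_scores(centers, np.zeros(3), np.array([1.0, 0.0, 0.0]),
                         lang_sim=np.array([0.9, 0.9]))
```

With equal language scores, the object lying along the pointing direction receives the higher probability, which mirrors the motivation of ERU: gesture resolves references that textual descriptions leave ambiguous.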
Related papers
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z) - Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in the 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region feature of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z) - Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding [25.270772036342688]
We introduce a novel method for leveraging common sense embedded within large language models for labelling rooms.
The proposed algorithm operates on 3D scene graphs produced by modern spatial perception systems.
arXiv Detail & Related papers (2022-06-09T16:05:35Z) - RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets.
Recent work on 3D pre-training exhibits failure when transferring features learned on synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z) - Indoor Semantic Scene Understanding using Multi-modality Fusion [0.0]
We present a semantic scene understanding pipeline that fuses 2D and 3D detection branches to generate a semantic map of the environment.
Unlike previous works that were evaluated on collected datasets, we test our pipeline on an active photo-realistic robotic environment.
Our novelty includes rectification of 3D proposals using projected 2D detections and modality fusion based on object size.
arXiv Detail & Related papers (2021-08-17T13:30:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.