ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding
- URL: http://arxiv.org/abs/2303.13186v1
- Date: Thu, 23 Mar 2023 11:36:14 GMT
- Title: ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding
- Authors: Ziyang Lu, Yunqiang Pei, Guoqing Wang, Yang Yang, Zheng Wang, Heng Tao Shen
- Abstract summary: Embodied Reference Understanding (ERU) is first designed for this concern.
New dataset called ScanERU is constructed to evaluate the effectiveness of this idea.
- Score: 67.21613160846299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D visual grounding, which links natural language descriptions to specific regions of a 3D scene represented as point clouds, is a fundamental task for human-robot interaction. Recognition errors can significantly reduce overall accuracy and thereby degrade the operation of AI systems. Despite their effectiveness, existing methods suffer from low recognition accuracy when multiple adjacent objects have similar appearances. To address this issue, this work introduces human-robot interaction as a cue to facilitate 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this concern. A new dataset called ScanERU is then constructed to evaluate the effectiveness of this idea. Unlike existing datasets, ScanERU is the first to cover semi-synthetic scene integration with textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to stimulate research on ERU. Experimental results demonstrate the superiority of the proposed method, especially in recognizing multiple identical objects. Our code and dataset will be made publicly available.
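The paper's framework itself is not reproduced here, but its core idea — using an embodied pointing gesture as an additional cue to disambiguate multiple identical objects that language alone cannot separate — can be sketched as follows. This is a minimal illustrative sketch, not the authors' method: the function name, the cosine-based gesture score, and the mixing weight `alpha` are all assumptions.

```python
import numpy as np

def grounding_scores(centers, gesture_origin, gesture_dir, lang_sim, alpha=0.5):
    """Score candidate objects by mixing a pointing-gesture cue with
    language-to-object similarity (illustrative sketch, not ScanERU itself).

    centers: (N, 3) candidate object centers in the scene
    gesture_origin, gesture_dir: the pointing ray (dir need not be unit length)
    lang_sim: (N,) language-object similarity in [0, 1]
    alpha: weight on the gestural cue (assumed hyperparameter)
    """
    d = gesture_dir / np.linalg.norm(gesture_dir)
    v = centers - gesture_origin
    # Cosine of the angle between the pointing ray and each object:
    # objects the gesture points at score close to 1.
    cos = (v @ d) / (np.linalg.norm(v, axis=1) + 1e-8)
    gesture = (cos + 1.0) / 2.0          # map [-1, 1] into [0, 1]
    logits = alpha * gesture + (1.0 - alpha) * lang_sim
    e = np.exp(logits - logits.max())    # softmax over candidates
    return e / e.sum()

# Two identical chairs: language similarity is tied, the gesture breaks the tie.
centers = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
probs = grounding_scores(centers, np.zeros(3), np.array([1.0, 0.0, 0.0]),
                         lang_sim=np.array([0.9, 0.9]))
```

With equal language scores, the object lying along the pointing direction receives the higher probability, which mirrors the motivation of ERU: gesture resolves references that textual descriptions leave ambiguous.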
Related papers
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z) - Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in the 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region feature of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z) - Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding [25.270772036342688]
We introduce a novel method for leveraging common sense embedded within large language models for labelling rooms.
The proposed algorithm operates on 3D scene graphs produced by modern spatial perception systems.
arXiv Detail & Related papers (2022-06-09T16:05:35Z) - RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets.
Recent work on 3D pre-training exhibits failure when transferring features learned on synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z) - Indoor Semantic Scene Understanding using Multi-modality Fusion [0.0]
We present a semantic scene understanding pipeline that fuses 2D and 3D detection branches to generate a semantic map of the environment.
Unlike previous works that were evaluated on collected datasets, we test our pipeline on an active photo-realistic robotic environment.
Our novelty includes rectification of 3D proposals using projected 2D detections and modality fusion based on object size.
arXiv Detail & Related papers (2021-08-17T13:30:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.