Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding
- URL: http://arxiv.org/abs/2506.05199v2
- Date: Tue, 24 Jun 2025 16:13:34 GMT
- Title: Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding
- Authors: Yani Zhang, Dongming Wu, Hao Shi, Yingfei Liu, Tiancai Wang, Haoqiang Fan, Xingping Dong,
- Abstract summary: Embodied 3D grounding aims to localize target objects described in human instructions from an ego-centric viewpoint. Most methods follow a two-stage paradigm in which a trained 3D detector's optimized backbone parameters are used to initialize a grounding model. In this study, we assess the grounding performance of detection models using predicted boxes filtered by the target category.
- Score: 29.035369822597218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied 3D grounding aims to localize target objects described in human instructions from an ego-centric viewpoint. Most methods follow a two-stage paradigm where a trained 3D detector's optimized backbone parameters are used to initialize a grounding model. In this study, we explore a fundamental question: Does embodied 3D grounding benefit enough from detection? To answer this question, we assess the grounding performance of detection models using predicted boxes filtered by the target category. Surprisingly, these detection models without any instruction-specific training outperform the grounding models explicitly trained with language instructions. This indicates that even category-level embodied 3D grounding may not be well resolved, let alone more fine-grained context-aware grounding. Motivated by this finding, we propose DEGround, which shares DETR queries as the object representation for both DEtection and Grounding and enables grounding to benefit from basic category classification and box detection. Based on this framework, we further introduce a regional activation grounding module that highlights instruction-related regions and a query-wise modulation module that incorporates sentence-level semantics into the query representation, strengthening the context-aware understanding of language instructions. Remarkably, DEGround outperforms the state-of-the-art model BIP3D by 7.52% in overall accuracy on the EmbodiedScan validation set. The source code will be publicly available at https://github.com/zyn213/DEGround.
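To make the diagnostic described in the abstract concrete, below is a minimal sketch of the category-filtered evaluation: a detector's predicted boxes are filtered to the instruction's target category, and the most confident survivor is scored against the referred object's ground-truth box. This is an illustrative assumption, not the authors' released evaluation code; the helper names, the sample dictionary format, and the axis-aligned IoU (EmbodiedScan actually uses 9-DoF oriented boxes) are all simplifications.

```python
# Hedged sketch of the paper's diagnostic experiment: treat a plain 3D detector as a
# "grounder" by keeping only boxes whose predicted class matches the instruction's
# target category, then scoring the top-confidence survivor against ground truth.
from typing import Dict, List, Sequence


def axis_aligned_iou_3d(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, dx, dy, dz).

    The axis-aligned form is an assumption made only to keep the sketch
    self-contained; the benchmark's official metric handles oriented boxes.
    """
    inter = 1.0
    for i in range(3):
        lo = max(box_a[i] - box_a[i + 3] / 2, box_b[i] - box_b[i + 3] / 2)
        hi = min(box_a[i] + box_a[i + 3] / 2, box_b[i] + box_b[i + 3] / 2)
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    vol_a = box_a[3] * box_a[4] * box_a[5]
    vol_b = box_b[3] * box_b[4] * box_b[5]
    return inter / (vol_a + vol_b - inter)


def category_filtered_grounding_accuracy(samples: List[Dict], iou_thr: float = 0.25) -> float:
    """Each hypothetical sample holds detector outputs and the grounding annotation:
    'boxes', 'labels', 'scores' from the detector, plus the instruction's
    'target_category' and the referred object's 'gt_box'."""
    hits = 0
    for s in samples:
        # Keep only detections of the instruction's target category.
        candidates = [
            (score, box)
            for box, label, score in zip(s["boxes"], s["labels"], s["scores"])
            if label == s["target_category"]
        ]
        if not candidates:
            continue  # no box of the right class counts as a miss
        _, best_box = max(candidates, key=lambda c: c[0])
        if axis_aligned_iou_3d(best_box, s["gt_box"]) >= iou_thr:
            hits += 1
    return hits / max(len(samples), 1)
```

Evaluated this way, a detector uses no language signal beyond the target category, which is why its reported edge over instruction-trained grounders motivates DEGround's design of sharing DETR queries between detection and grounding.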
Related papers
- A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding [21.59984961930343]
We introduce a benchmark for 3D occupancy grounding in challenging outdoor scenes. Built on the nuScenes dataset, it integrates natural language with voxel-level occupancy annotations. We also propose GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multi-modal learning.
arXiv Detail & Related papers (2025-08-02T05:05:50Z) - Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation [54.04601077224252]
Embodied scene understanding requires not only comprehending visual-spatial information but also determining where to explore next in the 3D physical world. 3D vision-language learning enables embodied agents to effectively explore and understand their environment. The model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images.
arXiv Detail & Related papers (2025-07-05T14:15:52Z) - ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning [68.4209681278336]
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions. Current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals. We propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping.
arXiv Detail & Related papers (2025-03-30T03:40:35Z) - ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities [23.18281583681258]
We propose a new task called 3D reasoning grounding and introduce a new benchmark ScanReason.
ScanReason provides over 10K question-answer-location pairs spanning five reasoning types that require the synergy of reasoning and grounding.
A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference.
arXiv Detail & Related papers (2024-07-01T17:59:35Z) - Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z) - PatchContrast: Self-Supervised Pre-training for 3D Object Detection [14.493213289990962]
We introduce PatchContrast, a novel self-supervised point cloud pre-training framework for 3D object detection. We show that our method outperforms existing state-of-the-art models on three commonly used 3D detection datasets.
arXiv Detail & Related papers (2023-08-14T07:45:54Z) - Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving [91.91552963872596]
We propose a new multi-modal visual grounding task, termed LiDAR Grounding.
It jointly learns the LiDAR-based object detector with the language features and predicts the targeted region directly from the detector.
Our work offers a deeper insight into the LiDAR-based grounding task, and we expect it to point to a promising direction for the autonomous driving community.
arXiv Detail & Related papers (2023-05-25T06:22:10Z) - An Empirical Study of Pseudo-Labeling for Image-based 3D Object Detection [72.30883544352918]
We investigate whether pseudo-labels can provide effective supervision for the baseline models under varying settings.
We achieve 20.23 AP for moderate level on the KITTI-3D testing set without bells and whistles, improving the baseline model by 6.03 AP.
We hope this work can provide insights for the image-based 3D detection community under a semi-supervised setting.
arXiv Detail & Related papers (2022-08-15T12:17:46Z) - Det6D: A Ground-Aware Full-Pose 3D Object Detector for Improving Terrain Robustness [1.4620086904601473]
We propose Det6D, the first full-degree-of-freedom 3D object detector without spatial and postural limitations.
To predict full-degree poses, including pitch and roll, we design a ground-aware orientation branch.
Experiments on various datasets demonstrate the effectiveness and robustness of our method in different terrains.
arXiv Detail & Related papers (2022-07-19T17:12:48Z) - SASA: Semantics-Augmented Set Abstraction for Point-based 3D Object Detection [78.90102636266276]
We propose a novel set abstraction method named Semantics-Augmented Set Abstraction (SASA).
Based on the estimated point-wise foreground scores, we then propose a semantics-guided point sampling algorithm to help retain more important foreground points during down-sampling.
In practice, SASA is shown to be effective in identifying valuable points related to foreground objects and improving feature learning for point-based 3D detection.
arXiv Detail & Related papers (2022-01-06T08:54:47Z) - Utilizing Every Image Object for Semi-supervised Phrase Grounding [25.36231298036066]
Phrase grounding models localize an object in the image given a referring expression.
In this paper, we study how image objects without labeled queries can be used to train semi-supervised phrase grounding.
We show that our predictors allow the grounding system to learn from objects without labeled queries, improving accuracy by a relative 34.9% when combined with the detection results.
arXiv Detail & Related papers (2020-11-05T04:25:25Z)