Unified Representation Space for 3D Visual Grounding
- URL: http://arxiv.org/abs/2506.14238v1
- Date: Tue, 17 Jun 2025 06:53:15 GMT
- Title: Unified Representation Space for 3D Visual Grounding
- Authors: Yinuo Zheng, Lipeng Gu, Honghua Chen, Liangliang Nan, Mingqiang Wei
- Abstract summary: 3D visual grounding aims to identify objects in 3D scenes based on text descriptions. Existing methods rely on separately pre-trained vision and text encoders, resulting in a significant gap between the two modalities. The paper proposes UniSpace-3D, which introduces a unified representation space for 3DVG.
- Score: 18.652577474202015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D visual grounding (3DVG) is a critical task in scene understanding that aims to identify objects in 3D scenes based on text descriptions. However, existing methods rely on separately pre-trained vision and text encoders, resulting in a significant gap between the two modalities in terms of spatial geometry and semantic categories. This discrepancy often causes errors in object positioning and classification. This paper proposes UniSpace-3D, which introduces a unified representation space for 3DVG, effectively bridging the gap between visual and textual features. Specifically, UniSpace-3D incorporates three designs: i) a unified representation encoder that leverages the pre-trained CLIP model to map visual and textual features into a unified representation space; ii) a multi-modal contrastive learning module that further reduces the modality gap; iii) a language-guided query selection module that utilizes positional and semantic information to identify object candidate points aligned with textual descriptions. Extensive experiments demonstrate that UniSpace-3D outperforms baseline models by at least 2.24% on the ScanRefer and Nr3D/Sr3D datasets. The code will be made available upon acceptance of the paper.
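The abstract describes the three components in enough detail to sketch their interplay. Below is a minimal PyTorch illustration of one way designs i)-iii) could fit together; the module names, feature dimensions, symmetric InfoNCE-style loss, and top-k scoring rule are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedRepresentationEncoder(nn.Module):
    """Design i): project visual and text features into one shared
    (CLIP-aligned) space. Projection heads and dimensions are assumed."""
    def __init__(self, visual_dim=288, text_dim=768, unified_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, unified_dim)
        self.text_proj = nn.Linear(text_dim, unified_dim)

    def forward(self, visual_feats, text_feats):
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def multimodal_contrastive_loss(v, t, temperature=0.07):
    """Design ii): a symmetric InfoNCE-style loss over matched visual/text
    pairs; the paper's exact formulation is not given in the abstract."""
    logits = v @ t.T / temperature                      # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

def language_guided_query_selection(point_feats, text_feat, k=256):
    """Design iii): score candidate points by similarity to the sentence
    embedding and keep the top-k as object queries, one plausible reading
    of using 'positional and semantic information'."""
    scores = F.normalize(point_feats, dim=-1) @ F.normalize(text_feat, dim=-1)
    top_idx = scores.topk(min(k, scores.numel())).indices
    return point_feats[top_idx], top_idx
```

In a training loop of this shape, the contrastive loss would be applied to the unified encoder's outputs, while query selection runs per scene-description pair; how the paper actually wires these stages together is not specified in the abstract.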
Related papers
- SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names.
arXiv Detail & Related papers (2025-07-10T14:01:24Z)
- PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum [20.206273757144547]
PGOV3D is a novel framework that introduces a partial-to-global curriculum for improving open-vocabulary 3D semantic segmentation. In the first stage, we pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex.
arXiv Detail & Related papers (2025-06-30T08:13:07Z)
- AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding [15.944945244005952]
3D visual grounding aims to localize the unique target described by natural language in 3D scenes. We propose a novel 2D-assisted 3D visual grounding framework that constructs semantic-spatial scene graphs with referred-object discrimination for relationship perception.
arXiv Detail & Related papers (2025-05-07T02:02:15Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection [45.68105299990119]
Open-vocabulary 3D object detection (OV-3DOD) aims at localizing and classifying novel objects beyond closed sets. We propose a hierarchical framework, named HCMA, to simultaneously learn local object and global scene information for OV-3DOD.
arXiv Detail & Related papers (2025-03-10T17:55:22Z)
- Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces [52.237827968294766]
We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance. We then focus on extracting subspaces of the corresponding feature spaces and discover that projecting learned representations onto well-chosen lower-dimensional subspaces makes the quality of alignment significantly higher (see the sketch after this list). Ours is the first work that helps to establish a baseline for post-training alignment of 3D uni-modal and text feature spaces.
arXiv Detail & Related papers (2025-03-07T09:51:56Z)
- AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene. Existing approaches commonly encounter a shortage of text-3D pairs available for training. We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
- 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance [68.8825501902835]
3DSS-VLG is a weakly supervised approach for 3D semantic segmentation with 2D vision-language guidance.
To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation using the textual semantic information of category labels.
arXiv Detail & Related papers (2024-07-13T09:39:11Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
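As noted in the "Escaping Plato's Cave" entry above, the subspace-projection idea lends itself to a compact sketch. The following PyTorch snippet illustrates one SVD-based way to project each modality onto a low-dimensional principal subspace before comparing paired features; the paper's actual subspace-selection procedure and any learned map between the subspaces may differ.

```python
import torch
import torch.nn.functional as F

def principal_subspace(feats: torch.Tensor, k: int) -> torch.Tensor:
    """Return a basis (D, k) of the top-k principal subspace of (N, D) features."""
    centered = feats - feats.mean(dim=0, keepdim=True)
    # Right singular vectors are the principal directions of the feature cloud.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:k].T

def subspace_alignment(feats_3d, feats_text, k=64):
    """Cosine similarity of paired 3D/text features after projecting each
    modality onto its own top-k subspace, where the summary above reports
    alignment quality to be markedly higher than in the full spaces."""
    p3d = F.normalize(feats_3d @ principal_subspace(feats_3d, k), dim=-1)
    ptxt = F.normalize(feats_text @ principal_subspace(feats_text, k), dim=-1)
    return (p3d * ptxt).sum(dim=-1)  # one similarity score per pair
```

In practice, a small learned map between the two k-dimensional subspaces would likely still be fitted, since independently computed bases are not guaranteed to correspond coordinate-by-coordinate; the snippet shows only the projection step.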
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.