3D-DRES: Detailed 3D Referring Expression Segmentation
- URL: http://arxiv.org/abs/2603.02896v1
- Date: Tue, 03 Mar 2026 11:45:54 GMT
- Title: 3D-DRES: Detailed 3D Referring Expression Segmentation
- Authors: Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, Liujuan Cao,
- Abstract summary: We introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase-to-3D-instance mapping. We present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Our experimental results demonstrate that models trained on DetailRefer excel at phrase-level segmentation and show surprising improvements on traditional 3D-RES benchmarks.
- Score: 53.88273255459736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current 3D visual grounding tasks operate only at the level of sentence-level detection or segmentation, and thus fail to leverage the rich compositional and contextual reasoning within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase-to-3D-instance mapping, aimed at enhancing fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm in which each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both the sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.
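The phrase-instance annotation paradigm can be pictured as a record that ties each noun phrase in a description (by its character span) to the IDs of the 3D instances it refers to, alongside the sentence-level target used by classic 3D-RES. The following is a minimal, hypothetical sketch of such a record; the field names, scene ID, and ID conventions are illustrative assumptions, not DetailRefer's actual schema:

```python
from dataclasses import dataclass

@dataclass
class PhraseAnnotation:
    """One noun phrase in a description, mapped to 3D instances.

    Hypothetical schema; field names are illustrative, not DetailRefer's.
    """
    phrase: str                 # surface text of the noun phrase
    char_span: tuple[int, int]  # start/end character offsets in the description
    instance_ids: list[int]     # IDs of the referenced 3D instance(s)

@dataclass
class DetailedReferringExpression:
    scene_id: str
    description: str
    target_instance_id: int          # sentence-level target (classic 3D-RES)
    phrases: list[PhraseAnnotation]  # phrase-level mapping (3D-DRES)

# Illustrative example: both noun phrases resolve to 3D instances,
# while only the first is the sentence-level target.
sample = DetailedReferringExpression(
    scene_id="scene0000_00",
    description="the chair next to the wooden table",
    target_instance_id=7,
    phrases=[
        PhraseAnnotation("the chair", (0, 9), [7]),
        PhraseAnnotation("the wooden table", (18, 34), [12]),
    ],
)
```

A dual-mode baseline such as DetailBase could then be supervised at both granularities: a sentence-level mask from `target_instance_id` and one mask per entry in `phrases`.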
Related papers
- ReferSplat: Referring Segmentation in 3D Gaussian Splatting [60.73702075842278]
The Referring 3D Gaussian Splatting (R3DGS) task aims to segment target objects in a 3D Gaussian scene based on natural language descriptions. To address its challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions.
arXiv Detail & Related papers (2025-08-11T17:59:30Z)
- Segment Any 3D-Part in a Scene from a Sentence [50.46950922754459]
This paper aims to achieve the segmentation of any 3D part in a scene based on natural language descriptions. We introduce the 3D-PU dataset, the first large-scale 3D dataset with dense part annotations. On the methodological side, we propose OpenPart3D, a 3D-input-only framework to tackle the challenges of part-level segmentation.
arXiv Detail & Related papers (2025-06-24T05:51:22Z)
- AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene. Existing approaches commonly encounter a shortage of text-3D pairs available for training. We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
- 3D-GRES: Generalized 3D Referring Expression Segmentation [77.10044505645064]
3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description.
Generalized 3D Referring Expression Segmentation (3D-GRES) extends this capability to segmenting any number of instances based on natural language instructions.
arXiv Detail & Related papers (2024-07-30T08:59:05Z)
- RefMask3D: Language-Guided Transformer for 3D Referring Segmentation [32.11635464720755]
RefMask3D explores comprehensive multi-modal feature interaction and understanding.
RefMask3D outperforms the previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset.
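For reference, mIoU on referring segmentation benchmarks is the intersection-over-union between the predicted and ground-truth point-level masks, averaged over all expressions. A minimal sketch of that computation, assuming boolean per-point masks (generic evaluation logic, not RefMask3D's or ScanRefer's official code):

```python
import numpy as np

def mean_iou(pred_masks: list[np.ndarray], gt_masks: list[np.ndarray]) -> float:
    """Mean IoU over referring expressions; each mask is a boolean
    array over the scene's points. Generic sketch, not official code."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        intersection = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        # Convention: an empty prediction matching an empty target scores 1.
        ious.append(intersection / union if union > 0 else 1.0)
    return float(np.mean(ious))
```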
arXiv Detail & Related papers (2024-07-25T17:58:03Z)
- Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models [20.277479473218513]
We introduce a new task: Zero-Shot 3D Reasoning for part searching and localization within objects.
We design a simple baseline method, Reasoning3D, with the capability to understand and execute complex commands.
We show that Reasoning3D can effectively localize and highlight parts of 3D objects based on implicit textual queries.
arXiv Detail & Related papers (2024-05-29T17:56:07Z)
- Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model [108.35777542298224]
Reason3D processes point cloud data and text prompts to produce textual responses and segmentation masks. We propose a hierarchical mask decoder that employs a coarse-to-fine approach to segment objects within expansive scenes.
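A coarse-to-fine decoder of this kind can be pictured as first scoring coarse regions (e.g., superpoints) against the text query, then predicting point-level masks only within the selected regions. The sketch below is a schematic illustration under those assumptions, not Reason3D's actual decoder:

```python
import numpy as np

def coarse_to_fine_segment(point_feats, superpoint_ids, text_feat, top_k=8):
    """Schematic coarse-to-fine segmentation (not Reason3D's implementation).

    point_feats:    (N, D) per-point features
    superpoint_ids: (N,) coarse region ID for each point
    text_feat:      (D,) embedding of the text query
    """
    # Coarse stage: score each region by the similarity between its
    # mean feature and the text embedding; keep the top-k regions.
    regions = np.unique(superpoint_ids)
    scores = {r: float(point_feats[superpoint_ids == r].mean(axis=0) @ text_feat)
              for r in regions}
    keep = sorted(scores, key=scores.get, reverse=True)[:top_k]

    # Fine stage: per-point scores, restricted to the selected regions.
    mask = np.zeros(len(point_feats), dtype=bool)
    candidates = np.isin(superpoint_ids, keep)
    point_scores = point_feats @ text_feat
    mask[candidates] = point_scores[candidates] > 0.0  # threshold is arbitrary
    return mask
```

Restricting the fine stage to a handful of coarse regions is what keeps the search tractable in expansive scenes.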
arXiv Detail & Related papers (2024-05-27T17:59:41Z)
- PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model [19.333506797686695]
We introduce a novel segmentation task known as reasoning part segmentation for 3D objects.
We output a segmentation mask based on complex and implicit textual queries about specific parts of a 3D object.
We propose a model that is capable of segmenting parts of 3D objects based on implicit textual queries and generating natural language explanations.
arXiv Detail & Related papers (2024-04-04T23:38:45Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
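One way to picture this captioning step: render or project views of the scene, caption each view with a pretrained vision-language model, and attach the resulting text to the 3D instances visible in that view. The sketch below is schematic; `caption_image` is a stand-in stub for any pretrained captioner, and this is not Lowis3D's actual pipeline:

```python
import numpy as np

def caption_image(image: np.ndarray) -> str:
    """Stub standing in for a pretrained vision-language captioner;
    a real pipeline would call such a model here."""
    raise NotImplementedError

def captions_per_instance(views, instance_masks):
    """Attach view-level captions to the 3D instances visible in each view.

    views:          list of (H, W, 3) rendered/projected images of the scene
    instance_masks: list of (H, W) integer arrays of per-pixel instance IDs
    Returns: dict mapping instance ID -> captions of views showing it.
    """
    captions: dict[int, list[str]] = {}
    for image, mask in zip(views, instance_masks):
        text = caption_image(image)
        for inst_id in np.unique(mask):
            if inst_id < 0:  # assume negative IDs mark background pixels
                continue
            captions.setdefault(int(inst_id), []).append(text)
    return captions
```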
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases [35.18565109770112]
The 3DPAG task aims to localize target objects in a 3D scene by explicitly identifying all phrase-related objects and then reasoning over the contextual phrases.
By tapping into our datasets, previous 3DVG methods can be extended to the fine-grained phrase-aware scenario.
Results confirm significant improvements, i.e., the previous state-of-the-art method achieves 3.9%, 3.5% and 4.6% overall accuracy gains on Nr3D, Sr3D and ScanRefer, respectively.
arXiv Detail & Related papers (2022-07-05T05:50:12Z)