ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
- URL: http://arxiv.org/abs/2503.23297v1
- Date: Sun, 30 Mar 2025 03:40:35 GMT
- Title: ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
- Authors: Zhenyang Liu, Yikai Wang, Sixiao Zheng, Tongying Pan, Longfei Liang, Yanwei Fu, Xiangyang Xue
- Abstract summary: Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions. Current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals. We propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping.
- Score: 68.4209681278336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when they are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous robotics. However, current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals, which limits their ability to handle diverse semantics and common knowledge required for effective reasoning. In this work, we propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLMs) and localizes occluded objects through 3D Gaussian splatting. By incorporating 2D segmentation masks from the Segment Anything Model (SAM) and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups based on object scale, enabling accurate localization through both explicit and implicit language understanding, even in novel, occluded views. We also contribute ReasoningGD, a new dataset containing over 10K scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder significantly improves 3D grounding accuracy in real-world scenarios.
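The scale-adaptive selection step described in the abstract can be illustrated with a short sketch: CLIP-aligned features are kept per Gaussian at several physical scales, and the Gaussian group whose features best match the (LVLM-interpreted) text query is returned. This is a minimal, hypothetical sketch, not the authors' implementation; the tensor layout and all names (`gaussian_feats`, `group_ids`, `text_emb`) are assumptions.

```python
# Hypothetical sketch of scale-adaptive Gaussian group selection (not the
# authors' code). Groupings are taken as given; in the full method they would
# come from hierarchical 3D Gaussian splatting and SAM-derived masks.
import torch
import torch.nn.functional as F

def select_gaussian_group(gaussian_feats: torch.Tensor,
                          text_emb: torch.Tensor,
                          group_ids: torch.Tensor) -> torch.Tensor:
    """
    gaussian_feats: (S, N, D) CLIP-aligned features for N Gaussians at S scales.
    text_emb:       (D,) CLIP embedding of the language query.
    group_ids:      (S, N) integer group label of each Gaussian at each scale.
    Returns a boolean mask (N,) over Gaussians for the best-matching group.
    """
    feats = F.normalize(gaussian_feats, dim=-1)            # cosine-ready features
    text = F.normalize(text_emb, dim=-1)
    relevancy = torch.einsum("snd,d->sn", feats, text)     # (S, N) cosine scores

    best_mask, best_score = None, -float("inf")
    for s in range(relevancy.shape[0]):                    # loop over scales
        for g in group_ids[s].unique():                    # loop over groups
            mask = group_ids[s] == g
            score = relevancy[s][mask].mean().item()       # group-level relevancy
            if score > best_score:
                best_mask, best_score = mask, score
    return best_mask
```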
Related papers
- Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding [31.40722103849691]
MPEC is a novel learning method for open-vocabulary 3D semantic segmentation.
It uses both 3D entity-language alignment and point-entity consistency across different point cloud views.
Our method achieves state-of-the-art results on ScanNet for open-vocabulary 3D semantic segmentation.
arXiv Detail & Related papers (2025-04-28T05:43:14Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive reasoning segmentation on 2D images. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- 3D-AffordanceLLM: Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds [81.14476072159049]
3D affordance detection is a challenging problem with broad applications in various robotic tasks. We reformulate the traditional affordance detection paradigm into the Instruction Reasoning Affordance Segmentation (IRAS) task. We propose 3D-ADLLM, a framework designed for reasoning affordance detection in open 3D scenes.
arXiv Detail & Related papers (2025-02-27T12:29:44Z)
- AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene.
Existing approaches commonly encounter a shortage of text-3D pairs available for training.
We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
- LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding [42.750252190275546]
LangSurf is a Language-Embedded Surface Field that aligns 3D language fields with the surfaces of objects. Our method is capable of segmenting objects in 3D space, thus boosting the effectiveness of our approach in instance recognition, removal, and editing.
arXiv Detail & Related papers (2024-12-23T15:12:20Z)
- 3D Spatial Understanding in MLLMs: Disambiguation and Evaluation [13.614206918726314]
We propose techniques to enhance the model's ability to localize and disambiguate target objects. Our approach achieves state-of-the-art performance on conventional metrics that evaluate sentence similarity.
arXiv Detail & Related papers (2024-12-09T16:04:32Z)
- 3D Scene Graph Guided Vision-Language Pre-training [11.131667398927394]
3D vision-language (VL) reasoning has gained significant attention due to its potential to bridge the 3D physical world with natural language descriptions. Existing approaches typically follow task-specific, highly specialized paradigms. This paper proposes a 3D scene graph-guided vision-language pre-training framework.
arXiv Detail & Related papers (2024-11-27T16:10:44Z)
- Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding [58.924180772480504]
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query.
We propose to leverage weakly supervised annotations to learn the 3D visual grounding model.
We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
arXiv Detail & Related papers (2023-07-18T13:49:49Z)
- Weakly Supervised 3D Open-vocabulary Segmentation [104.07740741126119]
We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner.
We distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF).
A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process (an illustrative sketch of this kind of 2D-to-3D feature distillation appears after this list).
arXiv Detail & Related papers (2023-05-23T14:16:49Z)
- Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation [30.429893959096752]
We develop a novel training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation.
We construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs.
Experiments show that the proposed approach achieves success rates of 68% and 66% on the validation unseen and test unseen splits of the R2R dataset.
arXiv Detail & Related papers (2022-01-26T07:43:47Z)
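Several entries above, notably the weakly supervised CLIP/DINO-to-NeRF work, rely on distilling 2D foundation-model features into a 3D field so that no manual 3D labels are needed. The following is a minimal, hypothetical sketch of such a distillation loss under assumed inputs; `rendered_feats` and `teacher_feats` are illustrative names, not any paper's API.

```python
# Hypothetical sketch of 2D-to-3D feature distillation: a feature head attached
# to a radiance field is supervised to reproduce per-pixel features of a frozen
# 2D teacher (e.g. CLIP or DINO) at sampled training-view pixels.
import torch
import torch.nn.functional as F

def distillation_loss(rendered_feats: torch.Tensor,
                      teacher_feats: torch.Tensor) -> torch.Tensor:
    """
    rendered_feats: (B, D) features rendered from the 3D field at sampled pixels.
    teacher_feats:  (B, D) frozen 2D teacher features at the same pixels.
    Returns the mean cosine distance between the two.
    """
    rendered = F.normalize(rendered_feats, dim=-1)
    teacher = F.normalize(teacher_feats, dim=-1)
    # Pulling rendered features toward the 2D teacher requires no manual 3D labels.
    return (1.0 - (rendered * teacher).sum(dim=-1)).mean()
```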
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.