ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
- URL: http://arxiv.org/abs/2407.01525v3
- Date: Wed, 17 Jul 2024 07:07:43 GMT
- Title: ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
- Authors: Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu
- Abstract summary: We propose a new task called 3D reasoning grounding and introduce a new benchmark ScanReason.
ScanReason provides over 10K question-answer-location pairs spanning five reasoning types that require the synergy of reasoning and grounding.
A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference.
- Score: 23.18281583681258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs spanning five reasoning types that require the synergy of reasoning and grounding. We further design our approach, ReGround3D, composed of a visual-centric reasoning module empowered by a Multi-modal Large Language Model (MLLM) and a 3D grounding module that obtains accurate object locations by looking back at the enhanced geometry and fine-grained details of the 3D scenes. A chain-of-grounding mechanism is proposed to further boost performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our approach.
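The interleaved inference described for the chain-of-grounding mechanism might be sketched as follows. This is a minimal illustrative loop, not the paper's implementation: `reason_step` and `ground_step` are hypothetical stand-ins for the MLLM reasoning module and the 3D grounding module.

```python
# Hedged sketch of a chain-of-grounding inference loop: reasoning and
# grounding alternate, with each grounding result fed back into reasoning.

def reason_step(question, scene, located):
    # Hypothetical: the reasoning module refines its hypothesis about the
    # target, conditioned on the objects grounded so far.
    return f"hypothesis for '{question}' given {len(located)} grounded objects"

def ground_step(hypothesis, scene):
    # Hypothetical: the grounding module returns a 3D box for the hypothesis.
    return {"hypothesis": hypothesis, "box": (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)}

def chain_of_grounding(question, scene, steps=3):
    located = []
    for _ in range(steps):
        hypothesis = reason_step(question, scene, located)
        located.append(ground_step(hypothesis, scene))
    return located  # the final entry holds the answer location

result = chain_of_grounding("Where can I warm up my lunch?", scene={}, steps=2)
```

The point of the loop is that grounding output re-enters the reasoning context, rather than reasoning finishing before a single grounding pass.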
Related papers
- SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning [25.006301640162846]
We introduce a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between stages.
Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning.
arXiv Detail & Related papers (2025-04-28T17:48:43Z)
- ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning [68.4209681278336]
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions.
Current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals.
We propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping.
arXiv Detail & Related papers (2025-03-30T03:40:35Z)
- AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene.
Existing approaches commonly face a shortage of text-3D pairs available for training.
We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
- Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention [12.203336176170982]
D-LISA is a two-stage approach incorporating three innovations.
First, a dynamic vision module that enables a variable and learnable number of box proposals.
Second, a dynamic camera positioning that extracts features for each proposal.
Third, a language-informed spatial attention module that better reasons over the proposals to output the final prediction.
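The idea of scoring box proposals with language-conditioned attention can be sketched as follows. This is an assumed scaled dot-product formulation, not D-LISA's exact module; the feature shapes and names are illustrative.

```python
import numpy as np

# Hedged sketch of language-informed attention over box proposals:
# each proposal feature is scored against a pooled language embedding,
# and a softmax over proposals yields the attention weights.

def spatial_attention(proposal_feats, lang_feat):
    # proposal_feats: (N, d) features of N box proposals
    # lang_feat: (d,) pooled language embedding
    scores = proposal_feats @ lang_feat / np.sqrt(lang_feat.shape[0])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights  # higher weight = proposal matches the query better

feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
lang = np.array([0.0, 1.0])
w = spatial_attention(feats, lang)
best = int(np.argmax(w))  # proposal 1 aligns best with the language query
```

The softmax turns raw similarity scores into a distribution over proposals, so the final prediction can simply take the highest-weight box.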
arXiv Detail & Related papers (2024-10-29T17:52:20Z)
- Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models [20.277479473218513]
We introduce a new task: zero-shot 3D reasoning for part search and localization within objects.
We design a simple baseline method, Reasoning3D, with the capability to understand and execute complex commands.
We show that Reasoning3D can effectively localize and highlight parts of 3D objects based on implicit textual queries.
arXiv Detail & Related papers (2024-05-29T17:56:07Z)
- Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model [108.35777542298224]
This paper introduces Reason3D, a novel large language model for comprehensive 3D understanding.
We propose a hierarchical mask decoder to locate small objects within expansive scenes.
Experiments validate that Reason3D achieves remarkable results on large-scale ScanNet and Matterport3D datasets.
arXiv Detail & Related papers (2024-05-27T17:59:41Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that a natural-language description refers to.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding [58.924180772480504]
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query.
We propose to leverage weakly supervised annotations to learn the 3D visual grounding model.
We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
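The coarse-to-fine matching idea can be sketched as a two-stage scoring scheme. This is a hypothetical illustration, not the paper's model: a cheap coarse score shortlists proposals, then a finer score on richer features re-ranks the shortlist.

```python
import numpy as np

# Hedged sketch of coarse-to-fine semantic matching between object
# proposals and a sentence query (feature names are illustrative).

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def coarse_to_fine_match(coarse_feats, fine_feats, sent_coarse, sent_fine, k=2):
    # Stage 1: shortlist the top-k proposals by coarse similarity.
    coarse_scores = [cosine(f, sent_coarse) for f in coarse_feats]
    shortlist = np.argsort(coarse_scores)[-k:]
    # Stage 2: re-rank the shortlist with finer-grained features.
    fine_scores = {int(i): cosine(fine_feats[i], sent_fine) for i in shortlist}
    return max(fine_scores, key=fine_scores.get)  # index of the best match

coarse = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
fine = np.array([[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]])
query = np.array([1.0, 0.0])
idx = coarse_to_fine_match(coarse, fine, query, query, k=2)
# proposals 0 and 1 pass the coarse stage; proposal 1 wins the fine stage
```

The design point is that the expensive fine comparison only runs on the few proposals that survive the coarse stage.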
arXiv Detail & Related papers (2023-07-18T13:49:49Z)
- 3D Concept Learning and Reasoning from Multi-View Images [96.3088005719963]
We introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA).
This dataset consists of approximately 5k scenes and 600k images, paired with 50k questions.
We propose a novel 3D concept learning and reasoning framework that seamlessly combines neural fields, 2D pre-trained vision-language models, and neural reasoning operators.
arXiv Detail & Related papers (2023-03-20T17:59:49Z)
- From 2D to 3D: Re-thinking Benchmarking of Monocular Depth Prediction [80.67873933010783]
We argue that MDP currently suffers from benchmark over-fitting and relies on metrics that are only partially helpful for gauging the usefulness of the predictions in 3D applications.
This limits the design and development of novel methods that are truly aware of, and improve toward estimating, the 3D structure of the scene rather than optimizing 2D-based distances.
We propose a set of metrics well suited to evaluate the 3D geometry of MDP approaches and a novel indoor benchmark, RIO-D3D, crucial for the proposed evaluation methodology.
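The shift from depth-only error to 3D-aware evaluation can be illustrated with a minimal sketch. This is an assumed pinhole back-projection metric, not RIO-D3D's exact definition; the intrinsics are made-up toy values.

```python
import numpy as np

# Hedged sketch: back-project predicted and ground-truth depth maps into
# 3D point maps via pinhole intrinsics, then measure per-pixel Euclidean
# error in 3D instead of a purely 2D depth difference.

def backproject(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # (H, W, 3) point map

def mean_3d_error(pred_depth, gt_depth, fx=1.0, fy=1.0, cx=1.0, cy=1.0):
    p = backproject(pred_depth, fx, fy, cx, cy)
    g = backproject(gt_depth, fx, fy, cx, cy)
    return float(np.linalg.norm(p - g, axis=-1).mean())

gt = np.full((2, 2), 2.0)
pred = np.full((2, 2), 2.5)  # a uniform 0.5 depth error
err = mean_3d_error(pred, gt)
# the 3D error exceeds 0.5 because lateral (x, y) offsets also grow with depth
```

A depth-only metric would report exactly 0.5 here; the 3D metric penalizes the lateral displacement of back-projected points as well, which is what matters for downstream 3D use.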
arXiv Detail & Related papers (2022-03-15T17:50:54Z)
- Improving 3D Object Detection with Channel-wise Transformer [58.668922561622466]
We propose a two-stage 3D object detection framework (CT3D) with minimal hand-crafted design.
CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation.
It achieves an AP of 81.77% in the moderate car category on the KITTI test 3D detection benchmark.
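The AP figure above rests on matching predicted and ground-truth boxes by 3D overlap. A simplified sketch of that overlap measure follows; note KITTI actually evaluates with oriented boxes, while this assumed version uses axis-aligned ones for clarity.

```python
# Hedged sketch of axis-aligned 3D IoU, the overlap measure underlying
# AP computation on 3D detection benchmarks (simplified: no box rotation).

def iou_3d(a, b):
    # boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        inter *= max(0.0, hi - lo)  # intersection extent per axis

    def vol(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0

iou = iou_3d((0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3))  # intersection 1, union 15
```

A detection counts as a true positive only when this IoU clears the benchmark's threshold (e.g. 0.7 for cars on KITTI), and AP is then accumulated over the ranked detections.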
arXiv Detail & Related papers (2021-08-23T02:03:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.