SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
- URL: http://arxiv.org/abs/2510.16714v2
- Date: Tue, 21 Oct 2025 07:24:30 GMT
- Title: SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
- Authors: Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu, Baoxiong Jia, Siyuan Huang,
- Abstract summary: This paper bridges the gap by presenting a novel framework for grounded question-answering in 3D scenes.<n>We first introduce a grounded Chain-of- Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems.<n>To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning.
- Score: 26.897741358707396
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mech- anism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of- Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.
Related papers
- PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning [82.55361351483005]
We present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data.<n>By leveraging a dual-stream multi-modal architecture, our method synergizes semantic appearance with geometric truth.
arXiv Detail & Related papers (2026-02-27T11:47:45Z) - AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models [20.05010202296243]
We introduce Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, its location, motion type, and motion axis.<n>We propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm.<n>AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.
arXiv Detail & Related papers (2025-11-13T06:43:00Z) - Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views [41.05815610513033]
3DThinker is a framework that exploits the rich geometric information embedded within images while reasoning, like humans do.<n>Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training.
arXiv Detail & Related papers (2025-10-21T13:36:58Z) - SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting [85.87902260102652]
We introduce the novel task of Sequential 3D Gaussian Affordance Reasoning.<n>We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks.<n>Our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.
arXiv Detail & Related papers (2025-07-31T17:56:55Z) - Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation [50.81551581148339]
We introduce Relevant Reasoning (R$2$S), a reasoning-based segmentation framework.<n>We also introduce 3D ReasonSeg, a reasoning-based segmentation dataset.<n>Both experiments demonstrate that the R$2$S and 3D ReasonSeg effectively endow 3D point cloud perception with stronger spatial reasoning capabilities.
arXiv Detail & Related papers (2025-06-29T06:58:08Z) - Multimodal 3D Reasoning Segmentation with Complex Scenes [92.92045550692765]
We propose a 3D reasoning segmentation task for reasoning segmentation with multiple objects in scenes.<n>The task allows producing 3D segmentation masks and detailed textual explanations as enriched by 3D spatial relations among objects.<n>In addition, we design MORE3D, a novel 3D reasoning network that works with queries of multiple objects.
arXiv Detail & Related papers (2024-11-21T08:22:45Z) - ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities [23.18281583681258]
We propose a new task called 3D reasoning grounding and introduce a new benchmark ScanReason.
ScanReason provides over 10K question-answer-location pairs from five reasoning types that require the synerization of reasoning and grounding.
A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference.
arXiv Detail & Related papers (2024-07-01T17:59:35Z) - Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly
Supervised 3D Visual Grounding [58.924180772480504]
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query.
We propose to leverage weakly supervised annotations to learn the 3D visual grounding model.
We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
arXiv Detail & Related papers (2023-07-18T13:49:49Z) - SQA3D: Situated Question Answering in 3D Scenes [86.0205305318308]
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D)
Given a scene context, SQA3D requires the tested agent to first understand its situation in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation.
Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations.
arXiv Detail & Related papers (2022-10-14T02:52:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.