Related papers: SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

URL: http://arxiv.org/abs/2510.16714v2
Date: Tue, 21 Oct 2025 07:24:30 GMT
Title: SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Authors: Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu, Baoxiong Jia, Siyuan Huang,
Abstract summary: This paper bridges the gap by presenting a novel framework for grounded question-answering in 3D scenes.<n>We first introduce a grounded Chain-of- Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems.<n>To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning.
Score: 26.897741358707396
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mech- anism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of- Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.

Related papers

PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning [82.55361351483005]
We present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data.<n>By leveraging a dual-stream multi-modal architecture, our method synergizes semantic appearance with geometric truth.
arXiv Detail & Related papers (2026-02-27T11:47:45Z)
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models [20.05010202296243]
We introduce Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, its location, motion type, and motion axis.<n>We propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm.<n>AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.
arXiv Detail & Related papers (2025-11-13T06:43:00Z)
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views [41.05815610513033]
3DThinker is a framework that exploits the rich geometric information embedded within images while reasoning, like humans do.<n>Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training.
arXiv Detail & Related papers (2025-10-21T13:36:58Z)
SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting [85.87902260102652]
We introduce the novel task of Sequential 3D Gaussian Affordance Reasoning.<n>We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks.<n>Our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.
arXiv Detail & Related papers (2025-07-31T17:56:55Z)
Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation [50.81551581148339]
We introduce Relevant Reasoning (R$2$S), a reasoning-based segmentation framework.<n>We also introduce 3D ReasonSeg, a reasoning-based segmentation dataset.<n>Both experiments demonstrate that the R$2$S and 3D ReasonSeg effectively endow 3D point cloud perception with stronger spatial reasoning capabilities.
arXiv Detail & Related papers (2025-06-29T06:58:08Z)
Multimodal 3D Reasoning Segmentation with Complex Scenes [92.92045550692765]
We propose a 3D reasoning segmentation task for reasoning segmentation with multiple objects in scenes.<n>The task allows producing 3D segmentation masks and detailed textual explanations as enriched by 3D spatial relations among objects.<n>In addition, we design MORE3D, a novel 3D reasoning network that works with queries of multiple objects.
arXiv Detail & Related papers (2024-11-21T08:22:45Z)
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities [23.18281583681258]
We propose a new task called 3D reasoning grounding and introduce a new benchmark ScanReason. ScanReason provides over 10K question-answer-location pairs from five reasoning types that require the synerization of reasoning and grounding. A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference.
arXiv Detail & Related papers (2024-07-01T17:59:35Z)
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding [58.924180772480504]
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query. We propose to leverage weakly supervised annotations to learn the 3D visual grounding model. We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
arXiv Detail & Related papers (2023-07-18T13:49:49Z)
SQA3D: Situated Question Answering in 3D Scenes [86.0205305318308]
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D) Given a scene context, SQA3D requires the tested agent to first understand its situation in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations.
arXiv Detail & Related papers (2022-10-14T02:52:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.