Fugu-MT Paper Translation (Abstract): SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

Paper overview: SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

arxiv url: http://arxiv.org/abs/2510.16714v2
Date: Tue, 21 Oct 2025 07:24:30 GMT
Status: Translation complete
System update date: 2025-10-25 03:08:11.846575
Title: SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Title (reference translation): SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Authors: Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu, Baoxiong Jia, Siyuan Huang
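Abstract summary: This paper bridges the gap by presenting a novel framework for grounded question answering in 3D scenes. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT) that decouples a complex reasoning task into simpler, manageable problems. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning.
Reference score (independently computed attention score): 26.897741358707396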
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.
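Abstract (reference translation): Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question answering, mainly because the mechanism of human-like scene-object grounded reasoning remains under-explored. This paper bridges the gap by presenting a novel framework. First, in 3D scenes, a complex reasoning task is decoupled into simpler, manageable problems, and visual clues are built on multimodal expert modules. SCENECOT-185K is the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks show that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.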

Related papers