Fugu-MT paper translation (abstract): Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views

Paper summary: Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views

arxiv url: http://arxiv.org/abs/2510.18632v1
Date: Tue, 21 Oct 2025 13:36:58 GMT
Status: translation complete
Last updated in system: 2025-10-25 03:08:13.628639
Title: Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Title (reference translation): Thinking in 3D: Geometric Imagination from Limited Views
Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang
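Abstract summary: 3DThinker is a framework that exploits the rich geometric information embedded in images while reasoning, as humans do. The framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training.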
Reference score (independently computed attention score): 41.05815610513033
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code will be available at https://github.com/zhangquanchen/3DThinker.
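Abstract (reference translation): Recent advances in vision-language models (VLMs) have brought remarkable progress on multimodal tasks, but understanding 3D spatial relationships from limited views remains a major challenge. Conventional reasoning methods usually rely on pure text (such as topological cognitive maps) or on 2D visual cues. Their limited representational capacity, however, hinders performance on specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that effectively exploits the rich geometric information embedded in images while reasoning, as humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, training consists of two stages. First, we perform supervised training to align the 3D latent generated by the VLM during reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory based solely on outcome signals, refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective on unifying 3D representations into multimodal reasoning. Our code will be available at https://github.com/zhangquanchen/3DThinker.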
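
As a rough illustration of the first training stage described in the abstract (aligning the 3D latent produced by the VLM with that of a 3D foundation model), the following minimal sketch computes a cosine-distance alignment loss between two sets of feature vectors. The function names, the choice of cosine distance, and the list-of-vectors representation are all assumptions made for illustration; the abstract does not specify the loss that 3DThinker actually uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(vlm_latents, target_features):
    """Mean cosine distance between paired latent vectors.

    vlm_latents:     hypothetical 3D latents emitted by the VLM while reasoning.
    target_features: hypothetical features from a frozen 3D foundation model.
    Both are lists of equal-length float vectors; this pairing is an assumption.
    """
    if len(vlm_latents) != len(target_features):
        raise ValueError("both inputs must contain the same number of vectors")
    if not vlm_latents:
        raise ValueError("inputs must not be empty")
    total = sum(
        1.0 - cosine_similarity(u, v)
        for u, v in zip(vlm_latents, target_features)
    )
    return total / len(vlm_latents)
```

Under this sketch, the loss is 0 when every VLM latent points in the same direction as its target feature and grows toward 2 as they point in opposite directions. The second stage (optimizing the reasoning trajectory from outcome signals) is not sketched here because the abstract gives no details about it.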

Related papers