Fugu-MT Paper Translation (Abstract): Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

Paper summary: Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

arxiv url: http://arxiv.org/abs/2606.23557v1
Date: Mon, 22 Jun 2026 16:28:11 GMT
Status: Translation complete
Last updated in system: 2026-06-24 18:13:12.912277
Title: Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views
Title (reference translation): Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views
Authors: Jiho Choi, Seonho Lee, Seojeong Park, Hyunjung Shim
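Abstract summary: This paper presents DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework. The proposed method decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make the intermediate steps learnable without manual annotations, it introduces two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets, and a local trajectory reward that supervises ordered viewpoint selection.
Reference score (independently computed attention level): 38.3893077130601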
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
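Abstract (reference translation): Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often leads to inconsistent cross-view reasoning and unstable view selection. This paper presents DR-MV3D (Dense Reward for MV3D-VQA). The approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make the intermediate steps learnable without manual annotations, it introduces two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. The full pipeline is optimized with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.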
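
Illustrative sketch (not from the paper): the abstract says the pipeline is trained with GRPO using an answer-level signal plus two dense rewards, but it does not give the reward formulas or how they are weighted. The Python below is a minimal sketch under stated assumptions. The function names, the weights `w_map` and `w_traj`, the additive combination, and the example scores are all hypothetical. The only standard piece is the group-relative advantage, which normalizes each rollout's reward by the mean and standard deviation of its sampling group (the population standard deviation is used here as one possible choice).

```python
from statistics import mean, pstdev


def combined_reward(answer_correct, map_score, trajectory_score,
                    w_map=0.5, w_traj=0.5):
    """Hypothetical combination: sparse answer reward plus two dense terms.

    map_score and trajectory_score stand in for the global consistency
    reward and the local trajectory reward; both are assumed to lie in [0, 1].
    """
    return float(answer_correct) + w_map * map_score + w_traj * trajectory_score


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four rollouts sampled for the same question (made-up scores):
# (answer_correct, map_score, trajectory_score)
rollouts = [
    (True, 0.8, 1.0),
    (True, 0.2, 0.5),
    (False, 0.6, 0.5),
    (False, 0.1, 0.0),
]

rewards = [combined_reward(*r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

In this toy group, the two correct rollouts receive different advantages because their intermediate map and trajectory scores differ, which is the kind of process-level signal that answer-only supervision would not provide.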
