Fugu-MT Paper Translation (Summary): Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

Paper summary: Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

arxiv url: http://arxiv.org/abs/2606.01247v1
Date: Sun, 31 May 2026 14:00:10 GMT
Status: Translation complete
System update date: 2026-06-02 21:34:29.473045
Title: Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?
Title (reference translation): Can foundation models reach a target viewpoint through active exploration?
Authors: Liyang Li, Muzhi Zhu, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Hao Chen, Chunhua Shen
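Abstract summary: This paper introduces Target Viewpoint Reproduction (TVR), an active task in which an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image. On the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success. We build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO.
Reference score (independently computed attention score): 44.119113981225404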
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) -- an active task where an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image -- and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study reducing this gap, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.
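Abstract (reference translation): Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR), an active task in which an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image. On the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation (physically moving to a new position) rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To reduce this gap, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.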
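
The closed-loop structure of the TVR task (observe, act, repeat until the view matches the target) can be illustrated with a toy 2D sketch. Everything below is an assumption made for illustration and is not the TVRBench API: the `Pose` type, the action names, the step sizes, the pose-based success tolerances, and the `oracle_policy` helper are all hypothetical. In particular, the oracle reads the target pose directly, whereas an agent in the actual task only sees images. The abstract does not state the benchmark's success criterion.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    x: float
    y: float
    yaw_deg: float  # heading, counter-clockwise from the +x axis


STEP_M = 0.25    # assumed forward step size (hypothetical)
TURN_DEG = 15.0  # assumed rotation step (hypothetical)


def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def apply_action(pose, action):
    """Return the pose after one discrete action."""
    if action == "turn_left":
        return Pose(pose.x, pose.y, wrap_deg(pose.yaw_deg + TURN_DEG))
    if action == "turn_right":
        return Pose(pose.x, pose.y, wrap_deg(pose.yaw_deg - TURN_DEG))
    if action == "move_forward":
        rad = math.radians(pose.yaw_deg)
        return Pose(pose.x + STEP_M * math.cos(rad),
                    pose.y + STEP_M * math.sin(rad),
                    pose.yaw_deg)
    raise ValueError(f"unknown action: {action}")


def is_success(pose, target, pos_tol=0.2, yaw_tol=10.0):
    """Assumed criterion: position and heading both within tolerance."""
    dist = math.hypot(pose.x - target.x, pose.y - target.y)
    yaw_err = abs(wrap_deg(pose.yaw_deg - target.yaw_deg))
    return dist <= pos_tol and yaw_err <= yaw_tol


def oracle_policy(pose, target, pos_tol=0.2):
    """Privileged stand-in for a model: translate first, then rotate in place."""
    dx, dy = target.x - pose.x, target.y - pose.y
    far = math.hypot(dx, dy) > pos_tol
    # Face the target position while far away, else face the target heading.
    desired = math.degrees(math.atan2(dy, dx)) if far else target.yaw_deg
    err = wrap_deg(desired - pose.yaw_deg)
    if abs(err) > TURN_DEG / 2:
        return "turn_left" if err > 0 else "turn_right"
    return "move_forward" if far else "stop"


def run_episode(start, target, max_steps=100):
    """Run the closed loop; return (success, number of actions taken)."""
    pose = start
    for step in range(max_steps):
        action = oracle_policy(pose, target)
        if action == "stop":
            return is_success(pose, target), step
        pose = apply_action(pose, action)
    return is_success(pose, target), max_steps


if __name__ == "__main__":
    # One episode needing translation plus rotation, one needing rotation only.
    print(run_episode(Pose(0.0, 0.0, 0.0), Pose(1.0, 0.0, 90.0)))
    print(run_episode(Pose(0.0, 0.0, 0.0), Pose(0.0, 0.0, 45.0)))
```

The two example episodes mirror the distinction drawn in the abstract between targets that require body translation and targets reachable by in-place rotation alone; in this toy setting the oracle handles both, which is exactly what the evaluated models reportedly struggle to do from images.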

Related papers