Fugu-MT Paper Translation (Abstract): SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

Paper overview: SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

arxiv url: http://arxiv.org/abs/2603.11563v1
Date: Thu, 12 Mar 2026 05:35:29 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.912421
Title: SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning
Authors: Yuyuan Yang, Junkun Hong, Hongrong Wang, Honghao Cai, Xunpeng Ren, Ge Wang, Mingcong Lei, Shenhao Yan, Jiahao Yang, Chengsi Yao, Xi Li, Yiming Zhao, Yatong Han, Jinke Ren
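Abstract summary: We propose Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature, which optimizes only the preference gap between winning and losing trajectories.
Reference score (independently computed attention metric): 21.113678610046453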
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Embodied task planning requires vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature. Optimizing only the preference gap between winning and losing trajectories, while neglecting absolute likelihood constraints on the optimal path, often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source models (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.
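The "purely relative" limitation of standard DPO described in the abstract can be illustrated with a minimal sketch. The first function is the standard per-pair DPO loss; the second adds a negative log-likelihood term on the winning trajectory. That second function is an assumption made for illustration only: the abstract does not give the Bias-DPO formula, and the names `anchored_dpo_loss`, `alpha`, and all numeric values below are hypothetical. The sketch also does not model the penalty on overconfident hallucinations.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winning, losing) trajectory pair.

    Inputs are sequence log-probabilities under the policy (logp_*)
    and under the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) computed in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def anchored_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      beta=0.1, alpha=1.0):
    """Hypothetical variant, NOT the paper's Bias-DPO objective.

    Adds an absolute likelihood term on the winning trajectory so the
    loss no longer depends on the preference gap alone.
    """
    relative = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    return relative + alpha * (-logp_w)


# Illustrative numbers (assumed): policy B assigns far lower likelihood
# to BOTH trajectories than policy A, but keeps the same gap.
ref = {"ref_logp_w": -3.0, "ref_logp_l": -4.0}
policy_a = {"logp_w": -2.0, "logp_l": -5.0}
policy_b = {"logp_w": -12.0, "logp_l": -15.0}

loss_a = dpo_loss(**policy_a, **ref)
loss_b = dpo_loss(**policy_b, **ref)
anchored_a = anchored_dpo_loss(**policy_a, **ref)
anchored_b = anchored_dpo_loss(**policy_b, **ref)

print(loss_a == loss_b)         # standard DPO cannot tell A from B
print(anchored_b > anchored_a)  # the anchored variant penalizes B
```

Under these assumptions, standard DPO assigns both policies the same loss even though policy B has drifted away from the expert trajectory, which is the kind of drift an absolute likelihood term is meant to discourage.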

Related papers