Fugu-MT Paper Translation (Abstract): DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

Paper overview: DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

arxiv url: http://arxiv.org/abs/2603.16860v1
Date: Tue, 17 Mar 2026 17:59:00 GMT
Status: Translation complete
System update date: 2026-03-21 18:33:56.923875
Title: DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models
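Title (reference translation): DreamPlan: Efficient Reinforcement of Vision-Language Planners via Video World Models
Authors: Emily Yue-Ting Jia, Weiduo Yuan, Tianheng Shi, Vitor Guizilini, Jiageng Mao, Yue Wang
Abstract summary: We introduce DreamPlan, a framework for the reinforcement fine-tuning of Vision-Language Models (VLMs). Instead of relying on costly physical rollouts, DreamPlan first uses the zero-shot VLM to collect interaction data. By using these virtual rollouts, physical and task-specific knowledge is efficiently injected into the VLM.
Reference score (independently calculated attention score): 17.14390355735799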
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine-tuning VLMs via real-world interaction is prohibitively expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero-shot VLM to collect exploratory interaction data. We demonstrate that this sub-optimal data is sufficient to train an action-conditioned video generation model, which implicitly captures complex real-world physics. Subsequently, the VLM planner is fine-tuned entirely within the "imagination" of this video world model using Odds Ratio Policy Optimization (ORPO). By utilizing these virtual rollouts, physical and task-specific knowledge is efficiently injected into the VLM. Our results indicate that DreamPlan bridges the gap between semantic reasoning and physical grounding, significantly improving manipulation success rates without the need for large-scale real-world data collection. Our project page is https://psi-lab.ai/DreamPlan/.
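Abstract (reference translation): Robotic manipulation requires sophisticated commonsense reasoning, a capability that large-scale Vision-Language Models (VLMs) naturally possess. VLMs are promising as zero-shot planners, but their lack of grounded physical understanding often leads to compounding errors and low success rates in complex real-world environments, particularly on difficult tasks such as deformable object manipulation. Reinforcement Learning (RL) can adapt these planners to specific task dynamics, but fine-tuning VLMs directly through real-world interaction is expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a new framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first uses the zero-shot VLM to collect exploratory interaction data. We show that this sub-optimal data is sufficient to train an action-conditioned video generation model that implicitly captures complex real-world physics. The VLM planner is then fine-tuned entirely within the "imagination" of this video world model using Odds Ratio Policy Optimization (ORPO). By using these virtual rollouts, physical and task-specific knowledge is efficiently injected into the VLM. Our results indicate that DreamPlan bridges the gap between semantic reasoning and physical grounding, significantly improving manipulation success rates without large-scale real-world data collection. Our project page is https://psi-lab.ai/DreamPlan/.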
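
The abstract names ORPO as the fine-tuning objective but gives no formula. The sketch below is a minimal illustration, assuming ORPO here refers to the odds-ratio objective published elsewhere as Odds Ratio Preference Optimization: a supervised term on the preferred output plus a weighted odds-ratio penalty against the dispreferred one. The function names, the weight `lam`, and the idea that "chosen" and "rejected" plans are ranked by their imagined rollouts are illustrative assumptions and are not taken from the paper.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp).

    avg_logp is assumed to be the length-normalised log-likelihood of a
    plan under the VLM, so it must be strictly negative.
    """
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """Hypothetical ORPO-style loss for one (chosen, rejected) plan pair."""
    # Supervised term: negative log-likelihood of the preferred plan.
    nll = -avg_logp_chosen
    # Log odds ratio between the preferred and dispreferred plan.
    log_or = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(log_or)), written as log(1 + exp(-log_or)).
    or_term = math.log1p(math.exp(-log_or))
    return nll + lam * or_term


if __name__ == "__main__":
    # The loss is lower when the preferred plan is the more likely one.
    print(orpo_loss(-0.5, -2.0))
    print(orpo_loss(-2.0, -0.5))
```

In this form the loss falls as the model assigns more probability to the preferred plan relative to the rejected one, and no separate reference model is needed.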

Related papers