Fugu-MT Paper Translation (Abstract): VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents

Paper overview: VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents

arxiv url: http://arxiv.org/abs/2603.22892v1
Date: Tue, 24 Mar 2026 07:40:30 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.361937
Title: VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents
Title (reference translation): VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents
Authors: Pengsen Liu, Maosen Zeng, Nan Tang, Kaiyuan Li, Jing-Cheng Pang, Yunan Liu, Yang Yu
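Abstract summary: Combining Large Language Models (LLMs) with Reinforcement Learning (RL) enables agents to interpret language instructions more effectively for task execution. This paper proposes Visual-Language Knowledge-Guided Offline Reinforcement Learning (VLGOR), a framework that integrates visual and language knowledge to generate imaginary rollouts. Experiments on robotic manipulation benchmarks show that VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving a success rate over 24% higher than the baseline methods.
Reference score (independently computed attention score): 14.848584432075285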
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Combining Large Language Models (LLMs) with Reinforcement Learning (RL) enables agents to interpret language instructions more effectively for task execution. However, LLMs typically lack direct perception of the physical environment, which limits their understanding of environmental dynamics and their ability to generalize to unseen tasks. To address this limitation, we propose Visual-Language Knowledge-Guided Offline Reinforcement Learning (VLGOR), a framework that integrates visual and language knowledge to generate imaginary rollouts, thereby enriching the interaction data. The core premise of VLGOR is to fine-tune a vision-language model to predict future states and actions conditioned on an initial visual observation and high-level instructions, ensuring that the generated rollouts remain temporally coherent and spatially plausible. Furthermore, we employ counterfactual prompts to produce more diverse rollouts for offline RL training, enabling the agent to acquire knowledge that facilitates following language instructions while grounding in environments based on visual cues. Experiments on robotic manipulation benchmarks demonstrate that VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving a success rate over 24% higher than the baseline methods.
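Abstract (reference translation): Combining large language models (LLMs) with reinforcement learning (RL) allows agents to interpret language instructions more effectively for task execution. However, LLMs generally lack direct perception of the physical environment, which limits their understanding of environmental dynamics and their ability to generalize to unseen tasks. To address this limitation, the paper proposes Visual-Language Knowledge-Guided Offline Reinforcement Learning (VLGOR), a framework that integrates visual and language knowledge to generate imaginary rollouts and thereby enrich the interaction data. The core premise of VLGOR is to fine-tune a vision-language model to predict future states and actions conditioned on an initial visual observation and high-level instructions, so that the generated rollouts remain temporally coherent and spatially plausible. In addition, counterfactual prompts are used to produce more diverse rollouts for offline RL training, enabling the agent to acquire knowledge that helps it follow language instructions while grounding itself in the environment through visual cues. Experiments on robotic manipulation benchmarks show that VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving a success rate over 24% higher than the baseline methods.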

Related papers