Fugu-MT Paper Translation (Abstract): Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

Paper overview: Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

arxiv url: http://arxiv.org/abs/2602.13193v2
Date: Mon, 02 Mar 2026 23:14:20 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.445232
Title: Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
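Authors: William Chen, Jagdeep Singh Bhatia, Catherine Glossop, Nikhil Mathihalli, Ria Doshi, Andy Tang, Danny Driess, Karl Pertsch, Sergey Levine
Abstract summary: Steerable Policies are VLAs trained on rich synthetic commands at various levels of abstraction, such as subtasks, motions, and grounded pixel coordinates. The authors demonstrate the benefit by controlling Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning.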
Reference score (independently computed attention score): 46.169163284648384
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io
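To make the interface described in the abstract more concrete, the following is a minimal sketch of a hierarchical loop in which a high-level reasoner emits a command at one of several abstraction levels and a low-level policy consumes it. This is not the authors' implementation. The `Level` enum, the `Command` type, the text format produced by `render_command`, and the `run_hierarchy` loop are all hypothetical, chosen only to illustrate the abstract's mention of subtask, motion, and grounded pixel-coordinate commands.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, List, Optional, Tuple


class Level(Enum):
    # Abstraction levels named in the abstract; this enum is a hypothetical encoding.
    SUBTASK = "subtask"
    MOTION = "motion"
    PIXEL = "pixel"


@dataclass(frozen=True)
class Command:
    level: Level
    text: str
    # (x, y) image coordinates; only meaningful for PIXEL commands.
    pixel: Optional[Tuple[int, int]] = None


def render_command(cmd: Command) -> str:
    """Serialize a command into a text prompt (assumed format, not from the paper)."""
    if cmd.level is Level.PIXEL:
        if cmd.pixel is None:
            raise ValueError("PIXEL commands need a pixel coordinate")
        x, y = cmd.pixel
        return f"[{cmd.level.value}] {cmd.text} at ({x}, {y})"
    return f"[{cmd.level.value}] {cmd.text}"


def run_hierarchy(
    reasoner: Callable[[Any, str], Command],
    policy: Callable[[Any, str], Any],
    observation: Any,
    task: str,
    steps: int,
) -> List[str]:
    """Alternate between high-level command selection and low-level execution.

    `reasoner` stands in for a learned embodied reasoner or a prompted VLM;
    `policy` stands in for a VLA that returns the next observation.
    Both are placeholders supplied by the caller.
    """
    prompts: List[str] = []
    for _ in range(steps):
        cmd = reasoner(observation, task)
        prompt = render_command(cmd)
        observation = policy(observation, prompt)
        prompts.append(prompt)
    return prompts
```

In this sketch the reasoner is free to choose a different abstraction level at every step, which is one way to read the abstract's claim that richer commands give the high-level model more control over low-level behavior than a single task instruction would.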


Related papers