Fugu-MT Paper Translation (Abstract): Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation

Paper summary: Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation

arxiv url: http://arxiv.org/abs/2603.12553v1
Date: Fri, 13 Mar 2026 01:33:48 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:11.833092
Title: Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation
Title (reference translation): World Models as Structured Planners for Robotic Manipulation
Authors: Minghao Jin, Mozheng Liao, Mingfei Han, Zhihui Li, Xiaojun Chang,
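Abstract summary: We propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. We implement this approach through a two-stage training paradigm with a unified discrete token vocabulary. In our experiments, StructVLA achieves strong average success rates of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO.
Reference score (independently computed attention score): 43.5447478385855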
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent world-model-based Vision-Language-Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, dense future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high-level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, weakening the alignment between planning and low-level execution. To address this, we propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture spatiotemporal milestones closely aligned with task progress. We implement this approach through a two-stage training paradigm with a unified discrete token vocabulary: the world model is first trained to predict structured frames and subsequently optimized to map the structured foresight into low-level actions. This approach provides clear physical guidance and bridges visual planning and motion control. In our experiments, StructVLA achieves strong average success rates of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO. Real-world deployments further demonstrate reliable task completion and robust generalization across both basic pick-and-place and complex long-horizon tasks.
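Abstract (reference translation): Recent Vision-Language-Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high-level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, which weakens the alignment between planning and low-level execution. We therefore propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture spatiotemporal milestones closely aligned with task progress. The world model is first trained to predict structured frames and is subsequently optimized to map the structured foresight into low-level actions. This approach provides clear physical guidance and bridges visual planning and motion control. In our experiments, StructVLA achieves strong average success rates of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO. Real-world deployments show reliable task completion and robust generalization across both basic pick-and-place tasks and complex long-horizon tasks.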
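
The abstract says structured frames are derived from kinematic cues such as gripper transitions and kinematic turning points, but it does not give the selection rule. The following is only a minimal sketch of how such frames might be picked from a recorded trajectory, not the authors' procedure. The function name `select_structured_frames`, the boolean gripper signal, the 45-degree direction-change threshold, and the `min_step` cutoff are all assumptions made for illustration.

```python
import math


def select_structured_frames(positions, gripper_open,
                             angle_threshold_deg=45.0, min_step=1e-6):
    """Return sorted frame indices at gripper transitions and turning points.

    Hypothetical helper, not taken from the paper.
    positions: sequence of end-effector (x, y, z) tuples, one per frame.
    gripper_open: sequence of booleans, one per frame.
    """
    n = len(positions)
    if n != len(gripper_open):
        raise ValueError("positions and gripper_open must have equal length")

    keyframes = set()

    # Cue 1 (assumed form): the gripper state differs from the previous frame.
    for t in range(1, n):
        if gripper_open[t] != gripper_open[t - 1]:
            keyframes.add(t)

    # Cue 2 (assumed form): the direction of motion changes sharply at frame t.
    for t in range(1, n - 1):
        v_in = [b - a for a, b in zip(positions[t - 1], positions[t])]
        v_out = [b - a for a, b in zip(positions[t], positions[t + 1])]
        norm_in = math.sqrt(sum(c * c for c in v_in))
        norm_out = math.sqrt(sum(c * c for c in v_out))
        # Skip near-stationary steps, where direction is undefined.
        if norm_in < min_step or norm_out < min_step:
            continue
        cos_angle = sum(a * b for a, b in zip(v_in, v_out)) / (norm_in * norm_out)
        cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
        if math.degrees(math.acos(cos_angle)) > angle_threshold_deg:
            keyframes.add(t)

    return sorted(keyframes)


# Toy trajectory: move along x, turn 90 degrees onto y, then close the gripper.
positions = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 2, 0)]
gripper_open = [True, True, True, True, False]
print(select_structured_frames(positions, gripper_open))  # [2, 4]
```

In this toy trajectory, frame 2 is selected because the motion direction turns by 90 degrees there, and frame 4 because the gripper closes. A real pipeline would likely need smoothing and a continuous gripper signal, which this sketch leaves out.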

Related papers