Fugu-MT Paper Translation (Abstract): From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

Paper overview: From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

arxiv url: http://arxiv.org/abs/2605.12167v1
Date: Tue, 12 May 2026 14:15:16 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.910691
Title: From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
Title (reference translation): From Imagination to Executable Actions: Mixture of Latent Actions in Robot Manipulation
Authors: Yajie Li, Bozhou Zhang, Chun Gu, Zipei Ma, Jiahui Zhang, Jiankang Deng, Xiatian Zhu, Li Zhang
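Abstract summary: We propose MoLA, a control-oriented interface that transforms imagined future videos into executable representations. We evaluate our approach on simulated benchmarks and real-world robot manipulation tasks.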
Reference score (independently computed attention score): 88.39072412680633
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.
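Abstract (reference translation): Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains difficult. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both of which suffer from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity over the action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA uses a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by the generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.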
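
The data flow described in the abstract (imagined frames, then several modality-specific inverse dynamics models, then a combined latent action per transition) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `make_toy_idm` and `mixture_of_latent_actions` are hypothetical, the toy models are scaled frame differences rather than pretrained networks, and combining the per-modality outputs by concatenation is a guess, since the abstract does not say how the mixture is formed.

```python
from typing import Callable, Dict, List, Sequence

Frame = Sequence[float]      # stand-in for one imagined frame (flattened features)
LatentAction = List[float]
InverseDynamics = Callable[[Frame, Frame], LatentAction]


def make_toy_idm(scale: float, dim: int) -> InverseDynamics:
    """Toy inverse dynamics model: a scaled frame difference, truncated to `dim`."""
    def idm(prev: Frame, nxt: Frame) -> LatentAction:
        return [scale * (b - a) for a, b in zip(prev, nxt)][:dim]
    return idm


def mixture_of_latent_actions(
    frames: Sequence[Frame], idms: Dict[str, InverseDynamics]
) -> List[LatentAction]:
    """For each consecutive frame pair, concatenate the latent actions of every model."""
    mixture = []
    for prev, nxt in zip(frames, frames[1:]):
        step: LatentAction = []
        for name in sorted(idms):  # fixed order keeps the vector layout stable
            step.extend(idms[name](prev, nxt))
        mixture.append(step)
    return mixture


# Hypothetical modality-specific models; real ones would be pretrained networks.
idms = {
    "semantic": make_toy_idm(1.0, dim=2),
    "depth": make_toy_idm(0.5, dim=2),
    "flow": make_toy_idm(2.0, dim=2),
}
imagined_video = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [2.0, 2.0, 3.0]]
latents = mixture_of_latent_actions(imagined_video, idms)
```

In this sketch a downstream policy would consume `latents` (one vector per imagined transition) in place of the raw predicted frames, which is the interface role the abstract attributes to MoLA.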

Related papers