Fugu-MT Paper Translation (Abstract): Latent Action Pretraining Through World Modeling

Paper abstract: Latent Action Pretraining Through World Modeling

arxiv url: http://arxiv.org/abs/2509.18428v1
Date: Mon, 22 Sep 2025 21:19:10 GMT
Status: Translation complete
Last updated in system: 2025-09-24 20:41:27.593827
Title: Latent Action Pretraining Through World Modeling
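Title (reference translation): Latent Action Pretraining Through World Modeling
Authors: Bahey Tharwat, Yara Nasser, Ali Abouzeid, Ian Reid
Abstract summary: We propose LAWM, a model-agnostic framework for pretraining imitation learning models in a self-supervised way. The framework is designed to transfer effectively across tasks, environments, and embodiments.
Reference score (independently computed attention score): 1.988007188564225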
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have gained popularity for learning robotic manipulation tasks that follow language instructions. State-of-the-art VLAs, such as OpenVLA and $\pi_{0}$, were trained on large-scale, manually labeled action datasets collected through teleoperation. More recent approaches, including LAPA and villa-X, introduce latent action representations that enable unsupervised pretraining on unlabeled datasets by modeling abstract visual changes between frames. Although these methods have shown strong results, their large model sizes make deployment in real-world settings challenging. In this work, we propose LAWM, a model-agnostic framework to pretrain imitation learning models in a self-supervised way, by learning latent action representations from unlabeled video data through world modeling. These videos can be sourced from robot recordings or videos of humans performing actions with everyday objects. Our framework is designed to be effective for transferring across tasks, environments, and embodiments. It outperforms models trained with ground-truth robotics actions and similar pretraining methods on the LIBERO benchmark and real-world setup, while being significantly more efficient and practical for real-world settings.
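Abstract (reference translation): Vision-Language-Action (VLA) models have become popular for learning robotic manipulation tasks that follow language instructions. State-of-the-art VLAs such as OpenVLA and $\pi_{0}$ were trained on large-scale, manually labeled action datasets collected through teleoperation. More recent approaches, such as LAPA and villa-X, introduce latent action representations that enable unsupervised pretraining on unlabeled datasets by modeling abstract visual changes between frames. These methods have shown strong results, but their large model sizes make deployment in real-world settings difficult. This work proposes LAWM, a model-agnostic framework that pretrains imitation learning models in a self-supervised way by learning latent action representations from unlabeled video data through world modeling. The videos can come from robot recordings or from footage of humans performing actions with everyday objects. The framework is designed to transfer effectively across tasks, environments, and embodiments. On the LIBERO benchmark and in a real-world setup, it outperforms both models trained with ground-truth robot actions and similar pretraining methods, while being significantly more efficient and practical for real-world use.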

Related papers