Fugu-MT Paper Translation (Summary): Physical Autoregressive Model for Robotic Manipulation without Action Pretraining

Paper summary: Physical Autoregressive Model for Robotic Manipulation without Action Pretraining

arxiv url: http://arxiv.org/abs/2508.09822v1
Date: Wed, 13 Aug 2025 13:54:51 GMT
Status: Translation complete
Last updated in system: 2025-08-14 20:42:00.915642
Title: Physical Autoregressive Model for Robotic Manipulation without Action Pretraining
Title (reference translation): Physical autoregressive model for robot manipulation without action pretraining
Authors: Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, Guangrun Wang
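Abstract summary: We build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR). PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task.
Reference score (independently computed attention metric): 62.045786177492495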
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.
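Abstract (reference translation): The scarcity of manipulation data has motivated the use of large pretrained models from other modalities in robotics. In this work, we build on autoregressive video generation models and propose a Physical Autoregressive Model (PAR), in which physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer that models frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. In addition, it incorporates a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings point to a promising direction for robotic manipulation through transferring world knowledge from autoregressive video pretraining.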

Related papers