Fugu-MT paper translation (abstract): Action-Effect Memory Pretraining for Robot Manipulation

Paper overview: Action-Effect Memory Pretraining for Robot Manipulation

arxiv url: http://arxiv.org/abs/2606.12499v1
Date: Wed, 10 Jun 2026 13:58:14 GMT
Status: Translation complete
System update date: 2026-06-12 15:55:27.370743
Title: Action-Effect Memory Pretraining for Robot Manipulation
Title (reference translation): Action-Effect Memory Pretraining for Robot Manipulation
Authors: Yijing Zhou, Qiwei Liang, Sitong Zhuang, Jiaxi Li, Xianpeng Wang, Boyang Cai, Yunyang Mo, Renjing Xu
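Abstract summary: This paper introduces AEM, an Action-Effect Memory pretraining framework for robot manipulation. AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM consistently improves manipulation performance in both simulation and real-world settings.
Reference score (site-computed attention score): 14.760244346330694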
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.
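Abstract (reference translation): This paper describes AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike conventional robot representation pretraining methods, which focus mainly on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories. The Mamba-encoded output of the final vision token is used as a compact history representation that serves as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. AEM is evaluated with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.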
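The data flow described in the abstract (interleave vision and action features, mask part of the history, encode the sequence, read out the final vision token as a single history vector) can be illustrated with a small sketch. This is not the authors' implementation. All function names, the masking ratio, zero-vector masking, and the choice never to mask the final token are assumptions made for illustration, and the simple causal running average below is only a stand-in for the Mamba encoder named in the abstract. The sketch also assumes vision and action features have already been projected to the same dimension and that the sequence ends on a vision token.

```python
import random


def interleave(vision_feats, action_feats):
    """Build the token sequence [v_0, a_0, v_1, a_1, ..., v_T].

    Assumption: there is one more vision feature than action features,
    so the sequence ends on a vision token.
    """
    assert len(vision_feats) == len(action_feats) + 1
    tokens = []
    for t, action in enumerate(action_feats):
        tokens.append(("vision", list(vision_feats[t])))
        tokens.append(("action", list(action)))
    tokens.append(("vision", list(vision_feats[-1])))
    return tokens


def mask_tokens(tokens, mask_ratio, rng):
    """Zero out a random subset of tokens (hypothetical masking scheme).

    The final token is never masked here, which is an assumption of
    this sketch and not something stated in the abstract.
    """
    candidates = range(len(tokens) - 1)
    n_mask = int(mask_ratio * len(candidates))
    masked_idx = set(rng.sample(candidates, n_mask))
    masked = [
        (kind, [0.0] * len(vec) if i in masked_idx else vec)
        for i, (kind, vec) in enumerate(tokens)
    ]
    return masked, sorted(masked_idx)


def causal_scan(tokens, decay=0.5):
    """Stand-in sequence encoder: h_t = decay * h_{t-1} + (1 - decay) * x_t.

    A real system would use a learned encoder (Mamba in the abstract).
    """
    hidden = [0.0] * len(tokens[0][1])
    outputs = []
    for _, x in tokens:
        hidden = [decay * h + (1.0 - decay) * xi for h, xi in zip(hidden, x)]
        outputs.append(list(hidden))
    return outputs


def history_vector(vision_feats, action_feats, mask_ratio=0.0, seed=0):
    """Return the encoder output at the final vision token plus masked indices."""
    rng = random.Random(seed)
    tokens = interleave(vision_feats, action_feats)
    tokens, masked_idx = mask_tokens(tokens, mask_ratio, rng)
    outputs = causal_scan(tokens)
    return outputs[-1], masked_idx


# Toy history: three observations and the two actions between them.
vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
actions = [[0.5, 0.5], [0.2, 0.2]]
context, masked = history_vector(vision, actions, mask_ratio=0.5)
print(len(context), masked)
```

In this sketch the single vector `context` plays the role of the compact history representation: during pretraining a decoder would be asked to reconstruct the masked tokens from it, and at control time it would be passed to the downstream policy as global context.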

Related papers