Fugu-MT Paper Translation (Abstract): Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Paper overview: Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

arxiv url: http://arxiv.org/abs/2606.18960v2
Date: Thu, 18 Jun 2026 07:33:11 GMT
Status: Translation complete
Last updated in system: 2026-06-19 13:55:51.900803
Title: Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation
Title (reference translation): Mem-World: Memory-Augmented Action-Conditioned World Model for Persistent Robot Manipulation
Authors: Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, Jun Shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia
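Abstract summary: Action-conditioned world models have emerged as a promising paradigm for robot learning. Mem-World is a memory-augmented multi-view action-conditioned world model. W-VMem is a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements.
Reference score (independently computed attention measure): 55.42006264038458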
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
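Abstract (reference translation): Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. Frequent end-effector occlusions and wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core is W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.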
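
Illustrative sketch (not from the paper): the abstract says history frames are selected via surfel-based rendering and scoring so that the context is informative and non-redundant, but it does not give the procedure. The Python below is one plausible reading under stated assumptions: a hypothetical index `surfel_to_frames` maps each surfel id to the history frames that observed it, `visible_surfels` stands in for the set of surfels that rendering at the future camera pose would report as visible, and frames are picked greedily by how many not-yet-covered surfels they add. All names and the greedy coverage rule are assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List, Set


def select_history_frames(
    surfel_to_frames: Dict[int, Set[int]],
    visible_surfels: Set[int],
    k: int,
) -> List[int]:
    """Greedily pick up to k history frames covering the visible surfels.

    Hypothetical helper: the index layout and scoring rule are assumptions.
    """
    # Invert the index: frame id -> visible surfels that frame observed.
    frame_cover: Dict[int, Set[int]] = {}
    for surfel in visible_surfels:
        for frame in surfel_to_frames.get(surfel, ()):
            frame_cover.setdefault(frame, set()).add(surfel)

    selected: List[int] = []
    uncovered = set(visible_surfels)
    while len(selected) < k and uncovered:
        # Score = number of still-uncovered surfels a frame would add;
        # ties go to the larger (assumed more recent) frame id.
        best = max(
            (f for f in frame_cover if f not in selected),
            key=lambda f: (len(frame_cover[f] & uncovered), f),
            default=None,
        )
        # Stop when no remaining frame adds new information.
        if best is None or not (frame_cover[best] & uncovered):
            break
        selected.append(best)
        uncovered -= frame_cover[best]
    return selected


if __name__ == "__main__":
    # Toy index: surfel id -> frames in which it was observed.
    memory = {0: {1, 2}, 1: {2}, 2: {3}, 3: {3, 4}, 4: {5}}
    print(select_history_frames(memory, {0, 1, 2, 3}, k=3))
```

Scoring by marginal coverage rather than raw overlap is what keeps the selection non-redundant in this sketch: a frame that only re-observes surfels already covered by an earlier pick scores zero and is skipped.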

Related papers