Fugu-MT Paper Translation (Abstract): MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

Paper overview: MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2606.09827v1
Date: Mon, 08 Jun 2026 17:59:53 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:07.690517
Title: MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models
Title (reference translation): MemoryVLA++: Temporal Modeling with Memory and Imagination in Vision-Language-Action Models
Authors: Hao Shi, Weiye Li, Bin Xie, Yulin Wang, Renping Zhou, Tiancai Wang, Xiangyu Zhang, Ping Luo, Gao Huang
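Abstract summary: Temporal modeling is essential for robotic manipulation because effective control requires memory of past interactions and imagination of future states. This paper proposes MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. The proposed method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks.
Reference score (independently computed attention score): 80.70528162709276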
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution. Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences. We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9%, +26%, +28% gains on general, memory-dependent, and imagination-dependent tasks. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web
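Abstract (reference translation): Temporal modeling is essential for robotic manipulation because effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely mainly on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, on the hippocampal system to preserve episodic memory of past experience, and on internal models to imagine how future states may evolve. Inspired by these mechanisms, MemoryVLA++ is a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, which form the working memory. These tokens query the Perceptual-Cognitive Memory Bank to retrieve relevant historical context. The memory bank stores low-level details and high-level semantics from past interactions and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form fully temporal-aware tokens. The resulting tokens condition a diffusion action expert that predicts temporally consistent action sequences. Extensive experiments were conducted on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. The proposed method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots it achieves gains of +9%, +26%, and +28% on general, memory-dependent, and imagination-dependent tasks, respectively. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web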
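
The abstract describes tokens that query a memory bank and a bank that is updated through redundancy-aware consolidation, but it gives no implementation details. The following is only a minimal illustrative sketch of that general idea, not the authors' method. Everything in it is an assumption made for illustration: the class name `MemoryBank`, the fixed `capacity`, the softmax-attention readout, and the rule of averaging the most similar adjacent pair of entries once the bank is full.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class MemoryBank:
    """Hypothetical fixed-capacity memory bank (illustrative, not the paper's code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # stored token vectors, oldest first

    def retrieve(self, query):
        """Softmax-attention readout over stored entries; returns the query if empty."""
        if not self.entries:
            return list(query)
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, e)) / scale for e in self.entries]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [
            sum(w * e[i] for w, e in zip(weights, self.entries)) / total
            for i in range(len(query))
        ]

    def consolidate(self, token):
        """Append a token; if over capacity, average the most similar adjacent pair."""
        self.entries.append(list(token))
        if len(self.entries) <= self.capacity:
            return
        sims = [
            cosine(self.entries[i], self.entries[i + 1])
            for i in range(len(self.entries) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(self.entries[i], self.entries[i + 1])]
        self.entries[i:i + 2] = [merged]


bank = MemoryBank(capacity=2)
for tok in ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):
    bank.consolidate(tok)  # the third token triggers a merge of the two duplicates
context = bank.retrieve([0.0, 10.0])  # readout dominated by the matching entry
```

In this sketch, merging near-duplicate neighbours keeps the bank at a fixed size while preserving distinct entries. Whether MemoryVLA++ uses this particular rule is not stated in the abstract.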

Related papers