Fugu-MT Paper Translation (Abstract): VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory

Paper Summary: VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory

arxiv url: http://arxiv.org/abs/2603.04910v1
Date: Thu, 05 Mar 2026 07:52:50 GMT
Status: Translation complete
Last updated in system: 2026-03-06 22:06:11.130295
Title: VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory
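Title (reference translation): VPWEM: Non-Markovian Visuomotor Policy Using Working and Episodic Memory
Authors: Yuheng Lei, Zhixuan Liang, Hongyuan Zhang, Ping Luo
Abstract summary: VPWEM is a non-Markovian visuomotor policy equipped with working memory and episodic memory. It uses both short-term and episode-wide information for action generation, with nearly constant memory and computation per step.
Reference score (independently computed attention measure): 31.464584758455356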
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single-step observations or short-context histories, making them struggle with non-Markovian tasks that require long-term memory. Simply enlarging the context window incurs substantial computational and memory costs and encourages overfitting to spurious correlations, leading to catastrophic failures under distribution shift and violating real-time constraints in robotic systems. By contrast, humans can compress important past experiences into long-term memories and exploit them to solve tasks throughout their lifetime. In this paper, we propose VPWEM, a non-Markovian visuomotor policy equipped with working and episodic memories. VPWEM retains a sliding window of recent observation tokens as short-term working memory, and introduces a Transformer-based contextual memory compressor that recursively converts out-of-window observations into a fixed number of episodic memory tokens. The compressor uses self-attention over a cache of past summary tokens and cross-attention over a cache of historical observations, and is trained jointly with the policy. We instantiate VPWEM on diffusion policies to exploit both short-term and episode-wide information for action generation with nearly constant memory and computation per step. Experiments demonstrate that VPWEM outperforms state-of-the-art baselines including diffusion policies and vision-language-action (VLA) models by more than 20% on the memory-intensive manipulation tasks in MIKASA and achieves an average 5% improvement on the mobile manipulation benchmark MoMaRT. Code is available at https://github.com/HarryLui98/code_vpwem.
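Abstract (reference translation): Imitation learning from human demonstrations has been highly successful in robotic control, but most visuomotor policies are still conditioned on single-step observations or short-context histories, so they struggle with non-Markovian tasks that require long-term memory. Simply enlarging the context window substantially increases computational and memory costs and encourages overfitting to spurious correlations, which leads to catastrophic failures under distribution shift and violates real-time constraints in robotic systems. In contrast, humans can compress important past experiences into long-term memories and use them to solve tasks throughout their lives. This paper proposes VPWEM, a non-Markovian visuomotor policy equipped with working memory and episodic memory. VPWEM keeps a sliding window of recent observation tokens as short-term working memory and introduces a Transformer-based contextual memory compressor that recursively converts out-of-window observations into a fixed number of episodic memory tokens. The compressor applies self-attention over a cache of past summary tokens and cross-attention over a cache of historical observations, and it is trained jointly with the policy. VPWEM is instantiated on diffusion policies so that action generation can use both short-term and episode-wide information with nearly constant memory and computation per step. Experiments show that VPWEM outperforms state-of-the-art baselines, including diffusion policies and vision-language-action (VLA) models, by more than 20% on the memory-intensive manipulation tasks in MIKASA, and achieves an average improvement of 5% on the mobile manipulation benchmark MoMaRT. Code is available at https://github.com/HarryLui98/code_vpwem.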
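The memory bookkeeping described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `TwoMemoryContext`, the helper `attend`, the parameters `cache_size` and `seed`, and the 50/50 blending rule are all hypothetical, and an unlearned single-head dot-product attention stands in for the trained Transformer compressor. The sketch only shows why the context handed to the policy stops growing once the sliding window is full.

```python
import math
import random
from collections import deque


def attend(queries, keys):
    """Single-head scaled dot-product attention where keys also serve as values."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) / total for d in range(dim)]
        )
    return outputs


class TwoMemoryContext:
    """Hypothetical working + episodic memory bookkeeping (not the authors' code)."""

    def __init__(self, window, num_memory_tokens, dim, cache_size=4, seed=0):
        rng = random.Random(seed)
        self.window = window
        self.working = deque()  # sliding window of recent observation tokens
        self.evicted = deque(maxlen=cache_size)  # bounded cache of out-of-window tokens
        # Fixed number of episodic tokens; random init stands in for learned parameters.
        self.episodic = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_memory_tokens)
        ]

    def compress(self):
        """Recursive update: new memory depends on old memory and cached observations."""
        mixed = attend(self.episodic, self.episodic)  # self-attention over summary tokens
        read = attend(mixed, list(self.evicted))  # cross-attention over cached observations
        # Assumed blending rule: average the mixed memory with what was read.
        self.episodic = [
            [0.5 * m + 0.5 * r for m, r in zip(m_tok, r_tok)]
            for m_tok, r_tok in zip(mixed, read)
        ]

    def step(self, obs_token):
        """Add one observation and return the tokens a policy would condition on."""
        self.working.append(obs_token)
        if len(self.working) > self.window:
            self.evicted.append(self.working.popleft())
            self.compress()
        return self.episodic + list(self.working)


ctx = TwoMemoryContext(window=3, num_memory_tokens=2, dim=4)
lengths = []
for t in range(10):
    obs = [float(t)] * 4  # toy observation token
    lengths.append(len(ctx.step(obs)))
print(lengths)  # [3, 4, 5, 5, 5, 5, 5, 5, 5, 5]
```

After the window fills, every step conditions on exactly `num_memory_tokens + window` tokens, which is the sense in which per-step memory and computation stay nearly constant regardless of episode length.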

Related Papers