Fugu-MT Paper Translation (Summary): ReMem-VLA: Empowering Vision-Language-Action Model with Memory via Dual-Level Recurrent Queries

Paper summary: ReMem-VLA: Empowering Vision-Language-Action Model with Memory via Dual-Level Recurrent Queries

arxiv url: http://arxiv.org/abs/2603.12942v1
Date: Fri, 13 Mar 2026 12:38:42 GMT
Status: Translation complete
Last updated in system: 2026-03-21 18:33:56.766201
Title: ReMem-VLA: Empowering Vision-Language-Action Model with Memory via Dual-Level Recurrent Queries
Title (reference translation): ReMem-VLA: A Vision-Language-Action Model with Memory via Dual-Level Recurrent Queries
Authors: Hang Li, Fengyi Shen, Dong Chen, Liudi Yang, Xudong Wang, Jinkui Shi, Zhenshan Bing, Ziyuan Liu, Alois Knoll
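Abstract summary: This paper introduces ReMem-VLA, a vision-language-action (VLA) model equipped with two sets of learnable queries. These queries are trained end-to-end to aggregate and maintain relevant context over time. ReMem-VLA exhibits strong memory capabilities across multiple dimensions.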
Reference score (independently computed attention score): 45.23935281952228
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language-action (VLA) models for closed-loop robot control are typically cast under the Markov assumption, making them prone to errors on tasks requiring historical context. To incorporate memory, existing VLAs either retrieve from a memory bank, which can be misled by distractors, or extend the frame window, whose fixed horizon still limits long-term retention. In this paper, we introduce ReMem-VLA, a Recurrent Memory VLA model equipped with two sets of learnable queries: frame-level recurrent memory queries for propagating information across consecutive frames to support short-term memory, and chunk-level recurrent memory queries for carrying context across temporal chunks for long-term memory. These queries are trained end-to-end to aggregate and maintain relevant context over time, implicitly guiding the model's decisions without additional training or inference cost. Furthermore, to enhance visual memory, we introduce Past Observation Prediction as an auxiliary training objective. Through extensive memory-centric simulation and real-world robot experiments, we demonstrate that ReMem-VLA exhibits strong memory capabilities across multiple dimensions, including spatial, sequential, episodic, temporal, and visual memory. ReMem-VLA significantly outperforms memory-free VLA baselines $\pi_{0.5}$ and OpenVLA-OFT and surpasses MemoryVLA on memory-dependent tasks by a large margin.
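Abstract (reference translation): Vision-language-action (VLA) models for closed-loop robot control are usually built on the Markov assumption, which makes them error-prone on tasks that require historical context. To add memory, existing VLAs either retrieve from a memory bank, which distractors can mislead, or extend the frame window, whose fixed horizon still limits long-term retention. This paper introduces ReMem-VLA, a recurrent-memory VLA model equipped with two sets of learnable queries: frame-level recurrent memory queries, which propagate information across consecutive frames to support short-term memory, and chunk-level recurrent memory queries, which carry context across temporal chunks for long-term memory. These queries are trained end-to-end to aggregate and maintain relevant context over time, and they implicitly guide the model's decisions without additional training or inference cost. In addition, to strengthen visual memory, the paper introduces Past Observation Prediction as an auxiliary training objective. Extensive memory-centric simulation and real-world robot experiments show that ReMem-VLA has strong memory capabilities across multiple dimensions, including spatial, sequential, episodic, temporal, and visual memory. ReMem-VLA significantly outperforms the memory-free VLA baselines $\pi_{0.5}$ and OpenVLA-OFT, and it surpasses MemoryVLA by a large margin on memory-dependent tasks.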
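
The two memory levels described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the abstract only says that learnable queries are trained end-to-end, so the class `DualLevelMemory`, the helper `blend`, and the parameters `chunk_size`, `frame_rate`, and `chunk_rate` below are all hypothetical, and a fixed exponential moving average stands in for the learned query update. The sketch only illustrates the timing: one state is carried from frame to frame, and a second state is refreshed once per temporal chunk.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


def blend(memory: Vector, update: Vector, rate: float) -> Vector:
    """Move `memory` toward `update` by `rate`.

    Assumption: a fixed moving average replaces the learned query update.
    """
    return [(1.0 - rate) * m + rate * u for m, u in zip(memory, update)]


@dataclass
class DualLevelMemory:
    dim: int
    chunk_size: int = 4       # hypothetical: frames per temporal chunk
    frame_rate: float = 0.5   # hypothetical: fast update for short-term state
    chunk_rate: float = 0.1   # hypothetical: slow update for long-term state
    frame_memory: Vector = field(init=False, default_factory=list)
    chunk_memory: Vector = field(init=False, default_factory=list)
    frames_seen: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Both memories start empty (all zeros).
        self.frame_memory = [0.0] * self.dim
        self.chunk_memory = [0.0] * self.dim

    def step(self, frame_features: Vector) -> Vector:
        # Frame-level state: updated on every frame and carried to the next.
        self.frame_memory = blend(self.frame_memory, frame_features, self.frame_rate)
        self.frames_seen += 1
        # Chunk-level state: updated only when a chunk of frames completes.
        if self.frames_seen % self.chunk_size == 0:
            self.chunk_memory = blend(self.chunk_memory, self.frame_memory, self.chunk_rate)
        # Concatenated context that a downstream action head could condition on.
        return frame_features + self.frame_memory + self.chunk_memory


memory = DualLevelMemory(dim=2, chunk_size=2)
for frame in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]):
    context = memory.step(frame)
print(len(context))  # 6: current frame, frame-level state, chunk-level state
```

In this sketch the chunk-level state changes only after the second frame, so the third frame sees a long-term summary of the first chunk alongside its own short-term state.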

Related papers