MEM: Multi-Scale Embodied Memory for Vision Language Action Models
- URL: http://arxiv.org/abs/2603.03596v1
- Date: Wed, 04 Mar 2026 00:03:02 GMT
- Title: MEM: Multi-Scale Embodied Memory for Vision Language Action Models
- Authors: Marcel Torne, Karl Pertsch, Homer Walke, Kyle Vedder, Suraj Nair, Brian Ichter, Allen Z. Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, Karan Dhabalia, Michael Equi, Quan Vuong, Jost Tobias Springenberg, Sergey Levine, Chelsea Finn, Danny Driess,
- Abstract summary: We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies.<n>MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory.<n>MEM enables robot policies to perform tasks that span up to fifteen minutes, like cleaning up a kitchen, or preparing a grilled cheese sandwich.
- Score: 73.3883864595845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e.g., a robot remembering the object it wants to pick up once its arm occludes it). In this work, our main insight is that an effective memory architecture for long-horizon robotic control should combine multiple modalities to capture these different levels of abstraction. We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies. MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory. Together, they enable robot policies to perform tasks that span up to fifteen minutes, like cleaning up a kitchen, or preparing a grilled cheese sandwich. Additionally, we find that memory enables MEM policies to intelligently adapt manipulation strategies in-context.
Related papers
- RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies [54.23445842621374]
Memory is critical for long-horizon and history-dependent robotic manipulation.<n>Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms.<n>We introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models.
arXiv Detail & Related papers (2026-03-04T21:59:32Z) - From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents [78.30630000529133]
We propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory.<n> MM-Mem memory structures hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic.<n>Experiments confirm the effectiveness of MM-Mem on both offline and streaming tasks.
arXiv Detail & Related papers (2026-03-02T05:12:45Z) - RMBench: Memory-Dependent Robotic Manipulation Benchmark with Insights into Policy Design [77.30163153176954]
RMBench is a simulation benchmark comprising 9 manipulation tasks that span multiple levels of memory complexity.<n>Mem-0 is a modular manipulation policy with explicit memory components designed to support controlled ablation studies.<n>We identify memory-related limitations in existing policies and provide empirical insights into how architectural design choices influence memory performance.
arXiv Detail & Related papers (2026-03-01T18:59:59Z) - MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization [57.17751568928966]
We propose MetaMem, a framework that augments memory systems with a self-evolving meta-memory.<n>During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks.<n>Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%.
arXiv Detail & Related papers (2026-01-27T04:46:23Z) - MemVerse: Multimodal Memory for Lifelong Learning Agents [35.218549149012844]
We introduce MemVerse, a model-agnostic, plug-and-play memory framework.<n>MemVerse bridges fast parametric recall with hierarchical retrieval-based memory.<n>It enables scalable and adaptive multimodal intelligence.
arXiv Detail & Related papers (2025-12-03T10:06:14Z) - MOOM: Maintenance, Organization and Optimization of Memory in Ultra-Long Role-Playing Dialogues [30.599201653940852]
Memory extraction is crucial for maintaining coherent ultra-long dialogues in human-robot role-playing scenarios.<n>We propose MOOM, the first dual-branch memory plugin that leverages literary theory by modeling plot development and character portrayal as core storytelling elements.<n>MOOM further integrates a forgetting mechanism, inspired by the competition-inhibition'' memory theory, to constrain memory capacity and mitigate uncontrolled growth.
arXiv Detail & Related papers (2025-09-15T12:35:14Z) - MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation [59.31354761628506]
Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it.<n>We propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation.<n>We evaluate it on 150+ simulation and real-world tasks across three robots.
arXiv Detail & Related papers (2025-08-26T17:57:16Z) - RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Interactive Environmental Learning in Physical Embodied Systems [41.89907261427986]
Embodied agents face persistent challenges in real-world environments, including partial observability, limited spatial reasoning, and high-latency multi-memory integration.<n>We present RoboMemory, a brain-inspired framework that unifies Spatial, Temporal, Episodic, and Semantic memory under a parallelized architecture for efficient long-horizon planning and interactive environmental learning.<n> Experiments on EmbodiedBench show that RoboMemory, built on Qwen2.5-VL-72B-Ins, improves average success rates by 25% over its baseline and exceeds the closed-source state-of-the-art (SOTA) Gemini-1.5-
arXiv Detail & Related papers (2025-08-02T15:39:42Z) - From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs [34.361000444808454]
Memory is the process of encoding, storing, and retrieving information.<n>In the era of large language models (LLMs), memory refers to the ability of an AI system to retain, recall, and use information from past interactions to improve future responses and interactions.
arXiv Detail & Related papers (2025-04-22T15:05:04Z) - Self-Attentive Associative Memory [69.40038844695917]
We propose to separate the storage of individual experiences (item memory) and their occurring relationships (relational memory)
We achieve competitive results with our proposed two-memory model in a diversity of machine learning tasks.
arXiv Detail & Related papers (2020-02-10T03:27:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.