Related papers: MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

URL: http://arxiv.org/abs/2508.19236v1
Date: Tue, 26 Aug 2025 17:57:16 GMT
Title: MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Authors: Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang,
Abstract summary: Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it.<n>We propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation.<n>We evaluate it on 150+ simulation and real-world tasks across three robots.
Score: 59.31354761628506
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA

Related papers

VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory [31.464584758455356]
VPWEM is a non-Markovian visuomotor policy equipped with working and episodic memories.<n>It exploits both short-term and episode-wide information for action generation with nearly constant memory and computation per step.
arXiv Detail & Related papers (2026-03-05T07:52:50Z)
RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies [54.23445842621374]
Memory is critical for long-horizon and history-dependent robotic manipulation.<n>Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms.<n>We introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models.
arXiv Detail & Related papers (2026-03-04T21:59:32Z)
MEM: Multi-Scale Embodied Memory for Vision Language Action Models [73.3883864595845]
We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies.<n>MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory.<n>MEM enables robot policies to perform tasks that span up to fifteen minutes, like cleaning up a kitchen, or preparing a grilled cheese sandwich.
arXiv Detail & Related papers (2026-03-04T00:03:02Z)
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents [78.30630000529133]
We propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory.<n> MM-Mem memory structures hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic.<n>Experiments confirm the effectiveness of MM-Mem on both offline and streaming tasks.
arXiv Detail & Related papers (2026-03-02T05:12:45Z)
HiMem: Hierarchical Long-Term Memory for LLM Long-Horizon Agents [3.9396865837159822]
HiMem is a hierarchical long-term memory framework for long-horizon dialogues.<n>It supports memory construction, retrieval, and dynamic updating during sustained interactions.<n>Results show HiMem consistently outperforms representative baselines in accuracy, consistency, and long-term reasoning.
arXiv Detail & Related papers (2026-01-10T01:26:01Z)
Memory in the Age of AI Agents [217.9368190980982]
This work aims to provide an up-to-date landscape of current agent memory research.<n>We identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory.<n>To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks.
arXiv Detail & Related papers (2025-12-15T17:22:34Z)
ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL [48.214881182054164]
We propose ELMUR, a transformer architecture with structured external memory.<n>ELMUR extends effective horizons up to 100,000 times beyond the attention window.<n>It achieves a 100% success rate on a synthetic T-Maze task with corridors up to one million steps.
arXiv Detail & Related papers (2025-10-08T15:50:34Z)
RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Lifelong Learning in Physical Embodied Systems [29.881808496043387]
We present RoboMemory, a brain-inspired multi-memory framework for lifelong learning in physical embodied systems.<n>It addresses challenges in real-world environments: continuous learning, multi-module memory latency, task correlation capture, and infinite-loop mitigation in closed-loop planning.
arXiv Detail & Related papers (2025-08-02T15:39:42Z)
FindingDory: A Benchmark to Evaluate Memory in Embodied Agents [49.89792845476579]
We introduce a new benchmark for long-range embodied tasks in the Habitat simulator.<n>This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness.
arXiv Detail & Related papers (2025-06-18T17:06:28Z)
Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in reinforcement learning environments that are partially observable and long-term. We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents. Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
arXiv Detail & Related papers (2024-10-14T03:50:17Z)
KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems [12.461941212597877]
Embodied AI agents often face difficulties with in-context memory, leading to inefficiencies and errors in task execution.<n>We introduce KARMA, an innovative memory system that integrates long-term and short-term memory modules.<n>Our memory-augmented embodied AI agent improves success rates by 1.3x and 2.3x in Composite Tasks and Complex Tasks.
arXiv Detail & Related papers (2024-09-23T11:02:46Z)
HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing [33.720656946186885]
Hierarchical Memory Transformer (HMT) is a novel framework that facilitates a model's long-context processing ability.<n>HMT consistently improves the long-context processing ability of existing models.
arXiv Detail & Related papers (2024-05-09T19:32:49Z)
XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model [137.50614198301733]
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores. We develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores. XMem greatly exceeds state-of-the-art performance on long-video datasets.
arXiv Detail & Related papers (2022-07-14T17:59:37Z)
LaMemo: Language Modeling with Look-Ahead Memory [50.6248714811912]
We propose Look-Ahead Memory (LaMemo) that enhances the recurrence memory by incrementally attending to the right-side tokens. LaMemo embraces bi-directional attention and segment recurrence with an additional overhead only linearly proportional to the memory length. Experiments on widely used language modeling benchmarks demonstrate its superiority over the baselines equipped with different types of memory.
arXiv Detail & Related papers (2022-04-15T06:11:25Z)
Kanerva++: extending The Kanerva Machine with differentiable, locally block allocated latent memory [75.65949969000596]
Episodic and semantic memory are critical components of the human memory model. We develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory. We demonstrate that this allocation scheme improves performance in memory conditional image generation.
arXiv Detail & Related papers (2021-02-20T18:40:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.