Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
- URL: http://arxiv.org/abs/2602.20200v1
- Date: Sun, 22 Feb 2026 15:39:34 GMT
- Title: Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
- Authors: Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie,
- Abstract summary: We introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories. LCM injects a learned consistency constraint that enforces temporal coherence and smoothness of the trajectory.
- Score: 95.89924101984566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. Such models typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency: a pronounced distributional gap between isotropic noise priors and target action distributions increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness: existing policies condition solely on the current observation, neglecting constraints from the history of executed actions and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and trajectory smoothness. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves a 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains a 38% average success rate on RoboTwin 2.0 Hard. In real-world evaluation, OptimusVLA ranks best on the Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering a 2.9x inference speedup.
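The two memories described in the abstract can be pictured with a short sketch. The snippet below is a minimal, hypothetical illustration (not taken from the paper) of the two ideas: a retrieval-based prior that starts the generative process from the action chunk of a semantically similar past trajectory instead of isotropic Gaussian noise (the GPM role), and a temporal-consistency penalty computed against recently executed actions (the LCM role). All function names, array shapes, and the cosine-similarity retrieval are assumptions made for illustration only.

```python
import numpy as np

def retrieve_prior(task_embedding, memory_keys, memory_actions):
    """Global-prior sketch: return the action chunk stored for the most
    semantically similar past task, used to initialize denoising instead of
    isotropic Gaussian noise."""
    # Cosine similarity between the current task embedding and stored keys.
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    query = task_embedding / np.linalg.norm(task_embedding)
    best = int(np.argmax(keys @ query))
    return memory_actions[best]          # shape: (horizon, action_dim)

def consistency_penalty(proposed_chunk, executed_actions, weight=1.0):
    """Local-consistency sketch: penalize a proposed action chunk that deviates
    abruptly from the most recently executed actions, encouraging temporal
    coherence and smooth trajectories."""
    jump = np.sum((proposed_chunk[0] - executed_actions[-1]) ** 2)   # boundary smoothness
    jerk = np.sum(np.diff(proposed_chunk, axis=0) ** 2)              # within-chunk smoothness
    return weight * (jump + jerk)

# Usage sketch: start generation from a retrieved prior and score candidates
# with the consistency penalty. The intuition for efficiency is that a prior
# already close to the target action distribution needs fewer function
# evaluations (NFE) than starting from pure noise.
rng = np.random.default_rng(0)
memory_keys = rng.normal(size=(100, 64))         # stored task embeddings (hypothetical)
memory_actions = rng.normal(size=(100, 16, 7))   # stored action chunks: horizon 16, 7-DoF
prior = retrieve_prior(rng.normal(size=64), memory_keys, memory_actions)
penalty = consistency_penalty(prior, executed_actions=rng.normal(size=(8, 7)))
```

In the actual method both memories condition a learned generative policy; this sketch only makes the retrieval and smoothness intuitions concrete.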
Related papers
- Mean-Flow based One-Step Vision-Language-Action [15.497933767026568]
FlowMatching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks. They are constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. We propose a Mean-Flow based One-Step VLA approach, which resolves the noise-induced issues in the action generation process.
arXiv Detail & Related papers (2026-03-02T05:30:30Z)
- Universal Pose Pretraining for Generalizable Vision-Language-Action Policies [83.39008378156647]
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency. We propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment.
arXiv Detail & Related papers (2026-02-23T11:00:08Z)
- Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement [27.517125673741486]
Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic control. We propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens. We introduce a new benchmark that more effectively evaluates the long-horizon temporal dependency modeling ability of VLAs.
arXiv Detail & Related papers (2026-02-03T20:17:47Z)
- EvoVLA: Self-Evolving Vision-Language-Action Model [11.746804244345613]
Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components. EvoVLA achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent.
arXiv Detail & Related papers (2025-11-20T09:08:33Z)
- Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict. DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z)
- dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z)
- TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models [29.878993349922368]
Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. We propose Temporal Token Fusion (TTF), a training-free approach that integrates historical and current visual representations to enhance VLA inference quality.
arXiv Detail & Related papers (2025-08-15T12:03:34Z)
- CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling [84.51372201195132]
CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
arXiv Detail & Related papers (2025-06-24T17:30:27Z)
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [100.226572152954]
We present an optimized fine-tuning recipe for vision-language-action models (VLAs). Our recipe boosts OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26x. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot.
arXiv Detail & Related papers (2025-02-27T00:30:29Z)