EvoVLA: Self-Evolving Vision-Language-Action Model
- URL: http://arxiv.org/abs/2511.16166v1
- Date: Thu, 20 Nov 2025 09:08:33 GMT
- Title: EvoVLA: Self-Evolving Vision-Language-Action Model
- Authors: Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang
- Abstract summary: Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components. EvoVLA achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent.
- Score: 11.746804244345613
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
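The abstract names three mechanisms without giving equations. As a rough illustration only, here is a minimal PyTorch sketch of what such signals could look like; every function name, shape, and hyperparameter below is an assumption for exposition, not code from the EvoVLA repository (POE's curiosity term, for instance, is stood in for by a simple count-based bonus over discretized relative poses).

```python
# Illustrative sketch only -- names, shapes, and hyperparameters are assumed,
# not taken from the EvoVLA codebase.
import torch
import torch.nn.functional as F

def sar_triplet_loss(anchor, positive, hard_negative, margin=0.5):
    """SAR-style triplet objective: pull the current observation embedding
    toward the true-stage exemplar and push it away from a hard negative
    (e.g., a Gemini-generated image of a visually similar but wrong stage)."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(anchor, hard_negative, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

def poe_bonus(obj_pose, gripper_pose, visit_counts, bin_size=0.05):
    """POE-style curiosity grounded in the relative object-gripper pose
    rather than raw pixels. A count-based novelty bonus over discretized
    relative positions stands in for the paper's actual mechanism."""
    rel = obj_pose - gripper_pose                     # (3,) relative position
    key = tuple((rel / bin_size).round().long().tolist())
    visit_counts[key] = visit_counts.get(key, 0) + 1
    return visit_counts[key] ** -0.5                  # decays as a bin is revisited

class GatedMemoryFusion(torch.nn.Module):
    """Long-Horizon-Memory-style gated fusion: a learned sigmoid gate decides
    how much retained context to mix into the current step's features."""
    def __init__(self, dim):
        super().__init__()
        self.gate = torch.nn.Linear(2 * dim, dim)

    def forward(self, current, retained):
        g = torch.sigmoid(self.gate(torch.cat([current, retained], dim=-1)))
        return g * current + (1.0 - g) * retained
```

The gate lets the policy lean on retained context when the current observation is ambiguous, which is one way "selective context retention and gated fusion" could stabilize intrinsic shaping over long rollouts.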
Related papers
- LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [54.150202739999806]
LiLo-VLA is a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%.
arXiv Detail & Related papers (2026-02-25T03:33:39Z)
- Recursive Belief Vision Language Action Models [0.0]
Long-horizon manipulation requires persistent, action-conditioned state representations. Current vision-language models exhibit limited temporal and physical reasoning. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives.
arXiv Detail & Related papers (2026-02-24T08:02:16Z)
- Universal Pose Pretraining for Generalizable Vision-Language-Action Policies [83.39008378156647]
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency. We propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment.
arXiv Detail & Related papers (2026-02-23T11:00:08Z)
- Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation [95.89924101984566]
We introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories. LCM injects a learned consistency constraint that enforces temporal coherence and trajectory smoothness.
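The LCM constraint is described only at a high level; a minimal sketch follows, assuming a simple second-difference smoothness penalty over a predicted action chunk (the paper's actual constraint is learned and may differ substantially).

```python
import torch

def temporal_consistency_loss(actions: torch.Tensor) -> torch.Tensor:
    """actions: (B, T, action_dim) predicted action chunk. Penalizing first
    and second differences encourages the temporal coherence and trajectory
    smoothness described above; an assumed stand-in, not OptimusVLA's LCM."""
    vel = actions[:, 1:] - actions[:, :-1]   # frame-to-frame velocity
    acc = vel[:, 1:] - vel[:, :-1]           # frame-to-frame acceleration
    return vel.pow(2).mean() + acc.pow(2).mean()
```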
arXiv Detail & Related papers (2026-02-22T15:39:34Z)
- Self-Improving Vision-Language-Action Models with Data Generation via Residual RL [29.682761652941963]
Probe, Learn, Distill (PLD) is a three-stage plug-and-play framework that improves vision-language-action models. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks.
arXiv Detail & Related papers (2025-10-30T06:24:04Z)
- Contrastive Representation Regularization for Vision-Language-Action Models [64.10170453130324]
We introduce Robot State-aware Contrastive Loss (RS-CL), a representation regularization for Vision-Language-Action (VLA) models. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states, by using relative distances between the states as soft supervision. Our empirical results demonstrate that RS-CL substantially improves the manipulation performance of state-of-the-art VLA models.
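The summary is concrete enough to sketch: a contrastive regularizer whose targets soften with the distance between proprioceptive states. The Gaussian-style kernel, temperatures, and function name below are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def state_aware_contrastive_loss(features, states, tau_f=0.1, tau_s=1.0):
    """features: (B, D) policy embeddings; states: (B, S) proprioceptive states.
    Pairs whose robot states are close receive a higher target similarity,
    turning relative state distances into soft supervision."""
    features = F.normalize(features, dim=-1)
    logits = features @ features.t() / tau_f          # (B, B) similarity logits
    targets = F.softmax(-torch.cdist(states, states) / tau_s, dim=-1)
    return F.cross_entropy(logits, targets)           # soft-label cross-entropy
```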
arXiv Detail & Related papers (2025-10-02T06:41:22Z) - On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations [52.1029745126386]
In vision-language-action (VLA) models, robustness to real-world perturbations is critical for deployment. We propose RobustVLA, which defends against perturbations in VLA inputs and outputs. Experiments on LIBERO demonstrate that our RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone.
arXiv Detail & Related papers (2025-06-09T17:36:34Z)
- BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models [37.699828966838986]
BridgeVLA is a novel 3D VLA model that projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone. It utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. It is able to achieve a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency.
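To make the heatmap readout concrete, here is a minimal sketch of recovering a 2D action target from a predicted heatmap via soft-argmax; the soft-argmax choice and all names are assumptions, not BridgeVLA's released code.

```python
import torch

def soft_argmax_2d(heatmap: torch.Tensor) -> torch.Tensor:
    """heatmap: (B, H, W) unnormalized scores; returns (B, 2) expected (x, y),
    i.e., the probability-weighted pixel location used as the action target."""
    b, h, w = heatmap.shape
    probs = torch.softmax(heatmap.view(b, -1), dim=-1).view(b, h, w)
    ys = torch.arange(h, dtype=probs.dtype).view(1, h, 1)
    xs = torch.arange(w, dtype=probs.dtype).view(1, 1, w)
    return torch.stack([(probs * xs).sum(dim=(1, 2)),
                        (probs * ys).sum(dim=(1, 2))], dim=-1)
```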
arXiv Detail & Related papers (2025-06-09T17:36:34Z) - OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)