Self-Improving Vision-Language-Action Models with Data Generation via Residual RL
- URL: http://arxiv.org/abs/2511.00091v1
- Date: Thu, 30 Oct 2025 06:24:04 GMT
- Title: Self-Improving Vision-Language-Action Models with Data Generation via Residual RL
- Authors: Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi "Jim" Fan, Guanya Shi, Yuke Zhu
- Abstract summary: Probe, Learn, Distill (PLD) is a three-stage plug-and-play framework that improves vision-language-action models. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks.
- Score: 29.682761652941963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1, we train lightweight residual actors to probe failure regions of the VLA generalist. In Stage 2, we use a hybrid rollout scheme that aligns collected trajectories with the generalist's deployment distribution while capturing recovery behaviors. In Stage 3, we distill the curated trajectories back into the generalist with standard SFT. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks. Ablations show that residual probing and distribution-aware replay are key to collecting deployment-aligned data that improves both seen and unseen tasks, offering a scalable path toward self-improving VLA models.
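To make the three stages concrete, here is a minimal Python sketch of the data-generation loop, assuming hypothetical `ToyEnv`, `FrozenVLA`, and `ResidualActor` interfaces (none of these names come from the paper, and the per-step action mixing is our assumption about the hybrid rollout, not necessarily the authors' exact scheme):

```python
import numpy as np

class ToyEnv:
    """Trivial stand-in environment, for illustration only."""
    def __init__(self, obs_dim: int = 8):
        self.obs_dim = obs_dim
    def reset(self) -> np.ndarray:
        return np.random.randn(self.obs_dim)
    def step(self, action: np.ndarray):
        return np.random.randn(self.obs_dim), False  # (next_obs, done)

class FrozenVLA:
    """Stand-in for the frozen VLA generalist (hypothetical interface)."""
    def act(self, obs: np.ndarray) -> np.ndarray:
        return np.tanh(obs[:4])  # pretend 4-DoF action head

class ResidualActor:
    """Stage 1: lightweight residual policy trained with RL on the VLA's
    failure regions. Here just a linear map; in practice a small network
    trained with an off-the-shelf RL algorithm."""
    def __init__(self, obs_dim: int, act_dim: int, scale: float = 0.1):
        self.W = np.zeros((act_dim, obs_dim))
        self.scale = scale  # small corrections keep actions near the base policy
    def correction(self, obs: np.ndarray) -> np.ndarray:
        return self.scale * np.tanh(self.W @ obs)

def hybrid_rollout(env, vla, residual, p_base: float = 0.5, horizon: int = 100):
    """Stage 2: per-step mixing of base-only and residual-corrected actions
    (the per-step scheme is an assumption; the paper's exact switching rule
    may differ). Keeps data near the deployment distribution while still
    capturing the residual actor's recovery behaviors."""
    obs, traj = env.reset(), []
    for _ in range(horizon):
        a = vla.act(obs)
        if np.random.rand() >= p_base:        # recovery step
            a = a + residual.correction(obs)
        traj.append((obs, a))                 # log (obs, executed action) for SFT
        obs, done = env.step(a)
        if done:
            break
    return traj  # Stage 3: distill successful trajectories back via SFT

env, vla = ToyEnv(), FrozenVLA()
data = hybrid_rollout(env, vla, ResidualActor(obs_dim=8, act_dim=4))
```

The key design choice is that the residual correction is additive and kept small, so the logged trajectories stay close to the state distribution the generalist actually visits at deployment, which is what makes them useful for Stage 3 SFT.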
Related papers
- Data Distribution as a Lever for Guiding Optimizers Toward Superior Generalization in LLMs [60.68927774057402]
We show, for the first time, that a lower simplicity bias induces better generalization. Motivated by this insight, we demonstrate that reshaping the training data distribution, by upsampling or augmenting examples learned later in training, similarly reduces SB and leads to improved generalization. Our strategy improves the performance of multiple language models, including Phi2-2.7B, Llama3.2-1B, Gemma3-1B-PT, and Qwen3-0.6B-Base, achieving relative accuracy gains of up to 18% when fine-tuned with AdamW and Muon.
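As a toy illustration of that data-distribution lever, one could upsample examples by how late in training they were first fit; the `first_learned` statistic and the linear weighting below are hypothetical choices, not the paper's recipe:

```python
import numpy as np

def upsample_late_learned(first_learned_epoch: np.ndarray, n_draws: int,
                          rng: np.random.Generator) -> np.ndarray:
    """Sample dataset indices with probability increasing in the epoch at
    which each example was first fit (a rough proxy for 'learned later').
    Hypothetical sketch; the paper's exact weighting scheme may differ."""
    weights = first_learned_epoch - first_learned_epoch.min() + 1.0
    probs = weights / weights.sum()
    return rng.choice(len(first_learned_epoch), size=n_draws, p=probs)

rng = np.random.default_rng(0)
first_learned = rng.integers(1, 10, size=1000).astype(float)  # toy statistics
batch_idx = upsample_late_learned(first_learned, n_draws=256, rng=rng)
```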
arXiv Detail & Related papers (2026-01-31T07:40:36Z)
- Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it offers significant computational savings.
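A hedged sketch of the idea: sample several candidate action chunks from the VLA, score each with a pseudo-count estimator, and execute the most familiar one. The nearest-neighbor density proxy below is a hypothetical stand-in for the paper's learned estimator:

```python
import numpy as np

def pseudo_count(chunk: np.ndarray, demo_chunks: np.ndarray, k: int = 5) -> float:
    """Crude pseudo-count proxy: inverse mean distance to the k nearest
    demonstration chunks. Hypothetical stand-in for a learned estimator."""
    dists = np.linalg.norm(demo_chunks - chunk, axis=1)
    return 1.0 / (np.sort(dists)[:k].mean() + 1e-8)

def select_chunk(candidates: np.ndarray, demo_chunks: np.ndarray) -> np.ndarray:
    """Gradient-free verification: keep the candidate chunk the estimator
    deems most in-distribution (anti-exploration at test time)."""
    scores = [pseudo_count(c, demo_chunks) for c in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(0)
demo = rng.normal(size=(500, 16))   # flattened demonstration action chunks
cands = rng.normal(size=(8, 16))    # 8 sampled candidate chunks
best = select_chunk(cands, demo)    # chunk to actually execute
```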
arXiv Detail & Related papers (2025-12-02T14:42:54Z)
- SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models [42.89413870143421]
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations. Current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. We propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework.
arXiv Detail & Related papers (2025-11-19T16:52:23Z)
- Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict. DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z)
- Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL). We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z)
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning [81.7764584515496]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. These models face two fundamental challenges: the scarcity and high cost of large-scale human-operated robotic trajectories. We introduce SimpleVLA-RL, an efficient reinforcement learning framework tailored for VLA models.
arXiv Detail & Related papers (2025-09-11T17:59:17Z) - Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance [63.33213516925946]
We introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
arXiv Detail & Related papers (2025-09-02T07:51:59Z) - CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning [7.780242426487376]
We propose Chunked RL, a novel reinforcement learning framework for Vision-Language-Action (VLA) models. Within this framework, we extend temporal difference (TD) learning to incorporate action chunking, a prominent characteristic of VLA models. We then propose CO-RFT, an algorithm aimed at fine-tuning VLA models using a limited set of demonstrations.
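A minimal sketch of what a TD target over an action chunk could look like, treating the h-step chunk as a single decision; the discounting and critic interface are assumptions, not CO-RFT's exact formulation:

```python
import numpy as np

def chunked_td_target(rewards: np.ndarray, next_value: float,
                      gamma: float = 0.99) -> float:
    """TD target over an h-step action chunk: discounted sum of the chunk's
    rewards plus the bootstrapped critic value after the chunk. Resembles
    n-step TD with n equal to the chunk length."""
    h = len(rewards)
    discounts = gamma ** np.arange(h)
    return float(discounts @ rewards + gamma ** h * next_value)

# Example: a 4-step chunk followed by a critic estimate of 2.0
target = chunked_td_target(np.array([0.0, 0.0, 1.0, 0.5]), next_value=2.0)
```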
arXiv Detail & Related papers (2025-08-04T09:11:48Z)
- RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models [11.688277445120567]
Vision-Language-Action models (VLA) have demonstrated remarkable capabilities and promising potential in solving complex robotic manipulation tasks. Their substantial parameter sizes and high inference latency pose significant challenges for real-world deployment. We propose RLRC, a three-stage recovery method for compressed VLAs.
arXiv Detail & Related papers (2025-06-21T08:45:32Z) - Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in
Vision-Language Models [76.410400238974]
We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal.
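A rough sketch of the feedback signal: score each sampled VLM output with CLIP image-text similarity and use baseline-subtracted scores as rewards. The cosine-similarity placeholder below stands in for a real CLIP model and does not reproduce RLCF's actual update:

```python
import numpy as np

def clip_similarity(image_feat: np.ndarray, text_feat: np.ndarray) -> float:
    """Placeholder for CLIP: cosine similarity of precomputed embeddings."""
    return float(image_feat @ text_feat /
                 (np.linalg.norm(image_feat) * np.linalg.norm(text_feat)))

def clip_rewards(image_feat: np.ndarray,
                 candidate_text_feats: np.ndarray) -> np.ndarray:
    """Reward each sampled VLM output by its CLIP score, baseline-subtracted
    so above-average candidates are reinforced and the rest discouraged."""
    scores = np.array([clip_similarity(image_feat, t)
                       for t in candidate_text_feats])
    return scores - scores.mean()  # simple REINFORCE-style baseline

rng = np.random.default_rng(0)
img = rng.normal(size=512)           # toy image embedding
texts = rng.normal(size=(5, 512))    # 5 sampled caption embeddings
rewards = clip_rewards(img, texts)   # feedback for the test-time update
```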
arXiv Detail & Related papers (2023-05-29T11:03:59Z)