$π^{*}_{0.6}$: a VLA That Learns From Experience
- URL: http://arxiv.org/abs/2511.14759v2
- Date: Wed, 19 Nov 2025 04:34:49 GMT
- Title: $π^{*}_{0.6}$: a VLA That Learns From Experience
- Authors: Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szymon Jakubczak, Rowan Jen, Tim Jones, Ben Katz, Liyiming Ke, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Yao Lu, Vishnu Mano, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Charvi Sharma, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, Will Stoeckle, Alex Swerdlow, James Tanner, Marcel Torne, Quan Vuong, Anna Walling, Haohuan Wang, Blake Williams, Sukwon Yoo, Lili Yu, Ury Zhilinsky, Zhiyuan Zhou
- Abstract summary: We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP). We show that the $π^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine.
- Score: 66.7589999175632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call $π^{*}_{0.6}$, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the $π^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.
Related papers
- Self-Improving Vision-Language-Action Models with Data Generation via Residual RL [29.682761652941963]
Probe, Learn, Distill (PLD) is a three-stage plug-and-play framework that improves vision-language-action models. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks.
arXiv Detail & Related papers (2025-10-30T06:24:04Z)
- Residual Off-Policy RL for Finetuning Behavior Cloning Policies [41.99435186991878]
We present a recipe that combines the benefits of behavior cloning (BC) and reinforcement learning (RL) through a residual learning framework. Our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-degree-of-freedom (DoF) systems. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands.
arXiv Detail & Related papers (2025-09-23T17:59:46Z)
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning [81.7764584515496]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. These models face two fundamental challenges: scarcity and high cost of large-scale human-operated robotic trajectories. We introduce SimpleVLA-RL, an efficient reinforcement learning framework tailored for VLA models.
arXiv Detail & Related papers (2025-09-11T17:59:17Z)
- SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data [65.56911325914582]
We propose Self-play Reinforcement Learning (SeRL) to bootstrap Large Language Models (LLMs) training with limited initial data. The proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards.
arXiv Detail & Related papers (2025-05-25T13:28:04Z)
- VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning [14.099306230721245]
We present VLA-RL, an exploration-based framework that improves on online-collected data at test time. We fine-tune a pretrained vision-language model as a robotic process reward model, which is trained on pseudo reward labels annotated on automatically extracted task segments. VLA-RL enables OpenVLA-7B to surpass the strongest finetuned baseline by 4.5% on 40 challenging robotic manipulation tasks in LIBERO.
arXiv Detail & Related papers (2025-05-24T14:42:51Z)
- Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers [57.95157497749428]
We propose RL$^V$ that augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier. RL$^V$ boosts MATH accuracy by over 20% with parallel sampling and enables $8-32\times$ efficient test-time compute scaling.
arXiv Detail & Related papers (2025-05-07T22:41:26Z)
- Refined Policy Distillation: From VLA Generalists to RL Experts [18.186499704928092]
We introduce Refined Policy Distillation (RPD), a novel Reinforcement Learning-based policy refinement method. RPD distills and refines Vision-Language-Action Models (VLAs) into compact, high-performing expert policies. Our experimental results show that RPD enables the RL student to learn expert policies that outperform the VLA teacher in both dense and sparse reward settings.
arXiv Detail & Related papers (2025-03-06T12:52:11Z)
- RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning [53.8293458872774]
We propose Reinforcement Learning Distilled Generalists (RLDG) to generate high-quality training data for finetuning generalist policies. We demonstrate that generalist policies trained with RL-generated data consistently outperform those trained with human demonstrations. Our results suggest that combining task-specific RL with generalist policy distillation offers a promising approach for developing more capable and efficient robotic manipulation systems.
arXiv Detail & Related papers (2024-12-13T04:57:55Z)
- Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of learned reward functions (LRFs) as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z)
- Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [76.410400238974]
We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal.
arXiv Detail & Related papers (2023-05-29T11:03:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.