Refined Policy Distillation: From VLA Generalists to RL Experts
- URL: http://arxiv.org/abs/2503.05833v1
- Date: Thu, 06 Mar 2025 12:52:11 GMT
- Title: Refined Policy Distillation: From VLA Generalists to RL Experts
- Authors: Tobias Jülg, Wolfram Burgard, Florian Walter,
- Abstract summary: We present Refined Policy Distillation (RPD), an RL-based policy refinement method.<n>RPD enables the RL agent to learn expert policies that surpass the teacher's performance in both dense and sparse reward settings.<n>Our approach is even robust to changes in the camera perspective and can generalize to task variations that the underlying VLA cannot solve.
- Score: 18.186499704928092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent generalist Vision-Language-Action Models (VLAs) can perform a variety of tasks on real robots with remarkable generalization capabilities. However, reported success rates are often not on par with those of expert policies. Moreover, VLAs usually do not work out of the box and often must be fine-tuned as they are sensitive to setup changes. In this work, we present Refined Policy Distillation (RPD), an RL-based policy refinement method that enables the distillation of large generalist models into small, high-performing expert policies. The student policy is guided during the RL exploration by actions of a teacher VLA for increased sample efficiency and faster convergence. Different from previous work that focuses on applying VLAs to real-world experiments, we create fine-tuned versions of Octo and OpenVLA for ManiSkill2 to evaluate RPD in simulation. As our results for different manipulation tasks demonstrate, RPD enables the RL agent to learn expert policies that surpass the teacher's performance in both dense and sparse reward settings. Our approach is even robust to changes in the camera perspective and can generalize to task variations that the underlying VLA cannot solve.
Related papers
- What Can RL Bring to VLA Generalization? An Empirical Study [42.58811485942049]
Large Vision-Language Action (VLA) models have shown significant potential for embodied AI.<n>Their predominant training via supervised fine-tuning (SFT) limits generalization due to compounding errors under distribution shifts.<n>Our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning.
arXiv Detail & Related papers (2025-05-26T10:19:26Z) - LLM-Explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven by Large Language Models [28.264011412168347]
Policy exploration is critical in reinforcement learning (RL), where existing approaches include greedy, Gaussian process, etc.<n>We design LLM-Explorer to adaptively generate task-specific exploration strategies with large language models (LLMs)<n>Our design is a plug-in module compatible with various widely applied RL algorithms, including the DQN series, DDPG, TD3, and any possible variants developed based on them.
arXiv Detail & Related papers (2025-05-21T09:24:23Z) - Good Actions Succeed, Bad Actions Generalize: A Case Study on Why RL Generalizes Better [0.3021678014343889]
Supervised learning (SL) and reinforcement learning (RL) are widely used to train general-purpose agents for complex tasks.
This paper provides a direct comparison between SL and RL in terms of zero-shot generalization.
arXiv Detail & Related papers (2025-03-19T21:03:27Z) - Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [100.226572152954]
We present an optimized fine-tuning recipe for vision-language-action models (VLAs)<n>Our recipe boosts OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26$times$.<n>In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot.
arXiv Detail & Related papers (2025-02-27T00:30:29Z) - RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning [53.8293458872774]
We propose Reinforcement Learning Distilled Generalists (RLDG) to generate high-quality training data for finetuning generalist policies.<n>We demonstrate that generalist policies trained with RL-generated data consistently outperform those trained with human demonstrations.<n>Our results suggest that combining task-specific RL with generalist policy distillation offers a promising approach for developing more capable and efficient robotic manipulation systems.
arXiv Detail & Related papers (2024-12-13T04:57:55Z) - Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone [72.17534881026995]
We develop an offline and online fine-tuning approach called policy-agnostic RL (PA-RL)<n>We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm.
arXiv Detail & Related papers (2024-12-09T17:28:03Z) - VinePPO: Refining Credit Assignment in RL Training of LLMs [66.80143024475635]
We propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates.<n>Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time.
arXiv Detail & Related papers (2024-10-02T15:49:30Z) - Is Value Learning Really the Main Bottleneck in Offline RL? [70.54708989409409]
We show that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL.
We propose two simple test-time policy improvement methods and show that these methods lead to better performance.
arXiv Detail & Related papers (2024-06-13T17:07:49Z) - OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - Direct Random Search for Fine Tuning of Deep Reinforcement Learning
Policies [5.543220407902113]
We show that a direct random search is very effective at fine-tuning DRL policies by directly optimizing them using deterministic rollouts.
Our results show that this method yields more consistent and higher performing agents on the environments we tested.
arXiv Detail & Related papers (2021-09-12T20:12:46Z) - BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new for RL method, BRPO, which learns both the policy and allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.