CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2508.02219v1
- Date: Mon, 04 Aug 2025 09:11:48 GMT
- Title: CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning
- Authors: Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, Chunhe Xia
- Abstract summary: We propose Chunked RL, a novel reinforcement learning framework for Vision-Language-Action (VLA) models. Within this framework, we extend temporal difference (TD) learning to incorporate action chunking, a prominent characteristic of VLA models. We then propose CO-RFT, an algorithm aimed at fine-tuning VLA models using a limited set of demonstrations.
- Score: 7.780242426487376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action (VLA) models demonstrate significant potential for developing generalized policies in real-world robotic control. This progress inspires researchers to explore fine-tuning these models with Reinforcement Learning (RL). However, fine-tuning VLA models with RL still faces challenges related to sample efficiency, compatibility with action chunking, and training stability. To address these challenges, we explore the fine-tuning of VLA models through offline reinforcement learning incorporating action chunking. In this work, we propose Chunked RL, a novel reinforcement learning framework specifically designed for VLA models. Within this framework, we extend temporal difference (TD) learning to incorporate action chunking, a prominent characteristic of VLA models. Building upon this framework, we propose CO-RFT, an algorithm aimed at fine-tuning VLA models using a limited set of demonstrations (30 to 60 samples). Specifically, we first conduct imitation learning (IL) with full parameter fine-tuning to initialize both the backbone and the policy. Subsequently, we implement offline RL with action chunking to optimize the pretrained policy. Our empirical results in real-world environments demonstrate that CO-RFT outperforms previous supervised methods, achieving a 57% improvement in success rate and a 22.3% reduction in cycle time. Moreover, our method exhibits robust positional generalization capabilities, attaining a success rate of 44.3% in previously unseen positions.
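The central technical idea above is a temporal-difference backup that treats an entire action chunk, rather than a single action, as one decision. Below is a minimal, hypothetical PyTorch sketch of such a chunked TD loss; the critic and policy interfaces, the batch fields, the chunk length h, and the n-step-style discounting are illustrative assumptions, not details taken from the paper's implementation.

```python
# Hypothetical sketch of a TD backup over an action chunk a_{t:t+h}.
# Every interface here (critic, target_critic, policy, batch fields) is an
# illustrative assumption, not the CO-RFT implementation.
import torch
import torch.nn.functional as F

def chunked_td_loss(critic, target_critic, policy, batch, h: int, gamma: float = 0.99):
    """Offline TD loss in which one Bellman backup skips h environment steps.

    Assumed batch fields:
      obs      : o_t
      action   : the whole chunk a_{t:t+h}, flattened into one tensor
      reward   : discounted return accumulated inside the chunk,
                 sum_{k=0}^{h-1} gamma^k * r_{t+k}
      next_obs : o_{t+h}
      done     : 1.0 if the episode terminates within the chunk, else 0.0
    """
    q = critic(batch["obs"], batch["action"])
    with torch.no_grad():
        next_chunk = policy(batch["next_obs"])            # proposed chunk a_{t+h:t+2h}
        target_q = target_critic(batch["next_obs"], next_chunk)
        # The discount is gamma**h because one chunk covers h low-level steps.
        td_target = batch["reward"] + (gamma ** h) * (1.0 - batch["done"]) * target_q
    return F.mse_loss(q, td_target)
```

In the two-stage recipe described above, a loss of this form would be minimized during the offline RL stage, after the imitation-learning initialization of the backbone and policy, using only the 30 to 60 available demonstrations.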
Related papers
- RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models [11.688277445120567]
Vision-Language-Action models (VLA) have demonstrated remarkable capabilities and promising potential in solving complex robotic manipulation tasks. Their substantial parameter sizes and high inference latency pose significant challenges for real-world deployment. We propose RLRC, a three-stage recovery method for compressed VLAs.
arXiv Detail & Related papers (2025-06-21T08:45:32Z)
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning [14.099306230721245]
We present VLA-RL, an exploration-based framework that improves the policy using online-collected data at test time. We fine-tune a pretrained vision-language model as a robotic process reward model, which is trained on pseudo reward labels annotated on automatically extracted task segments (a rough sketch of this labeling idea follows this entry). VLA-RL enables OpenVLA-7B to surpass the strongest fine-tuned baseline by 4.5% on 40 challenging robotic manipulation tasks in LIBERO.
arXiv Detail & Related papers (2025-05-24T14:42:51Z)
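The sketch below illustrates one way such a process reward model could be trained on pseudo labels. It is a hypothetical example: the RewardModel architecture, the embedding inputs, the end-of-segment labeling rule in pseudo_labels, and the loss in reward_model_loss are assumptions made for illustration, not details from the VLA-RL paper.

```python
# Hypothetical sketch: train a process reward model on pseudo labels derived
# from automatically extracted task segments. The architecture and labeling
# rule are illustrative assumptions only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores (observation embedding, instruction embedding) pairs in [0, 1]."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, obs_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(torch.cat([obs_emb, text_emb], dim=-1)))

def pseudo_labels(segment_len: int) -> torch.Tensor:
    # One simple labeling rule: frames near the end of an extracted task segment
    # are treated as "progress made" (label 1), earlier frames as 0.
    labels = torch.zeros(segment_len)
    labels[int(0.8 * segment_len):] = 1.0
    return labels

def reward_model_loss(model: RewardModel, obs_emb, text_emb, labels) -> torch.Tensor:
    pred = model(obs_emb, text_emb).squeeze(-1)
    return nn.functional.binary_cross_entropy(pred, labels)
```

A reward model of this kind would then score online rollouts, providing a dense reward signal for RL fine-tuning of the policy.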
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [100.226572152954]
We present an optimized fine-tuning recipe for vision-language-action models (VLAs). Our recipe boosts OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26x. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot.
arXiv Detail & Related papers (2025-02-27T00:30:29Z)
- ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy [10.596344084789434]
Vision-Language-Action (VLA) models have shown substantial potential in real-world robotic manipulation. Fine-tuning these models through supervised learning struggles to achieve robust performance due to limited, inconsistent demonstrations. We propose a reinforced fine-tuning approach for VLA models, named ConRFT, which consists of offline and online fine-tuning.
arXiv Detail & Related papers (2025-02-08T05:01:17Z)
- Improving Vision-Language-Action Model with Online Reinforcement Learning [17.043068379668842]
Recent studies have successfully integrated large vision-language models into low-level robotic control by supervised fine-tuning. We propose the iRe-VLA framework, which iterates between Reinforcement Learning and Supervised Learning to effectively improve VLA models.
arXiv Detail & Related papers (2025-01-28T02:53:48Z)
- Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
- A Critical Evaluation of AI Feedback for Aligning Large Language Models [60.42291111149438]
We show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines.
More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models.
arXiv Detail & Related papers (2024-02-19T18:53:54Z)
- Learn from the Past: A Proxy Guided Adversarial Defense Framework with Self Distillation Regularization [53.04697800214848]
Adversarial Training (AT) is pivotal in fortifying the robustness of deep learning models.
AT methods, relying on direct iterative updates for the target model's defense, frequently encounter obstacles such as unstable training and catastrophic overfitting.
We present a general proxy guided defense framework, LAST (Learn from the Past).
arXiv Detail & Related papers (2023-10-19T13:13:41Z)
- MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations [36.44386146801296]
Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms for real-world applications.
We find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL.
We empirically study three complex visuo-motor control domains and find that our method is 150%-250% more successful in completing sparse reward tasks.
arXiv Detail & Related papers (2022-12-12T04:28:50Z)
- Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective [142.36200080384145]
We propose a single objective that jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z)