CRL-VLA: Continual Vision-Language-Action Learning
- URL: http://arxiv.org/abs/2602.03445v1
- Date: Tue, 03 Feb 2026 12:09:53 GMT
- Title: CRL-VLA: Continual Vision-Language-Action Learning
- Authors: Qixin Zeng, Shuo Zhang, Hongyin Zhang, Renjie Wang, Han Zhao, Libang Zhao, Runze Li, Donglin Wang, Chao Huang
- Abstract summary: Continual Reinforcement Learning is a promising pathway for deploying VLA models in lifelong robotic scenarios. We introduce CRL-VLA, a framework for continual post-training of VLA models with rigorous theoretical bounds. We derive a unified performance bound linking the stability-plasticity trade-off to goal-conditioned advantage magnitude, scaled by policy divergence.
- Score: 40.18167835795084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lifelong learning is critical for embodied agents in open-world environments, where reinforcement learning fine-tuning has emerged as an important paradigm to enable Vision-Language-Action (VLA) models to master dexterous manipulation through environmental interaction. Thus, Continual Reinforcement Learning (CRL) is a promising pathway for deploying VLA models in lifelong robotic scenarios, yet balancing stability (retaining old skills) and plasticity (learning new ones) remains a formidable challenge for existing methods. We introduce CRL-VLA, a framework for continual post-training of VLA models with rigorous theoretical bounds. We derive a unified performance bound linking the stability-plasticity trade-off to goal-conditioned advantage magnitude, scaled by policy divergence. CRL-VLA resolves this dilemma via asymmetric regulation: constraining advantage magnitudes on prior tasks while enabling controlled growth on new tasks. This is realized through a simple but effective dual-critic architecture with novel Goal-Conditioned Value Formulation (GCVF), where a frozen critic anchors semantic consistency and a trainable estimator drives adaptation. Experiments on the LIBERO benchmark demonstrate that CRL-VLA effectively harmonizes these conflicting objectives, outperforming baselines in both anti-forgetting and forward adaptation.
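The abstract's asymmetric-regulation idea can be illustrated with a minimal sketch: a frozen critic anchors value estimates (and caps advantage magnitudes) on prior tasks, while a trainable critic drives adaptation on new tasks. This is not the authors' implementation; the linear critics, the hard advantage cap `prior_adv_cap`, and all names here are illustrative assumptions standing in for the paper's Goal-Conditioned Value Formulation.

```python
import numpy as np

class DualCriticSketch:
    """Illustrative dual-critic sketch (not the CRL-VLA code).

    Stability: advantages on prior tasks use the frozen critic as
    baseline and are clipped in magnitude. Plasticity: advantages on
    new tasks use the trainable critic and may grow unconstrained.
    """

    def __init__(self, dim, prior_adv_cap=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w_frozen = rng.normal(size=dim)   # anchored after prior tasks
        self.w_train = self.w_frozen.copy()    # adapts during new tasks
        self.prior_adv_cap = prior_adv_cap

    def value(self, feats, frozen):
        w = self.w_frozen if frozen else self.w_train
        return feats @ w

    def advantage(self, feats, ret, is_prior_task):
        if is_prior_task:
            # Stability: frozen baseline, advantage magnitude capped.
            adv = ret - self.value(feats, frozen=True)
            return np.clip(adv, -self.prior_adv_cap, self.prior_adv_cap)
        # Plasticity: trainable baseline, no cap on advantage growth.
        return ret - self.value(feats, frozen=False)

    def update_trainable(self, feats, ret, lr=1e-2):
        # One regression step of the trainable critic toward the return;
        # the frozen critic is never updated.
        err = ret - self.value(feats, frozen=False)
        self.w_train += lr * err * feats
```

In this toy form, only the trainable critic moves during new-task updates, so prior-task advantage estimates (and hence gradient magnitudes on old skills) stay bounded regardless of how far the policy drifts, which is the stability half of the trade-off the bound formalizes.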
Related papers
- Stabilizing Policy Optimization via Logits Convexity [59.242732612484474]
We show that the convexity of the supervised fine-tuning loss with respect to model logits plays a key role in enabling stable training. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework.
arXiv Detail & Related papers (2026-03-01T07:40:12Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it offers significant computational savings.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
arXiv Detail & Related papers (2025-12-01T07:45:39Z) - RobustVLA: Robustness-Aware Reinforcement Post-Training for Vision-Language-Action Models [33.503927352666096]
Vision-Language-Action (VLA) models fail to generalize reliably in out-of-distribution deployments. We introduce RobustVLA, a lightweight online RL post-training method designed to explicitly enhance the resilience of VLA models. Our results highlight the importance of robustness-aware RL post-training as a key step toward improving the principled reliability and robustness of VLA models.
arXiv Detail & Related papers (2025-11-03T08:30:48Z) - Lyapunov Stability Learning with Nonlinear Control via Inductive Biases [21.083462885546556]
Finding a control Lyapunov function (CLF) in a dynamical system with a controller is an effective way to guarantee stability. Recent deep learning models representing CLFs have been applied in a learner-verifier framework to identify satisfiable candidates. We improve this framework by treating Lyapunov conditions as inductive biases and design a neural CLF and a CLF-based controller guided by this knowledge.
arXiv Detail & Related papers (2025-11-03T06:57:37Z) - Human-in-the-loop Online Rejection Sampling for Robotic Manipulation [55.99788088622936]
Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning. Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training.
arXiv Detail & Related papers (2025-10-30T11:53:08Z) - Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models [33.214586668992965]
Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning. We propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning.
arXiv Detail & Related papers (2025-10-24T19:08:48Z) - Advancing Autonomous VLM Agents via Variational Subgoal-Conditioned Reinforcement Learning [38.68600863590734]
We propose a novel framework, Variational Subgoal-Conditioned Reinforcement Learning (VSC-RL). VSC-RL reformulates the decision-making problem as a variational subgoal-conditioned RL problem with a newly derived optimization objective, the Subgoal Evidence Lower Bound. We theoretically and empirically demonstrate that VSC-RL improves learning efficiency without compromising performance guarantees.
arXiv Detail & Related papers (2025-02-11T20:57:46Z) - Continual Task Learning through Adaptive Policy Self-Composition [54.95680427960524]
CompoFormer is a structure-based continual transformer model that adaptively composes previous policies via a meta-policy network.
Our experiments reveal that CompoFormer outperforms conventional continual learning (CL) methods, particularly in longer task sequences.
arXiv Detail & Related papers (2024-11-18T08:20:21Z) - Mitigating Distribution Shift in Model-based Offline RL via Shifts-aware Reward Learning [36.01269673940484]
This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. Our theoretical and empirical investigations reveal how these factors distort value estimation and policy optimization. We derive a novel shifts-aware reward through a unified probabilistic inference framework, which modifies the vanilla reward to refine value learning and facilitate policy training.
arXiv Detail & Related papers (2024-08-23T04:25:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.