CRL-VLA: Continual Vision-Language-Action Learning
- URL: http://arxiv.org/abs/2602.03445v1
- Date: Tue, 03 Feb 2026 12:09:53 GMT
- Title: CRL-VLA: Continual Vision-Language-Action Learning
- Authors: Qixin Zeng, Shuo Zhang, Hongyin Zhang, Renjie Wang, Han Zhao, Libang Zhao, Runze Li, Donglin Wang, Chao Huang
- Abstract summary: Continual Reinforcement Learning is a promising pathway for deploying VLA models in lifelong robotic scenarios. We introduce CRL-VLA, a framework for continual post-training of VLA models with rigorous theoretical bounds. We derive a unified performance bound linking the stability-plasticity trade-off to goal-conditioned advantage magnitude, scaled by policy divergence.
- Score: 40.18167835795084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lifelong learning is critical for embodied agents in open-world environments, where reinforcement learning fine-tuning has emerged as an important paradigm to enable Vision-Language-Action (VLA) models to master dexterous manipulation through environmental interaction. Thus, Continual Reinforcement Learning (CRL) is a promising pathway for deploying VLA models in lifelong robotic scenarios, yet balancing stability (retaining old skills) and plasticity (learning new ones) remains a formidable challenge for existing methods. We introduce CRL-VLA, a framework for continual post-training of VLA models with rigorous theoretical bounds. We derive a unified performance bound linking the stability-plasticity trade-off to goal-conditioned advantage magnitude, scaled by policy divergence. CRL-VLA resolves this dilemma via asymmetric regulation: constraining advantage magnitudes on prior tasks while enabling controlled growth on new tasks. This is realized through a simple but effective dual-critic architecture with novel Goal-Conditioned Value Formulation (GCVF), where a frozen critic anchors semantic consistency and a trainable estimator drives adaptation. Experiments on the LIBERO benchmark demonstrate that CRL-VLA effectively harmonizes these conflicting objectives, outperforming baselines in both anti-forgetting and forward adaptation.
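The abstract's asymmetric-regulation idea can be illustrated with a minimal sketch: a frozen critic anchors value estimates (and caps advantage magnitudes) on prior tasks, while a trainable critic drives adaptation on new tasks. This is not the authors' implementation; the linear critics, the hard advantage cap `prior_adv_cap`, and all names here are illustrative assumptions standing in for the paper's Goal-Conditioned Value Formulation.

```python
import numpy as np

class DualCriticSketch:
    """Illustrative dual-critic sketch (not the CRL-VLA code).

    Stability: advantages on prior tasks use the frozen critic as
    baseline and are clipped in magnitude. Plasticity: advantages on
    new tasks use the trainable critic and may grow unconstrained.
    """

    def __init__(self, dim, prior_adv_cap=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w_frozen = rng.normal(size=dim)   # anchored after prior tasks
        self.w_train = self.w_frozen.copy()    # adapts during new tasks
        self.prior_adv_cap = prior_adv_cap

    def value(self, feats, frozen):
        w = self.w_frozen if frozen else self.w_train
        return feats @ w

    def advantage(self, feats, ret, is_prior_task):
        if is_prior_task:
            # Stability: frozen baseline, advantage magnitude capped.
            adv = ret - self.value(feats, frozen=True)
            return np.clip(adv, -self.prior_adv_cap, self.prior_adv_cap)
        # Plasticity: trainable baseline, no cap on advantage growth.
        return ret - self.value(feats, frozen=False)

    def update_trainable(self, feats, ret, lr=1e-2):
        # One regression step of the trainable critic toward the return;
        # the frozen critic is never updated.
        err = ret - self.value(feats, frozen=False)
        self.w_train += lr * err * feats
```

In this toy form, only the trainable critic moves during new-task updates, so prior-task advantage estimates (and hence gradient magnitudes on old skills) stay bounded regardless of how far the policy drifts, which is the stability half of the trade-off the bound formalizes.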
Related papers
- Stabilizing Policy Optimization via Logits Convexity [59.242732612484474]
We show that the convexity of the supervised fine-tuning loss with respect to model logits plays a key role in enabling stable training. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework.
arXiv Detail & Related papers (2026-03-01T07:40:12Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it offers significant computational savings.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
arXiv Detail & Related papers (2025-12-01T07:45:39Z) - RobustVLA: Robustness-Aware Reinforcement Post-Training for Vision-Language-Action Models [33.503927352666096]
Vision-Language-Action (VLA) models fail to generalize reliably in out-of-distribution deployments. We introduce RobustVLA, a lightweight online RL post-training method designed to explicitly enhance the resilience of VLA models. Our results highlight the importance of robustness-aware RL post-training as a key step toward improving the principled reliability and robustness of VLA models.
arXiv Detail & Related papers (2025-11-03T08:30:48Z) - Lyapunov Stability Learning with Nonlinear Control via Inductive Biases [21.083462885546556]
Finding a control Lyapunov function (CLF) in a dynamical system with a controller is an effective way to guarantee stability. Recent deep learning models representing CLFs have been applied in a learner-verifier framework to identify satisfiable candidates. We improve this framework by treating Lyapunov conditions as inductive biases and design a neural CLF and a CLF-based controller guided by this knowledge.
arXiv Detail & Related papers (2025-11-03T06:57:37Z) - Human-in-the-loop Online Rejection Sampling for Robotic Manipulation [55.99788088622936]
Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning. Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training.
arXiv Detail & Related papers (2025-10-30T11:53:08Z) - Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models [33.214586668992965]
Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning. We propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning.
arXiv Detail & Related papers (2025-10-24T19:08:48Z) - Advancing Autonomous VLM Agents via Variational Subgoal-Conditioned Reinforcement Learning [38.68600863590734]
We propose a novel framework, Variational Subgoal-Conditioned Reinforcement Learning (VSC-RL). VSC-RL reformulates the decision-making problem as a variational subgoal-conditioned RL problem with a newly derived optimization objective, the Subgoal Evidence Lower Bound. We theoretically and empirically demonstrate that VSC-RL improves learning efficiency without compromising performance guarantees.
arXiv Detail & Related papers (2025-02-11T20:57:46Z) - Continual Task Learning through Adaptive Policy Self-Composition [54.95680427960524]
CompoFormer is a structure-based continual transformer model that adaptively composes previous policies via a meta-policy network.
Our experiments reveal that CompoFormer outperforms conventional continual learning (CL) methods, particularly in longer task sequences.
arXiv Detail & Related papers (2024-11-18T08:20:21Z) - Mitigating Distribution Shift in Model-based Offline RL via Shifts-aware Reward Learning [36.01269673940484]
This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. Our theoretical and empirical investigations reveal how these factors distort value estimation and policy optimization. We derive a novel shifts-aware reward through a unified probabilistic inference framework, which modifies the vanilla reward to refine value learning and facilitate policy training.
arXiv Detail & Related papers (2024-08-23T04:25:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.