Continually Evolving Skill Knowledge in Vision Language Action Model
- URL: http://arxiv.org/abs/2511.18085v2
- Date: Tue, 25 Nov 2025 02:25:13 GMT
- Title: Continually Evolving Skill Knowledge in Vision Language Action Model
- Authors: Yuxuan Wu, Guangming Wang, Zhiheng Yang, Maoqing Yao, Brian Sheil, Hesheng Wang
- Abstract summary: Development of general robot intelligence in open environments requires continual skill learning. We propose Stellar VLA, a knowledge-driven continual learning framework with two variants: T-Stellar, modeling a task-centric knowledge space, and TS-Stellar, capturing hierarchical task-skill structure. Experiments on the LIBERO benchmark and real-world tasks show over 50% average improvement in final success rates relative to baselines.
- Score: 23.63528439700931
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing general robot intelligence in open environments requires continual skill learning. Recent Vision-Language-Action (VLA) models leverage massive pretraining data to support diverse manipulation tasks, but they still depend heavily on task-specific fine-tuning, revealing a lack of continual learning capability. Existing continual learning methods are also resource-intensive to scale to VLA models. We propose Stellar VLA, a knowledge-driven continual learning framework with two variants: T-Stellar, modeling a task-centric knowledge space, and TS-Stellar, capturing hierarchical task-skill structure. Stellar VLA enables self-supervised knowledge evolution through joint learning of the task latent representation and the knowledge space, reducing annotation needs. Knowledge-guided expert routing provides task specialization without extra network parameters, lowering training overhead. Experiments on the LIBERO benchmark and real-world tasks show over 50% average improvement in final success rates relative to baselines. TS-Stellar further excels in complex action inference, and in-depth analyses verify effective knowledge retention and discovery. Our code will be released soon.
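As a rough illustration of the routing idea described in the abstract (selecting task-specialized behavior from a learned knowledge space without adding a routing network), the minimal sketch below routes a task latent to experts by prototype similarity. All names (PrototypeRouter, latent_dim, n_experts) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): knowledge-guided expert routing as
# nearest-prototype lookup in a learned task-latent space.
import torch
import torch.nn.functional as F

class PrototypeRouter(torch.nn.Module):
    def __init__(self, latent_dim: int, n_experts: int):
        super().__init__()
        # One prototype per expert; together they stand in for the "knowledge space".
        self.prototypes = torch.nn.Parameter(torch.randn(n_experts, latent_dim))

    def forward(self, task_latent: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between each task latent and every prototype.
        sims = F.cosine_similarity(
            task_latent.unsqueeze(1), self.prototypes.unsqueeze(0), dim=-1
        )
        # Soft routing weights over experts; no separate routing network is added.
        return F.softmax(sims, dim=-1)

router = PrototypeRouter(latent_dim=32, n_experts=4)
weights = router(torch.randn(2, 32))   # shape (batch, n_experts)
print(weights.sum(dim=-1))             # each row sums to 1
```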
Related papers
- Dynamic Routing Between Experts: A Data-Efficient Approach to Continual Learning in Vision-Language Models [10.431923437214719]
Vision-Language Models (VLMs) suffer from catastrophic forgetting when sequentially fine-tuned on new tasks. We introduce a routing-based approach that enables the integration of new tasks while preserving the foundational knowledge acquired during pretraining.
arXiv Detail & Related papers (2025-11-03T18:39:32Z) - Learning Affordances at Inference-Time for Vision-Language-Action Models [50.93181349331096]
In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks. We introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution.
arXiv Detail & Related papers (2025-10-22T16:43:29Z) - VLM-Assisted Continual learning for Visual Question Answering in Self-Driving [26.413685340816436]
We propose a novel approach for solving the Visual Question Answering (VQA) task in autonomous driving. In autonomous driving, VQA plays a vital role in enabling the system to understand and reason about its surroundings. We present a novel continual learning framework that combines Vision-Language Models with selective memory replay and knowledge distillation.
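As a hedged sketch of how selective memory replay and knowledge distillation are commonly combined (the paper's exact objective is not given in the abstract), the snippet below mixes a supervised loss on current-task data with a KL distillation term on replayed samples scored by a frozen copy of the earlier model; all function and argument names are assumptions.

```python
# Illustrative continual-learning loss: new-task supervision + distillation on replay.
import torch
import torch.nn.functional as F

def continual_loss(student_logits_new, labels_new,
                   student_logits_replay, teacher_logits_replay,
                   temperature: float = 2.0, alpha: float = 0.5):
    # Supervised loss on the current task's batch.
    task_loss = F.cross_entropy(student_logits_new, labels_new)
    # Distillation: match the frozen old model's soft targets on replayed samples.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits_replay / temperature, dim=-1),
        F.softmax(teacher_logits_replay / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * task_loss + (1 - alpha) * kd_loss

loss = continual_loss(torch.randn(8, 10), torch.randint(0, 10, (8,)),
                      torch.randn(8, 10), torch.randn(8, 10))
print(loss.item())
```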
arXiv Detail & Related papers (2025-02-02T16:27:44Z) - KaLM: Knowledge-aligned Autoregressive Language Modeling via Dual-view Knowledge Graph Contrastive Learning [74.21524111840652]
This paper proposes KaLM, a Knowledge-aligned Language Modeling approach. It fine-tunes autoregressive large language models to align with KG knowledge via the joint objective of explicit knowledge alignment and implicit knowledge alignment. Notably, our method achieves a significant performance boost in evaluations of knowledge-driven tasks.
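One plausible reading of the explicit knowledge-alignment objective is a symmetric contrastive loss between language-model embeddings of entity descriptions and knowledge-graph triple embeddings. The sketch below is illustrative only; the dimensions, names, and temperature are assumptions rather than the paper's formulation.

```python
# Symmetric InfoNCE-style alignment between two embedding views (text vs. KG).
import torch
import torch.nn.functional as F

def dual_view_contrastive(text_emb, kg_emb, temperature: float = 0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    kg_emb = F.normalize(kg_emb, dim=-1)
    logits = text_emb @ kg_emb.t() / temperature      # (N, N) similarity matrix
    targets = torch.arange(text_emb.size(0))          # matching pairs on the diagonal
    # Symmetric loss: text->KG and KG->text directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

print(dual_view_contrastive(torch.randn(16, 64), torch.randn(16, 64)))
```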
arXiv Detail & Related papers (2024-12-06T11:08:24Z) - Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in context, both zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - Composite Learning Units: Generalized Learning Beyond Parameter Updates to Transform LLMs into Adaptive Reasoners [0.0]
We introduce Composite Learning Units (CLUs) designed to transform reasoners into learners capable of continuous learning.
CLUs are built on an architecture that allows a reasoning model to maintain and evolve a dynamic knowledge repository.
We demonstrate CLUs' effectiveness through a cryptographic reasoning task, where they continuously evolve their understanding through feedback to uncover hidden transformation rules.
arXiv Detail & Related papers (2024-10-09T02:27:58Z) - KIF: Knowledge Identification and Fusion for Language Model Continual Learning [41.28933724210434]
We introduce a novel framework for language models, named Knowledge Identification and Fusion (KIF). KIF segregates the model into 'skill units' based on parameter dependencies, allowing for more precise control. It employs a novel group-wise knowledge identification technique to ascertain the importance distribution of skill units for a new task. As a result, KIF achieves an optimal balance between retaining prior knowledge and excelling in new tasks.
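As a hedged sketch of group-wise knowledge identification (one common way to score parameter groups, not necessarily KIF's exact technique), the snippet below ranks each parameter group of a toy model by its mean squared gradient on a new-task batch; the model and grouping are illustrative.

```python
# Toy example: score parameter groups ("skill units") by gradient-based importance.
import torch

model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2))
x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()

importance = {
    name: param.grad.pow(2).mean().item()   # mean squared gradient per parameter group
    for name, param in model.named_parameters()
}
# Groups with high importance would be adapted for the new task; the rest kept frozen.
print(sorted(importance, key=importance.get, reverse=True))
```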
arXiv Detail & Related papers (2024-08-09T17:44:45Z) - Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models [79.28821338925947]
Domain-Class Incremental Learning is a realistic but challenging continual learning scenario.
To handle these diverse tasks, pre-trained Vision-Language Models (VLMs) are introduced for their strong generalizability.
This incurs a new problem: the knowledge encoded in the pre-trained VLMs may be disturbed when adapting to new tasks, compromising their inherent zero-shot ability.
Existing methods tackle it by tuning VLMs with knowledge distillation on extra datasets, which demands heavy overhead.
We propose the Distribution-aware Interference-free Knowledge Integration (DIKI) framework, retaining the pre-trained knowledge of VLMs.
arXiv Detail & Related papers (2024-07-07T12:19:37Z) - Task-Attentive Transformer Architecture for Continual Learning of Vision-and-Language Tasks Using Knowledge Distillation [18.345183818638475]
Continual learning (CL) can serve as a remedy by enabling knowledge transfer across sequentially arriving tasks.
We develop a transformer-based CL architecture for learning bimodal vision-and-language tasks.
Our approach scales to a large number of tasks because it incurs little memory and time overhead.
arXiv Detail & Related papers (2023-03-25T10:16:53Z) - LEAGUE: Guided Skill Learning and Abstraction for Long-Horizon Manipulation [16.05029027561921]
Task and Motion Planning approaches excel at solving and generalizing across long-horizon tasks.
However, they assume predefined skill sets, which limits their real-world applications.
We propose an integrated task planning and skill learning framework named LEAGUE.
We show that the learned skills can be reused to accelerate learning in new task domains and transfer to a physical robot platform.
arXiv Detail & Related papers (2022-10-23T06:57:05Z) - Learning and Retrieval from Prior Data for Skill-based Imitation Learning [47.59794569496233]
We develop a skill-based imitation learning framework that extracts temporally extended sensorimotor skills from prior data.
We identify several key design choices that significantly improve performance on novel tasks.
arXiv Detail & Related papers (2022-10-20T17:34:59Z) - Hierarchical Skills for Efficient Exploration [70.62309286348057]
In reinforcement learning, pre-trained low-level skills have the potential to greatly facilitate exploration.
Prior knowledge of the downstream task is required to strike the right balance between generality (fine-grained control) and specificity (faster learning) in skill design.
We propose a hierarchical skill learning framework that acquires skills of varying complexity in an unsupervised manner.
arXiv Detail & Related papers (2021-10-20T22:29:32Z) - Bilevel Continual Learning [76.50127663309604]
We present a novel continual learning framework named "Bilevel Continual Learning" (BCL).
Our experiments on continual learning benchmarks demonstrate the efficacy of the proposed BCL compared to many state-of-the-art methods.
arXiv Detail & Related papers (2020-07-30T16:00:23Z) - Automated Relational Meta-learning [95.02216511235191]
We propose an automated relational meta-learning framework that automatically extracts the cross-task relations and constructs the meta-knowledge graph.
We conduct extensive experiments on 2D toy regression and few-shot image classification, and the results demonstrate the superiority of ARML over state-of-the-art baselines.
arXiv Detail & Related papers (2020-01-03T07:02:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.