RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models
- URL: http://arxiv.org/abs/2508.02062v1
- Date: Mon, 04 Aug 2025 05:01:11 GMT
- Title: RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models
- Authors: Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, Insup Lee
- Abstract summary: Multi-task ``vision-language-action'' (VLA) models have recently demonstrated increasing promise as generalist foundation models for robotics. For such models to be truly useful, an end user must have easy means to teach them to improve. For language and vision models, the emergent ability to perform in-context learning (ICL) has proven to be a versatile interface to easily teach new tasks.
- Score: 20.826907313227323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-task ``vision-language-action'' (VLA) models have recently demonstrated increasing promise as generalist foundation models for robotics, achieving non-trivial performance out of the box on new tasks in new environments. However, for such models to be truly useful, an end user must have easy means to teach them to improve. For language and vision models, the emergent ability to perform in-context learning (ICL) has proven to be a versatile and highly useful interface to easily teach new tasks with no parameter finetuning. Unfortunately, VLAs pre-trained with imitation learning objectives do not naturally acquire ICL abilities. In this paper, we demonstrate that, with the right finetuning recipe and a small robot demonstration dataset, it is possible to inject in-context adaptability post hoc into such a VLA. After retraining for in-context learning (RICL), our system permits an end user to provide a small number (10-20) of demonstrations for a new task. RICL then fetches the most relevant portions of those demonstrations into the VLA context to exploit ICL, performing the new task and boosting task performance. We apply RICL to inject ICL into the $\pi_{0}$-FAST VLA, and show that it permits large in-context improvements for a variety of new manipulation tasks with only 20 demonstrations per task, without any parameter updates. When parameter updates on the target task demonstrations are possible, RICL finetuning further boosts performance. We release code and model weights for RICL-$\pi_{0}$-FAST alongside the paper to enable, for the first time, a simple in-context learning interface for new manipulation tasks. Website: https://ricl-vla.github.io.
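The retrieval step described in the abstract (fetching the most relevant portions of the user's 10-20 demonstrations into the VLA context before acting) can be illustrated with a minimal sketch. The embedding function, the nearest-neighbor distance, the context budget `k`, and the `vla.predict` call below are illustrative assumptions, not the released RICL-$\pi_{0}$-FAST implementation:

```python
import numpy as np

def embed_observation(obs):
    """Illustrative placeholder: map an observation (e.g. image features plus
    proprioception) to a fixed-size vector. RICL's actual encoder is not assumed here."""
    return np.asarray(obs, dtype=np.float32).ravel()

def retrieve_context(query_obs, demos, k=5):
    """Return the k demonstration snippets whose observations are closest to the
    current observation; these are prepended to the VLA context for in-context learning.

    demos: list of (observation, action) pairs drawn from the new-task demonstrations.
    """
    q = embed_observation(query_obs)
    dists = [np.linalg.norm(embed_observation(o) - q) for o, _ in demos]
    nearest = np.argsort(dists)[:k]
    return [demos[i] for i in nearest]

# Hypothetical usage at each control step: build the prompt from the retrieved
# snippets plus the current observation, then query the frozen VLA for an action.
# context = retrieve_context(current_obs, new_task_demos, k=5)
# action = vla.predict(context=context, observation=current_obs, instruction=task_text)
```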
Related papers
- CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion [9.808005698482914]
CLARE is a framework for exemplar-free continual learning with vision-language-action models. We show that CLARE achieves high performance on new tasks without catastrophic forgetting of earlier tasks.
arXiv Detail & Related papers (2026-01-14T14:23:42Z) - Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning? [7.827653846113951]
Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks.
We propose a new benchmark we call Spatial Visual Ambiguity Tasks (SVAT) that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context.
arXiv Detail & Related papers (2024-09-25T16:45:02Z) - OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z) - How does Multi-Task Training Affect Transformer In-Context Capabilities? Investigations with Function Classes [6.652837942112205]
Large language models (LLM) have recently shown the extraordinary ability to perform unseen tasks based on few-shot examples provided as text.
We propose several effective curriculum learning strategies that allow ICL models to achieve higher data efficiency and more stable convergence.
Our experiments reveal that ICL models can effectively learn difficult tasks by training on progressively harder tasks while mixing in prior tasks, denoted as mixed curriculum in this work.
arXiv Detail & Related papers (2024-04-04T16:15:23Z) - Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
arXiv Detail & Related papers (2024-03-18T08:00:23Z) - MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models [74.89629463600978]
In vision-language domain, most large-scale pre-trained vision-language models do not possess the ability to conduct in-context learning.
In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the vision domain?
arXiv Detail & Related papers (2023-06-02T07:21:03Z) - Learning without Forgetting for Vision-Language Models [86.53237963364754]
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world. Recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations. We propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting.
arXiv Detail & Related papers (2023-05-30T17:59:32Z) - Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z) - Few-Shot Class-Incremental Learning by Sampling Multi-Phase Tasks [59.12108527904171]
A model should recognize new classes and maintain discriminability over old classes.
The task of recognizing few-shot new classes without forgetting old classes is called few-shot class-incremental learning (FSCIL).
We propose a new paradigm for FSCIL based on meta-learning by LearnIng Multi-phase Incremental Tasks (LIMIT)
arXiv Detail & Related papers (2022-03-31T13:46:41Z) - HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks [38.43269863509866]
How to perform parameter-efficient fine-tuning has become fairly important for quick transfer learning and deployment.
We design a novel unified parameter-efficient transfer learning framework that works effectively on both pure language and V&L tasks.
Our proposed framework adds fewer trainable parameters in multi-task learning while achieving superior performances and transfer ability compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-03-08T06:51:33Z) - Learning Adaptable Policy via Meta-Adversarial Inverse Reinforcement Learning for Decision-making Tasks [2.1485350418225244]
We build an adaptable imitation learning model based on the integration of Meta-learning and Adversarial Inverse Reinforcement Learning.
We exploit the adversarial learning and inverse reinforcement learning mechanisms to learn policies and reward functions simultaneously from available training tasks.
arXiv Detail & Related papers (2021-03-23T17:16:38Z)