How Do VLAs Effectively Inherit from VLMs?
- URL: http://arxiv.org/abs/2511.06619v1
- Date: Mon, 10 Nov 2025 01:58:02 GMT
- Title: How Do VLAs Effectively Inherit from VLMs?
- Authors: Chuheng Zhang, Rushuai Yang, Xiaoyu Chen, Kaixin Wang, Li Zhao, Yi Chen, Jiang Bian
- Abstract summary: Vision-language-action (VLA) models hold the promise to attain generalizable embodied control. We introduce a diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task. We investigate the effects of parameter-efficient fine-tuning, VLM freezing, co-training, predicting discretized actions, and predicting latent actions.
- Score: 28.72002932514493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language-action (VLA) models hold the promise to attain generalizable embodied control. To achieve this, a pervasive paradigm is to leverage the rich vision-semantic priors of large vision-language models (VLMs). However, the fundamental question persists: How do VLAs effectively inherit the prior knowledge from VLMs? To address this critical question, we introduce a diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task where the robot arm is asked to place objects onto printed emojis corresponding to language instructions. This task design is particularly revealing -- knowledge associated with emojis is ubiquitous in Internet-scale datasets used for VLM pre-training, yet emojis themselves are largely absent from standard robotics datasets. Consequently, they provide a clean proxy: successful task completion indicates effective transfer of VLM priors to embodied control. We implement this diagnostic task both in a simulated environment and on a real robot, and compare various promising techniques for knowledge transfer. Specifically, we investigate the effects of parameter-efficient fine-tuning, VLM freezing, co-training, predicting discretized actions, and predicting latent actions. Through systematic evaluation, our work not only demonstrates the critical importance of preserving VLM priors for the generalization of VLAs but also establishes guidelines for future research in developing truly generalizable embodied AI systems.
Related papers
- VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models [43.09726338623949]
Vision-Language-Action (VLA) models integrate pretrained large Vision-Language Models (VLMs) into their policy backbone. This paper revisits a fundamental yet seldom systematically studied question: how VLM choice and competence translate to downstream VLA policy performance. We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters.
arXiv Detail & Related papers (2026-01-06T09:58:24Z)
- Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z)
- From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models [5.660635614478238]
Vision-Language-Action (VLA) models promise to produce versatile, "generalist" robot policies. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. We introduce a unified suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects.
arXiv Detail & Related papers (2025-06-11T16:52:18Z)
- ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge [14.143521529613533]
Vision-language-action (VLA) models have emerged as the next generation of models in robotics. Existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. We argue that a generalizable VLA model should retain and expand upon the VLM's core competencies.
arXiv Detail & Related papers (2025-05-28T02:48:42Z)
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
- HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [54.03004125910057]
We show that hierarchical vision-language-action models can be more effective in utilizing off-domain data than standard monolithic VLA models. We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios.
arXiv Detail & Related papers (2025-02-08T07:50:22Z)
- Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
- Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques [6.783762650831429]
We review the fundamental theories related to visual language models (VLMs) and the datasets constructed for them in remote sensing. We categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction and comparison of these methods.
arXiv Detail & Related papers (2024-10-15T13:28:55Z)
- Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
A key limitation of learned robot control policies is their inability to generalize outside their training data. Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models can substantially improve their robustness and generalization ability. We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action.
arXiv Detail & Related papers (2024-07-11T17:31:01Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.