VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
- URL: http://arxiv.org/abs/2601.03309v1
- Date: Tue, 06 Jan 2026 09:58:24 GMT
- Title: VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
- Authors: Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen,
- Abstract summary: Vision-Language-Action (VLA) models integrate pretrained large Vision-Language Models (VLMs) into their policy backbone. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters.
- Score: 43.09726338623949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters, enabling fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module of the VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action planning.
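The abstract does not spell out the adaptation recipe, but the general idea of converting a pretrained VLM into a VLA policy with only a small set of new learnable parameters can be sketched as follows. This is a minimal illustration under assumptions of this summary, not the authors' implementation: the backbone is a placeholder for a real pretrained VLM, the action head and its dimensions are hypothetical, and whether the VLM itself is also fine-tuned is a design choice the paper's abstract does not pin down.

```python
# Hypothetical sketch: attach a small action head to a pretrained VLM's final hidden
# states and train only the new parameters. PlaceholderVLMBackbone stands in for a real
# pretrained VLM checkpoint; dimensions and the training step are illustrative only.
import torch
import torch.nn as nn

class PlaceholderVLMBackbone(nn.Module):
    """Stand-in for a pretrained VLM; a real pipeline would load an actual checkpoint."""
    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, vision_lang_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, seq, hidden) -> (batch, seq, hidden)
        return self.encoder(vision_lang_tokens)

class ActionHead(nn.Module):
    """The small set of new learnable parameters: VLM features -> continuous action chunk."""
    def __init__(self, hidden_dim: int, action_dim: int = 7, chunk: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.GELU(),
            nn.Linear(hidden_dim // 2, action_dim * chunk),
        )
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        pooled = features[:, -1]  # last-token features as the policy summary (assumption)
        return self.mlp(pooled).view(-1, self.chunk, self.action_dim)

backbone = PlaceholderVLMBackbone()
head = ActionHead(backbone.hidden_dim)

# In this sketch the backbone is frozen and only the new head is optimized; the paper
# only specifies that few *new* parameters are added, not how the VLM itself is trained.
for p in backbone.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

tokens = torch.randn(2, 64, backbone.hidden_dim)   # fused vision+language tokens (dummy)
target_actions = torch.randn(2, 8, 7)              # demonstration action chunks (dummy)
pred = head(backbone(tokens))
loss = nn.functional.mse_loss(pred, target_actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```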
Related papers
- Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision [79.06371915084833]
We introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm. Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. We extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions.
arXiv Detail & Related papers (2026-01-27T17:01:16Z) - MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization [30.871663465403625]
We present MAPS, the first robust fine-tuning framework for Vision-Language-Action (VLA) models. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely.
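A minimal sketch of module-wise proximity regularization with a linearly scheduled relaxation, in the spirit of the summary above; the module grouping, coefficients, and schedule below are illustrative assumptions, not the MAPS implementation.

```python
# Illustrative sketch (not MAPS itself): penalize each module's L2 distance to its
# pretrained weights, and relax that penalty linearly over training at a module-specific
# rate, so e.g. the vision encoder stays closer to its prior than the language layers.
import torch
import torch.nn as nn

def proximity_loss(model: nn.Module, pretrained_state: dict, coeffs: dict) -> torch.Tensor:
    """Sum of per-parameter squared distances to the pretrained weights, weighted per module."""
    loss = torch.zeros(())
    for name, param in model.named_parameters():
        group = "vision" if name.startswith("vision") else "language"
        loss = loss + coeffs[group] * (param - pretrained_state[name]).pow(2).sum()
    return loss

def scheduled_coeff(start: float, end: float, step: int, total_steps: int) -> float:
    """Linear relaxation from a strong constraint (start) to a weaker one (end)."""
    t = min(step / max(total_steps, 1), 1.0)
    return start + t * (end - start)

# Toy model with a "vision" and a "language" part (assumed grouping).
model = nn.ModuleDict({"vision": nn.Linear(16, 16), "language": nn.Linear(16, 16)})
pretrained_state = {k: v.detach().clone() for k, v in model.named_parameters()}
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

total_steps = 100
for step in range(total_steps):
    coeffs = {
        # Assumption: vision stays tightly constrained, language relaxes much further.
        "vision": scheduled_coeff(1e-2, 1e-3, step, total_steps),
        "language": scheduled_coeff(1e-2, 1e-5, step, total_steps),
    }
    x = torch.randn(4, 16)
    task_loss = model["language"](model["vision"](x)).pow(2).mean()  # dummy task objective
    loss = task_loss + proximity_loss(model, pretrained_state, coeffs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```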
arXiv Detail & Related papers (2025-11-25T03:39:37Z) - Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning [124.48672228625821]
We introduce Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability. Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks. Our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
arXiv Detail & Related papers (2025-10-13T05:51:22Z) - When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs [4.296395082987112]
Large Vision-Language Models (L-VLMs) have demonstrated remarkable performance in various vision and language tasks. Small Vision-Language Models (S-VLMs) offer efficiency but suffer from a significant performance gap compared to their larger counterparts. We introduce the Model Parity Aligner (MPA), a novel framework designed to systematically improve S-VLMs.
arXiv Detail & Related papers (2025-09-20T11:12:23Z) - SpotVLM: Cloud-edge Collaborative Real-time VLM based on Context Transfer [14.669949808424409]
Vision-Language Models (VLMs) are increasingly deployed in real-time applications such as autonomous driving and human-computer interaction. Existing systems commonly employ partitioned Large Vision-Language Models (LVLMs) or task offloading strategies. We propose a novel cloud-edge collaborative paradigm for VLMs, termed Context Transfer, which treats the delayed outputs of LVLMs as historical context.
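A rough sketch of the context-transfer idea as summarized above, assuming a fast edge model that answers every frame while conditioning on the most recent (possibly stale) output of a cloud LVLM; cloud_lvlm and edge_model below are stand-ins, not SpotVLM components.

```python
# Hedged sketch: the edge model answers each frame immediately, reusing the latest
# delayed cloud LVLM output as historical context. The cloud refresh is shown
# synchronously for brevity; a real system would overlap it with edge inference.
import time

def cloud_lvlm(frame: str) -> str:
    time.sleep(0.2)  # simulated network + large-model latency
    return f"detailed scene description of {frame}"

def edge_model(frame: str, context: str) -> str:
    return f"fast answer for {frame} (context: {context})"

context = "no context yet"
for i in range(5):
    frame = f"frame_{i}"
    if i % 3 == 0:  # assumed refresh interval for the cloud context
        context = cloud_lvlm(frame)
    print(edge_model(frame, context))
```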
arXiv Detail & Related papers (2025-08-18T05:51:41Z) - Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z) - VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models [84.84277196012907]
We propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V.
arXiv Detail & Related papers (2024-12-02T18:58:25Z) - GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models [44.82179903133343]
GLOV enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs). GLOV yields performance improvements of up to 15.0% and 57.5% for dual-encoder (e.g., CLIP) and VL-decoder (e.g., LLaVA) models on object recognition.
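A hedged sketch of the LLM-as-implicit-optimizer loop suggested by the summary: an LLM proposes prompt candidates, each candidate is scored with the target VLM on a small labeled set, and the best (prompt, score) pairs are fed back to guide the next proposal round. The propose_prompts and score_prompt callables below are stubs, not GLOV's actual components.

```python
# Illustrative loop only: a real system would query an LLM for candidates and evaluate
# each prompt with a dual-encoder (CLIP-style) or VL-decoder model on held-out data.
import random
from typing import Callable, List, Tuple

def optimize_prompts(
    propose_prompts: Callable[[List[Tuple[str, float]]], List[str]],
    score_prompt: Callable[[str], float],
    rounds: int = 5,
    keep_top: int = 3,
) -> List[Tuple[str, float]]:
    """Iteratively propose prompts, score them, and keep the best as guidance for the next round."""
    history: List[Tuple[str, float]] = []
    for _ in range(rounds):
        for prompt in propose_prompts(history):
            history.append((prompt, score_prompt(prompt)))
        history = sorted(history, key=lambda p: p[1], reverse=True)[:keep_top]
    return history

# Stubs so the sketch runs end to end.
def fake_llm(history: List[Tuple[str, float]]) -> List[str]:
    return [f"a photo of a {w}" for w in random.sample(["cat", "tabby cat", "small cat"], 2)]

def fake_scorer(prompt: str) -> float:
    return len(prompt) / 20.0 + random.random() * 0.1  # dummy accuracy proxy

print(optimize_prompts(fake_llm, fake_scorer))
```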
arXiv Detail & Related papers (2024-10-08T15:55:40Z) - PEVL: Position-enhanced Pre-training and Prompt Tuning for
Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
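A small illustration of discretizing object positions into language-model tokens, in the spirit of the PEVL summary; the bin count, token naming, and caption format are assumptions for illustration, not PEVL's actual scheme.

```python
# Sketch: normalized bounding-box coordinates are quantized into a fixed vocabulary of
# position tokens that can be interleaved with ordinary text in a unified LM framework.
from typing import List, Tuple

NUM_BINS = 512  # assumed resolution of the position vocabulary

def box_to_tokens(box: Tuple[float, float, float, float]) -> List[str]:
    """Map a normalized (x1, y1, x2, y2) box in [0, 1] to discrete position tokens."""
    return [f"<pos_{min(int(c * NUM_BINS), NUM_BINS - 1)}>" for c in box]

def tokens_to_box(tokens: List[str]) -> Tuple[float, ...]:
    """Recover approximate normalized coordinates (bin centers) from position tokens."""
    return tuple((int(t.strip("<>").split("_")[1]) + 0.5) / NUM_BINS for t in tokens)

caption = "the dog " + " ".join(box_to_tokens((0.12, 0.30, 0.55, 0.88))) + " chasing a ball"
print(caption)  # position tokens interleaved with ordinary text
print(tokens_to_box(["<pos_61>", "<pos_153>", "<pos_281>", "<pos_450>"]))
```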
arXiv Detail & Related papers (2022-05-23T10:17:53Z)