VLANeXt: Recipes for Building Strong VLA Models
- URL: http://arxiv.org/abs/2602.18532v1
- Date: Fri, 20 Feb 2026 09:26:17 GMT
- Title: VLANeXt: Recipes for Building Strong VLA Models
- Authors: Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy
- Abstract summary: Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding for policy learning. Many groups have proposed their own VLA models, but inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. We will release a unified, easy-to-use framework that serves as a common platform for the community to reproduce our findings.
- Score: 95.4552662536287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. VLANeXt outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong generalization in real-world experiments. We will release a unified, easy-to-use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.
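To ground the abstract's starting point, here is a minimal sketch of an RT-2/OpenVLA-style baseline of the kind the paper builds on: a vision encoder feeds image tokens into a language-model backbone, and continuous actions are discretized into bins so the policy can be trained with an ordinary next-token objective. All module names and sizes below are illustrative assumptions, not the released VLANeXt code.

```python
import torch
import torch.nn as nn

class MinimalVLA(nn.Module):
    """Illustrative RT-2/OpenVLA-style baseline: image tokens and text tokens
    share one transformer; actions are predicted as discrete bin tokens."""

    def __init__(self, vocab_size=32_000, n_bins=256, action_dim=7, d_model=512):
        super().__init__()
        # Vision-encoder stand-in: patchify and project to the LM width.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # One classification head over action bins, shared across the
        # discretized action dimensions.
        self.action_head = nn.Linear(d_model, n_bins)
        self.action_dim = action_dim

    def forward(self, image, text_ids):
        img_tok = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, P, D)
        txt_tok = self.text_embed(text_ids)                           # (B, T, D)
        h = self.backbone(torch.cat([img_tok, txt_tok], dim=1))
        # Read action logits off the last `action_dim` positions (a
        # simplifying assumption; real models reserve dedicated slots).
        return self.action_head(h[:, -self.action_dim:, :])          # (B, A, n_bins)

model = MinimalVLA()
image = torch.randn(2, 3, 224, 224)
text_ids = torch.randint(0, 32_000, (2, 16))
target_bins = torch.randint(0, 256, (2, 7))  # actions discretized into 256 bins
logits = model(image, text_ids)
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), target_bins.reshape(-1))
```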
Related papers
- SimVLA: A Simple VLA Baseline for Robotic Manipulation [46.38114519538192]
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation. We introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research.
arXiv Detail & Related papers (2026-02-20T14:04:27Z)
- VLA-R1: Enhancing Reasoning in Vision-Language-Action Models [35.264042764326895]
Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation. Current VLA models often lack explicit step-by-step reasoning. We present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards.
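The abstract does not specify VLA-R1's reward design, but the general RLVR recipe it names can be sketched: sample several rollouts per instruction, score each with a programmatic check, and weight the policy-gradient update by group-relative advantage. The reward and tolerance below are hypothetical.

```python
import torch

def verifiable_reward(pred_traj, goal_pose, tol=0.05):
    """Illustrative verifiable reward: 1.0 if the final end-effector pose of a
    sampled trajectory lands within `tol` of the goal, else 0.0. VLA-R1's
    actual reward functions are an assumption here."""
    return (torch.norm(pred_traj[-1] - goal_pose) < tol).float()

def rlvr_advantages(rewards):
    """Group-relative advantages (GRPO-style, unnormalized): compare each
    sample against the mean reward of the group for the same prompt."""
    return rewards - rewards.mean()

# K trajectories sampled from the policy for one instruction.
trajs = [torch.randn(10, 7) for _ in range(8)]           # 10 steps, 7-DoF actions
goal = torch.zeros(7)
rewards = torch.stack([verifiable_reward(t, goal) for t in trajs])
advantages = rlvr_advantages(rewards)
# Positive-advantage samples are up-weighted in the policy-gradient loss:
# loss = -(advantages.detach() * logprobs).mean()
```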
arXiv Detail & Related papers (2025-10-02T02:54:03Z)
- Pure Vision Language Action (VLA) Models: A Comprehensive Survey [16.014856048038272]
The emergence of Vision Language Action (VLA) models marks a paradigm shift from traditional policy-based control to generalized robotics. This survey delves into advanced VLA methods, aiming to provide a clear taxonomy and a systematic, comprehensive review of existing research.
arXiv Detail & Related papers (2025-09-23T13:53:52Z)
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
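A minimal sketch of the two-stage decoding idea described above, under the assumption that the model first autoregressively emits tokens for a subgoal image (the visual chain of thought) and then emits action tokens conditioned on it. The `model` interface and token budgets are illustrative, not CoT-VLA's actual API.

```python
import torch

def generate_with_visual_cot(model, prompt_ids, n_img_tokens=256, n_act_tokens=7):
    """Illustrative two-stage decoding in the spirit of CoT-VLA. `model` is
    assumed to map a token sequence to next-token logits."""
    seq = prompt_ids
    for _ in range(n_img_tokens):                       # stage 1: visual CoT
        logits = model(seq)
        nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, nxt], dim=1)
    actions = []
    for _ in range(n_act_tokens):                       # stage 2: action tokens
        logits = model(seq)
        nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, nxt], dim=1)
        actions.append(nxt)
    return torch.cat(actions, dim=1)                    # (B, n_act_tokens)

dummy = lambda seq: torch.randn(seq.shape[0], seq.shape[1], 1000)  # stand-in model
acts = generate_with_visual_cot(dummy, torch.zeros(1, 4, dtype=torch.long))
```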
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
- Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models [39.706833232931245]
Foundation Vision Language Models (VLMs) exhibit strong capabilities in multi-modal representation learning, comprehension, and reasoning. By injecting action components into the VLMs, Vision-Language-Action Models (VLAs) can be naturally formed and also show promising performance. In this work, we disclose the key factors that significantly influence the performance of VLAs and focus on three essential design choices. We develop a new family of VLAs, RoboVLMs, which require very few manual designs and achieve new state-of-the-art performance in three simulation tasks and real-world experiments.
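One of the action-modelling design choices such studies compare is a continuous regression head instead of discretized action tokens (contrast with the token-based baseline sketched earlier). The sketch below is an assumption about the general design, not the RoboVLMs implementation.

```python
import torch
import torch.nn as nn

class ContinuousActionHead(nn.Module):
    """Illustrative continuous action head: regress a chunk of future
    actions from the VLM's summary hidden state. Sizes are hypothetical."""

    def __init__(self, d_model=512, action_dim=7, chunk=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, action_dim * chunk),
        )
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, last_hidden):                     # (B, D) summary token
        out = self.mlp(last_hidden)
        return out.view(-1, self.chunk, self.action_dim)

head = ContinuousActionHead()
pred = head(torch.randn(4, 512))                        # (4, 8, 7) action chunk
loss = nn.functional.mse_loss(pred, torch.randn(4, 8, 7))
```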
arXiv Detail & Related papers (2024-12-18T17:07:20Z)
- ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. We propose a gradual backbone reversal approach founded on model merging.
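Model merging here can be sketched as weight-space interpolation between the robotics fine-tuned backbone and the original, visually robust pretrained one, with the mixing coefficient ramped gradually. The schedule and parameter grouping below are assumptions; ReVLA's exact procedure may differ.

```python
import torch

@torch.no_grad()
def merge_backbones(finetuned, pretrained, alpha):
    """Interpolate each parameter between two state dicts:
    alpha=0 keeps the fine-tuned weights, alpha=1 fully reverts."""
    return {name: (1 - alpha) * w_ft + alpha * pretrained[name]
            for name, w_ft in finetuned.items()}

# Gradually revert toward the pretrained visual backbone over training.
ft = {"w": torch.ones(2, 2)}    # toy stand-ins for real state dicts
pt = {"w": torch.zeros(2, 2)}
for step, alpha in enumerate(torch.linspace(0.0, 1.0, steps=5)):
    merged = merge_backbones(ft, pt, alpha.item())
    print(step, merged["w"][0, 0].item())
```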
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
- VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation [7.8735930411335895]
We present VLATest, a fuzzing framework designed to generate robotic manipulation scenes for testing VLA models. Based on VLATest, we conducted an empirical study to assess the performance of seven representative VLA models.
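The core fuzzing idea can be sketched as randomized scene generation: perturb the target object's pose, lighting, and distractors, then pair each scene with an instruction and roll out the model. All field names and ranges below are hypothetical, not VLATest's actual schema.

```python
import random

def fuzz_scene(n_distractors=3, seed=None):
    """Illustrative scene fuzzer in the spirit of VLATest."""
    rng = random.Random(seed)
    objects = ["cup", "can", "block", "bowl", "spoon"]
    target = rng.choice(objects)
    return {
        "target": target,
        "target_pose": [rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3),
                        rng.uniform(0, 3.14)],                 # x, y, yaw
        "lighting": rng.uniform(0.3, 1.5),                     # intensity scale
        "distractors": rng.sample([o for o in objects if o != target],
                                  k=n_distractors),
        "instruction": f"pick up the {target}",
    }

# Generate a batch of test scenes; the real framework would roll out the
# VLA model in each to measure success rate.
scenes = [fuzz_scene(seed=i) for i in range(100)]
```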
arXiv Detail & Related papers (2024-09-19T16:33:00Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models [55.5610165938949]
Fine-tuning vision-language models (VLMs) has gained increasing popularity due to its practical value.
This paper explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model.
We introduce three customized ensemble strategies, each tailored to one specific scenario.
The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance.
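The three tailored strategies are not detailed in this summary, but the basic building block they share, weighted averaging of per-model image-text similarity logits, can be sketched as below (the weighting scheme is our assumption).

```python
import torch

def ensemble_zero_shot(image_feats, text_feats, weights=None):
    """Illustrative VLM ensemble: average the image-text similarity logits
    of several (possibly weaker) models, optionally weighted."""
    logits = [img @ txt.T for img, txt in zip(image_feats, text_feats)]
    if weights is None:
        weights = [1.0 / len(logits)] * len(logits)
    return sum(w * l for w, l in zip(weights, logits))

# Three models' L2-normalized features for 4 images and 10 class prompts.
imgs = [torch.nn.functional.normalize(torch.randn(4, d), dim=-1)
        for d in (512, 768, 512)]
txts = [torch.nn.functional.normalize(torch.randn(10, d), dim=-1)
        for d in (512, 768, 512)]
pred = ensemble_zero_shot(imgs, txts).argmax(dim=-1)    # (4,) class predictions
```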
arXiv Detail & Related papers (2023-11-28T05:17:25Z)
- VinVL: Revisiting Visual Representations in Vision-Language Models [96.39332942534368]
We develop an improved object detection model to provide object-centric representations of images.
New visual features significantly improve the performance across all vision language (VL) tasks.
We will release the new object detection model to the public.
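A minimal sketch of the object-centric pipeline this describes: region features from a detector are projected and concatenated with text embeddings before a cross-modal transformer. The detector is stubbed out and all dimensions are assumptions, not VinVL's released code.

```python
import torch
import torch.nn as nn

d_model, n_regions = 768, 36
region_feats = torch.randn(2, n_regions, 2048)          # detector ROI features
region_proj = nn.Linear(2048, d_model)                  # project to fusion width
text_emb = torch.randn(2, 20, d_model)                  # embedded caption tokens

# Fuse object-centric visual tokens with text tokens in one transformer.
fusion_in = torch.cat([region_proj(region_feats), text_emb], dim=1)
layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
fused = nn.TransformerEncoder(layer, num_layers=2)(fusion_in)  # (2, 56, d_model)
```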
arXiv Detail & Related papers (2021-01-02T23:35:27Z)