VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models
- URL: http://arxiv.org/abs/2602.05049v1
- Date: Wed, 04 Feb 2026 20:59:29 GMT
- Title: VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models
- Authors: Yiye Chen, Yanan Jian, Xiaoyi Dong, Shuxin Cao, Jing Wu, Patricio Vela, Benjamin E. Lundell, Dongdong Chen
- Abstract summary: Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite the success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment. We propose a training framework that explicitly strengthens visual conditioning in VLA models.
- Score: 26.542479606920423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite the success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment, where action predictions exhibit weak dependence on the current visual state, leading to unreliable action outputs. In this work, we study VLA models through the lens of visual conditioning and empirically show that successful rollouts consistently exhibit stronger visual dependence than failed ones. Motivated by this observation, we propose a training framework that explicitly strengthens visual conditioning in VLA models. Our approach first aligns action prediction with visual input via preference optimization on a track-following surrogate task, and then transfers the enhanced alignment to the instruction-following task through latent-space distillation during supervised finetuning. Without introducing architectural modifications or additional data collection, our method improves both visual conditioning and task performance for discrete OpenVLA, and further yields consistent gains when extended to the continuous OpenVLA-OFT setting. Project website: https://vista-vla.github.io/ .
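The abstract describes a two-stage recipe: preference optimization on a track-following surrogate task, then latent-space distillation during supervised finetuning. The sketch below illustrates the general shape of such a pipeline with a DPO-style preference loss and an L2 feature-distillation term; the function names, the `beta` and `lambda_distill` coefficients, and the `log_prob`/`latent` model interface are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical interface: policy.log_prob(obs, actions[, instr]) returns per-sample
# log-likelihoods, and policy.latent(obs, instr) returns a feature used for distillation.

def track_following_dpo_loss(policy, ref_policy, obs, chosen_actions, rejected_actions, beta=0.1):
    """DPO-style preference loss on a track-following surrogate task (sketch).

    `chosen_actions` follow the visual track in `obs`; `rejected_actions` do not,
    so lowering their relative likelihood ties action prediction to the visual input.
    """
    logp_chosen = policy.log_prob(obs, chosen_actions)        # (B,)
    logp_rejected = policy.log_prob(obs, rejected_actions)    # (B,)
    with torch.no_grad():
        ref_chosen = ref_policy.log_prob(obs, chosen_actions)
        ref_rejected = ref_policy.log_prob(obs, rejected_actions)
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def distill_sft_loss(student, teacher, obs, instruction, target_actions, lambda_distill=1.0):
    """Supervised finetuning on instruction following plus latent-space distillation
    from the track-following-aligned teacher, so the strengthened visual conditioning
    transfers to the instruction-following task (illustrative form only)."""
    sft = -student.log_prob(obs, target_actions, instruction).mean()
    with torch.no_grad():
        teacher_feat = teacher.latent(obs, instruction)
    distill = F.mse_loss(student.latent(obs, instruction), teacher_feat)
    return sft + lambda_distill * distill
```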
Related papers
- ActionCodec: What Makes for Good Action Tokenizers [106.78093973045526]
Vision-Language-Action (VLA) models have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity. We introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance.
arXiv Detail & Related papers (2026-02-17T07:07:15Z)
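For context on what an action tokenizer does, here is a minimal uniform-binning tokenizer of the kind used by discrete VLA policies such as OpenVLA. It is a generic baseline for illustration only; the summary above does not describe ActionCodec's actual design.

```python
import numpy as np

class BinActionTokenizer:
    """Generic baseline: map each continuous action dimension to one of `n_bins`
    uniformly spaced tokens. Shows the interface an action tokenizer exposes;
    not ActionCodec's design."""

    def __init__(self, low, high, n_bins=256):
        self.low = np.asarray(low, dtype=np.float32)
        self.high = np.asarray(high, dtype=np.float32)
        self.n_bins = n_bins

    def encode(self, action):
        """Continuous action (D,) -> integer tokens (D,) in [0, n_bins - 1]."""
        norm = (np.asarray(action, dtype=np.float32) - self.low) / (self.high - self.low)
        return np.clip((norm * self.n_bins).astype(int), 0, self.n_bins - 1)

    def decode(self, tokens):
        """Integer tokens (D,) -> bin-center continuous action (D,)."""
        centers = (np.asarray(tokens) + 0.5) / self.n_bins
        return self.low + centers * (self.high - self.low)

# Example: a 7-DoF end-effector action normalized to [-1, 1] per dimension.
tok = BinActionTokenizer(low=[-1.0] * 7, high=[1.0] * 7)
tokens = tok.encode([0.1, -0.5, 0.0, 0.3, 0.0, 0.0, 1.0])
recovered = tok.decode(tokens)  # reconstruction error bounded by half a bin width
```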
- Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision [79.06371915084833]
We introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm. Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. We extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions.
arXiv Detail & Related papers (2026-01-27T17:01:16Z)
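The Youtu-VL summary above describes supervising visual and linguistic content under one autoregressive objective. A minimal sketch of such a unified next-token loss follows; the token layout, the shared vocabulary, and the model interface are assumptions for illustration, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def unified_autoregressive_loss(model, image_tokens, text_tokens):
    """Unified supervision sketch: concatenate discrete visual tokens and text tokens
    into one sequence and apply a single next-token prediction loss, so visual detail
    and linguistic content are supervised alike (illustrative layout only)."""
    seq = torch.cat([image_tokens, text_tokens], dim=1)   # (B, T) token ids
    logits = model(seq[:, :-1])                           # (B, T-1, vocab), assumed interface
    targets = seq[:, 1:]                                  # predict every next token
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```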
- ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations. To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z)
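The ReViP summary points to over-reliance on proprioception as a source of false completions. The sketch below shows one generic way to rebalance the two modalities during training, by randomly masking the proprioceptive state so the policy must stay grounded in vision. This is only an illustration of the rebalancing idea, not ReViP's mechanism, which instead uses an external VLM as a task-stage observer.

```python
import torch

def rebalance_inputs(image_feat, proprio, p_drop=0.5, training=True):
    """Generic rebalance sketch (not ReViP's mechanism): with probability `p_drop`,
    zero out a sample's proprioceptive state so action prediction cannot rely on it
    alone and must remain grounded in the visual observation."""
    if training:
        keep = (torch.rand(proprio.size(0), 1, device=proprio.device) > p_drop).float()
        proprio = proprio * keep
    # Fused policy input; assumes image_feat and proprio share the batch dimension.
    return torch.cat([image_feat, proprio], dim=-1)
```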
- iFlyBot-VLA Technical Report [25.330744626382977]
We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; and (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets.
arXiv Detail & Related papers (2025-11-01T06:24:56Z)
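The second contribution listed above, dual-level action supervision of both the VLM and the action expert, can be pictured as a two-term objective. The sketch below is only a guess at the general form (a discrete action-token loss plus a continuous regression loss); the weighting, heads, and shapes are assumptions, not the report's implementation.

```python
import torch.nn.functional as F

def dual_level_loss(vlm_logits, action_token_targets, expert_pred, action_targets, w=1.0):
    """Dual-level supervision sketch: a token-level action loss on the VLM head and a
    continuous regression loss on the action expert, trained jointly (illustrative form;
    the heads and the weight `w` are assumptions)."""
    token_loss = F.cross_entropy(
        vlm_logits.reshape(-1, vlm_logits.size(-1)),   # (B*T, vocab)
        action_token_targets.reshape(-1),              # (B*T,)
    )
    expert_loss = F.l1_loss(expert_pred, action_targets)
    return token_loss + w * expert_loss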
- Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization [42.41263928527529]
Pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language grounding. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original visual representations and knowledge are preserved. We conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations.
arXiv Detail & Related papers (2025-10-29T15:20:10Z)
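A simple way to quantify the representation degradation this paper studies is to compare the visual features of the frozen pretrained VLM with those of the action-finetuned model on the same images. The probe below uses mean cosine similarity; the metric choice, pooling, and encoder interface are assumptions for illustration, not the paper's protocol.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def visual_retention_score(pretrained_encoder, finetuned_encoder, images):
    """Probe sketch: mean cosine similarity between pooled visual features of the
    pretrained VLM and the action-finetuned VLA on the same images. Values near 1
    suggest retained representations; low values suggest drift (illustrative metric)."""
    f0 = pretrained_encoder(images).mean(dim=1)   # (B, D), assumes (B, N, D) patch features
    f1 = finetuned_encoder(images).mean(dim=1)
    return F.cosine_similarity(f0, f1, dim=-1).mean().item()
```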
- ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context [54.58057019521198]
Leveraging temporal context is crucial for success in partially observable robotic tasks. Prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. We introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations.
arXiv Detail & Related papers (2025-10-05T15:29:57Z)
- EdgeVLA: Efficient Vision-Language-Action Models [0.4005096060512278]
This paper introduces Edge VLA, a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs). Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency.
arXiv Detail & Related papers (2025-07-18T16:15:09Z)
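The claimed 7x speedup comes from predicting the end-effector pose in a single forward pass instead of decoding it token by token. The toy contrast below illustrates that difference; the model and action-head interfaces are hypothetical, not EdgeVLA's code.

```python
import torch

@torch.no_grad()
def autoregressive_decode(model, prefix, n_action_tokens=7):
    """Baseline: one forward pass per action token (the loop EdgeVLA-style models remove)."""
    tokens = prefix
    for _ in range(n_action_tokens):
        logits = model(tokens)                        # (B, T, vocab), assumed interface
        next_tok = logits[:, -1:].argmax(dim=-1)      # greedy pick of the next token
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, -n_action_tokens:]

@torch.no_grad()
def parallel_decode(action_head, prefix_features, n_action_tokens=7):
    """Non-autoregressive sketch: a dedicated head emits logits for all action positions
    in one pass, removing the per-token loop (hypothetical head, illustrative only)."""
    logits = action_head(prefix_features)             # (B, n_action_tokens, vocab)
    return logits.argmax(dim=-1)
```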
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
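The core idea summarized above is that the model generates visual tokens (an imagined intermediate observation) before the action tokens conditioned on them. A minimal decoding sketch is shown below; the token budgets and model interface are assumptions for illustration.

```python
import torch

@torch.no_grad()
def visual_cot_rollout(model, obs_tokens, instr_tokens, n_subgoal_tokens=256, n_action_tokens=7):
    """Visual chain-of-thought sketch: autoregressively generate subgoal image tokens
    first, then action tokens conditioned on them (token counts are illustrative)."""
    seq = torch.cat([obs_tokens, instr_tokens], dim=1)
    for _ in range(n_subgoal_tokens + n_action_tokens):
        logits = model(seq)                           # (B, T, vocab), assumed interface
        next_tok = logits[:, -1:].argmax(dim=-1)
        seq = torch.cat([seq, next_tok], dim=1)
    subgoal = seq[:, -(n_subgoal_tokens + n_action_tokens):-n_action_tokens]
    actions = seq[:, -n_action_tokens:]
    return subgoal, actions
```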
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [100.25567121604382]
Vision-Language-Action (VLA) models have improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. We present a new advanced VLA architecture derived from Vision-Language-Models (VLM). We show that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
arXiv Detail & Related papers (2024-11-29T12:06:03Z)
- VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
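The VeCAF summary describes selecting finetuning data with awareness of the training objective. A simplified, generic version of objective-aware selection is to score the candidate pool with the current model's loss and keep the highest-loss samples for the next batch; the sketch below illustrates only that idea, not VeCAF's label- and language-guided selection.

```python
import torch

@torch.no_grad()
def select_active_batch(model, loss_fn, pool_inputs, pool_labels, batch_size=64):
    """Generic objective-aware selection sketch: rank a candidate pool by current
    per-sample loss and pick the highest-loss samples for the next finetuning batch.
    Assumes `loss_fn` uses reduction='none' so it returns one loss per sample."""
    logits = model(pool_inputs)
    per_sample_loss = loss_fn(logits, pool_labels)        # shape (N,)
    top = torch.topk(per_sample_loss, k=batch_size).indices
    return pool_inputs[top], pool_labels[top]
```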