Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
- URL: http://arxiv.org/abs/2510.25616v1
- Date: Wed, 29 Oct 2025 15:20:10 GMT
- Title: Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
- Authors: Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov
- Abstract summary: Vision-Language-Action (VLA) models can endow agents with transferable world knowledge and vision-language grounding. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original visual representations and knowledge are preserved. We conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations.
- Score: 42.41263928527529
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe the VLA's hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
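The abstract does not spell out the alignment method, but a widely used way to preserve pretrained visual representations during fine-tuning is an auxiliary distillation loss against a frozen copy of the original VLM's vision encoder. Below is a minimal sketch of that strategy; the `vla.action_loss` and `vla.vision_encoder` interfaces and the `lambda_align` weight are assumptions for illustration, not the paper's actual API.

```python
import copy
import torch
import torch.nn.functional as F

def make_frozen_reference(vision_encoder: torch.nn.Module) -> torch.nn.Module:
    """Snapshot the pretrained vision encoder before action fine-tuning."""
    ref = copy.deepcopy(vision_encoder).eval()
    for p in ref.parameters():
        p.requires_grad_(False)
    return ref

def alignment_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Cosine distance between current and pretrained visual features."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

def training_step(vla, frozen_encoder, batch, lambda_align: float = 0.1):
    # Standard action objective (behavior cloning, etc.) -- hypothetical API.
    loss = vla.action_loss(batch)
    # Auxiliary term: keep current visual tokens close to the pretrained VLM's.
    student = vla.vision_encoder(batch["images"])   # hypothetical API
    with torch.no_grad():
        teacher = frozen_encoder(batch["images"])
    return loss + lambda_align * alignment_loss(student, teacher)
```

With `lambda_align = 0` this reduces to the naive action fine-tuning that the paper identifies as the source of representation degradation.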
Related papers
- JEPA-VLA: Video Predictive Embedding is Needed for VLA Models [45.11882724608595]
We introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing vision-language models. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks.
arXiv Detail & Related papers (2026-02-12T11:20:43Z)
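The abstract does not say how the predictive embeddings are integrated; one plausible reading of "adaptively integrates" is a learned gated adapter that mixes predictive-embedding tokens into the VLM's visual token stream. A purely illustrative sketch, with module name, shapes, and matched token counts all assumed:

```python
import torch
import torch.nn as nn

class GatedEmbeddingFusion(nn.Module):
    """Adaptively mix predictive (e.g. JEPA-style) embeddings into VLM visual
    tokens. Illustrative only: the actual JEPA-VLA integration is not
    specified in the abstract."""
    def __init__(self, vlm_dim: int, jepa_dim: int):
        super().__init__()
        self.proj = nn.Linear(jepa_dim, vlm_dim)  # map JEPA space -> VLM space
        self.gate = nn.Sequential(nn.Linear(2 * vlm_dim, vlm_dim), nn.Sigmoid())

    def forward(self, vlm_tokens: torch.Tensor, jepa_tokens: torch.Tensor):
        # vlm_tokens: (B, N, vlm_dim); jepa_tokens: (B, N, jepa_dim),
        # assuming both streams produce the same number of tokens N.
        pred = self.proj(jepa_tokens)
        g = self.gate(torch.cat([vlm_tokens, pred], dim=-1))  # per-token gate
        return vlm_tokens + g * pred  # residual, adaptively weighted
```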
- VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models [26.542479606920423]
Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite this success, extending large pretrained VLA models to the action space can induce vision-action misalignment. We propose a training framework that explicitly strengthens visual conditioning in VLA models.
arXiv Detail & Related papers (2026-02-04T20:59:29Z)
- Vision-Language Models Unlock Task-Centric Latent Actions [75.53481518882275]
We propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations. We show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.
arXiv Detail & Related papers (2026-01-30T08:38:59Z)
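The quoted result suggests the representation is steered purely through prompting. A minimal sketch of that idea; the stub VLM interface and the prompt wording are entirely assumptions:

```python
import numpy as np

class PromptableVLM:
    """Stand-in for any VLM exposing (image, text) -> embedding.
    The paper's actual model and prompt are not given in the abstract."""
    def encode(self, image: np.ndarray, prompt: str) -> np.ndarray:
        raise NotImplementedError  # plug a real VLM in here

IGNORE_DISTRACTORS = (
    "Focus only on objects relevant to the task: {task}. "
    "Ignore unrelated or distracting objects in the scene."
)

def task_centric_representation(vlm: PromptableVLM, image, task: str) -> np.ndarray:
    # Conditioning on an 'ignore distractors' instruction steers the VLM's
    # representation toward task-relevant content, which the paper reports
    # substantially improves latent action quality.
    return vlm.encode(image, IGNORE_DISTRACTORS.format(task=task))
```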
- Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision [79.06371915084833]
We introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm. Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. We extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions.
arXiv Detail & Related papers (2026-01-27T17:01:16Z)
- ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations. To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z)
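As a toy illustration of the "rebalance" idea only (ReViP's actual mechanism additionally relies on the external VLM observer described above): a learned gate that trades off vision against proprioception, so the policy cannot silently ignore the camera and declare a task complete.

```python
import torch
import torch.nn as nn

class VisionProprioRebalance(nn.Module):
    """Illustrative rebalance module: learn a per-sample weight deciding how
    much the policy relies on vision vs. proprioception. Not ReViP's actual
    architecture, which the abstract does not fully specify."""
    def __init__(self, vis_dim: int, prop_dim: int, out_dim: int):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, out_dim)
        self.prop_proj = nn.Linear(prop_dim, out_dim)
        self.balance = nn.Sequential(nn.Linear(2 * out_dim, 1), nn.Sigmoid())

    def forward(self, vis_feat: torch.Tensor, prop_feat: torch.Tensor):
        v, p = self.vis_proj(vis_feat), self.prop_proj(prop_feat)
        w = self.balance(torch.cat([v, p], dim=-1))  # w -> 1: trust vision
        return w * v + (1.0 - w) * p
```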
- Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models [8.452688845632995]
We propose Oat-VLA, an Object-Agent-centric Tokenization for Vision-Language-Action (VLA) models. We find that Oat-VLA can drastically reduce the number of visual tokens to just a few tokens without sacrificing performance. We reveal that Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite and outperforms OpenVLA in diverse real-world pick-and-place tasks.
arXiv Detail & Related papers (2025-09-28T05:42:53Z)
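A rough sketch of what object-agent-centric tokenization could look like: keep only the ViT patch tokens at object and agent locations instead of the full patch grid. How Oat-VLA localizes those entities is not described in the abstract; the shapes and localization input here are assumptions.

```python
import torch

def object_agent_tokens(patch_tokens: torch.Tensor,
                        centers: torch.Tensor,
                        grid: int) -> torch.Tensor:
    """Keep only the visual tokens at entity locations.

    patch_tokens: (B, grid*grid, D) ViT patch embeddings.
    centers:      (B, K, 2) normalized (x, y) in [0, 1] for K entities
                  (e.g. task objects and the gripper), assumed given.
    Returns (B, K, D): a handful of tokens instead of grid*grid.
    """
    xy = (centers.clamp(0, 1) * (grid - 1)).round().long()  # to grid coords
    idx = xy[..., 1] * grid + xy[..., 0]                     # flatten row-major
    return torch.gather(
        patch_tokens, 1,
        idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1)))
```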
- Unified Vision-Language-Action Model [86.68814779303429]
We present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
arXiv Detail & Related papers (2025-06-24T17:59:57Z)
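A minimal sketch of the unified discrete-token recipe: pack vision, language, and action ids into one sequence and train an ordinary causal LM objective over it. The merged id space and the separator token are assumptions.

```python
import torch
import torch.nn.functional as F

def unified_sequence(vision_ids: torch.Tensor, text_ids: torch.Tensor,
                     action_ids: torch.Tensor, sep_id: int) -> torch.Tensor:
    """Pack vision / language / action ids into one autoregressive sequence:
    [vision] <sep> [language] <sep> [action]. Assumes all three modalities
    were discretized into a single merged vocabulary."""
    sep = torch.full((vision_ids.size(0), 1), sep_id,
                     dtype=torch.long, device=vision_ids.device)
    return torch.cat([vision_ids, sep, text_ids, sep, action_ids], dim=1)

def next_token_loss(model, seq: torch.Tensor) -> torch.Tensor:
    # Ordinary causal-LM objective over the whole multimodal sequence;
    # `model` is any decoder-only transformer returning (B, T, vocab) logits.
    logits = model(seq[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           seq[:, 1:].reshape(-1))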
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
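A sketch of visual chain-of-thought decoding as described: generate subgoal image tokens first, then condition action generation on them. `model.generate` is a stand-in assumed to return only the newly generated tokens.

```python
import torch

@torch.no_grad()
def visual_cot_rollout(model, obs_tokens: torch.Tensor,
                       text_tokens: torch.Tensor,
                       n_img_tokens: int, n_action_tokens: int):
    """Two-stage decoding in the spirit of visual chain-of-thought.
    `model.generate(ctx, max_new_tokens)` is a hypothetical interface for
    any autoregressive decoder that returns the new tokens only."""
    ctx = torch.cat([obs_tokens, text_tokens], dim=1)
    subgoal = model.generate(ctx, max_new_tokens=n_img_tokens)  # visual CoT
    ctx = torch.cat([ctx, subgoal], dim=1)
    actions = model.generate(ctx, max_new_tokens=n_action_tokens)
    return subgoal, actions
```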
- UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent [14.089700378708756]
We introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives. UP-VLA achieves a 33% improvement on the CALVIN ABC-D benchmark compared to the previous state-of-the-art method.
arXiv Detail & Related papers (2025-01-31T03:20:09Z)
- Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
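A minimal sketch of in-context value estimation in this spirit: prompt a multimodal chat model for per-frame task progress and parse the numbers. The prompt wording and the `vlm_chat` interface are assumptions, not GVL's actual protocol.

```python
from typing import List, Sequence
import re

PROGRESS_PROMPT = (
    "You are watching a robot attempt the task: {task}. "
    "For each frame, output the task completion percentage (0-100), "
    "one number per line."
)

def estimate_values(vlm_chat, frames: Sequence, task: str) -> List[float]:
    """Query a multimodal chat model for per-frame task progress, normalized
    to [0, 1]. `vlm_chat(prompt, images) -> str` is a stand-in for whatever
    VLM API is used."""
    reply = vlm_chat(PROGRESS_PROMPT.format(task=task), list(frames))
    numbers = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", reply)]
    return [min(max(n, 0.0), 100.0) / 100.0 for n in numbers[: len(frames)]]
```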
- PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
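Explicit position modeling in PEVL's style can be sketched as quantizing normalized box coordinates into a small set of position tokens that live in the text vocabulary; the bin count and token format below are illustrative assumptions.

```python
def box_to_position_tokens(box, bins: int = 512):
    """Discretize a normalized bounding box (x1, y1, x2, y2) in [0, 1] into
    position tokens usable inside an ordinary language-modeling sequence."""
    return [f"<pos_{min(int(v * bins), bins - 1)}>" for v in box]

# Interleave position tokens with text, e.g. for phrase grounding:
# "a dog <pos_102> <pos_56> <pos_373> <pos_296> lying on the grass"
caption = ("a dog "
           + " ".join(box_to_position_tokens((0.2, 0.11, 0.73, 0.58)))
           + " lying on the grass")
```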
- VinVL: Revisiting Visual Representations in Vision-Language Models [96.39332942534368]
We develop an improved object detection model to provide object-centric representations of images.
The new visual features significantly improve performance across all vision-language (VL) tasks.
We will release the new object detection model to the public.
arXiv Detail & Related papers (2021-01-02T23:35:27Z)