JEPA-VLA: Video Predictive Embedding is Needed for VLA Models
- URL: http://arxiv.org/abs/2602.11832v1
- Date: Thu, 12 Feb 2026 11:20:43 GMT
- Title: JEPA-VLA: Video Predictive Embedding is Needed for VLA Models
- Authors: Shangchen Miao, Ningya Feng, Jialong Wu, Ye Lin, Xu He, Dong Li, Mingsheng Long
- Abstract summary: We introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks.
- Score: 45.11882724608595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, the pretrained visual representation, which provides insufficient knowledge for both environment understanding and the policy prior. Through an in-depth analysis, we find that the visual representations commonly used in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.
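The abstract does not spell out how the "adaptive integration" is implemented. As a rough illustration only, one plausible reading is to gate predictive video embeddings (e.g., from a frozen V-JEPA 2-style encoder) into the VLM's visual token stream before the action head. The module below is a minimal sketch under that assumption; the class name, dimensions, per-token gating scheme, and the premise that predictive and VLM tokens are already aligned are all illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn


class AdaptivePredictiveFusion(nn.Module):
    """Toy sketch (not JEPA-VLA's code): adaptively blend video-predictive
    embeddings into the visual tokens consumed by a VLA backbone."""

    def __init__(self, vlm_dim: int = 1024, jepa_dim: int = 1280):
        super().__init__()
        self.proj = nn.Linear(jepa_dim, vlm_dim)  # map predictive embeddings into the VLM token space
        self.gate = nn.Sequential(                # per-token scalar gate in [0, 1]
            nn.Linear(2 * vlm_dim, vlm_dim),
            nn.GELU(),
            nn.Linear(vlm_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, vlm_tokens: torch.Tensor, jepa_tokens: torch.Tensor) -> torch.Tensor:
        # vlm_tokens:  (B, N, vlm_dim)  visual tokens from the VLM's image encoder
        # jepa_tokens: (B, N, jepa_dim) predictive embeddings, assumed aligned to the same patches
        pred = self.proj(jepa_tokens)
        g = self.gate(torch.cat([vlm_tokens, pred], dim=-1))
        return vlm_tokens + g * pred              # gated residual keeps the original tokens intact


# Shape check with random tensors (no pretrained encoders involved).
fusion = AdaptivePredictiveFusion()
out = fusion(torch.randn(2, 256, 1024), torch.randn(2, 256, 1280))
print(out.shape)  # torch.Size([2, 256, 1024])
```

A gated residual is used here so the policy can fall back to the original VLM tokens when the predictive features are uninformative, which is one way to interpret "adaptive"; the actual mechanism in JEPA-VLA may differ.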
Related papers
- ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations. To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z) - Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization [42.41263928527529]
Vision-Language-Action (VLA) models build on vision-language models (VLMs) that endow agents with transferable world knowledge and vision-language grounding. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original visual representations and knowledge are preserved. We conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations.
arXiv Detail & Related papers (2025-10-29T15:20:10Z) - ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context [54.58057019521198]
Leveraging temporal context is crucial for success in partially observable robotic tasks. Prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. We introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations.
arXiv Detail & Related papers (2025-10-05T15:29:57Z) - EdgeVLA: Efficient Vision-Language-Action Models [0.4005096060512278]
This paper introduces Edge VLA, a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs). Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency.
arXiv Detail & Related papers (2025-07-18T16:15:09Z) - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z) - UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent [14.089700378708756]
We introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives. UP-VLA achieves a 33% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method.
arXiv Detail & Related papers (2025-01-31T03:20:09Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data. We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
arXiv Detail & Related papers (2024-09-19T07:10:18Z) - PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding (see the sketch after this list).
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
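For the PEVL entry above, the abstract describes folding discretized object positions and language into one language modeling stream. The toy sketch below shows one way such position discretization could look; the bin count, token naming, and the way tokens are appended to the caption are assumptions for illustration, not PEVL's actual scheme.

```python
# Toy sketch (not PEVL's code): quantize a bounding box into discrete position
# tokens that can live in the same token stream as the caption text.

def box_to_position_tokens(box, image_w, image_h, num_bins=512):
    """Map (x1, y1, x2, y2) pixel coordinates to num_bins discrete position tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(int(v * num_bins), num_bins - 1) for v in norm]  # clamp to the last bin
    return [f"<pos_{b}>" for b in bins]


caption = "a dog catching a frisbee"
tokens = box_to_position_tokens((48, 32, 220, 180), image_w=640, image_h=480)
print(caption + " " + " ".join(tokens))
# a dog catching a frisbee <pos_38> <pos_34> <pos_176> <pos_192>
```

Representing positions this way lets a single language modeling objective cover both words and object locations, which matches the "unified language modeling framework" phrasing in the abstract.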