ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context
- URL: http://arxiv.org/abs/2510.04246v1
- Date: Sun, 05 Oct 2025 15:29:57 GMT
- Title: ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context
- Authors: Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, Jinwoo Shin,
- Abstract summary: Leveraging temporal context is crucial for success in partially observable robotic tasks.<n>Prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations.<n>We introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations.
- Score: 54.58057019521198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging temporal context is crucial for success in partially observable robotic tasks. However, prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. In this paper, we introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations. Our approach is motivated by the key observation that Vision-Language-Action models (VLA), i.e., policy models built upon a Vision-Language Model (VLM), more effectively utilize multi-frame observations for action generation. This suggests that VLMs' inherent temporal understanding capability enables them to extract more meaningful context from multi-frame observations. However, the high dimensionality of video inputs introduces significant computational overhead, making VLA training and inference inefficient. To address this, ContextVLA compresses past observations into a single context token, allowing the policy to efficiently leverage temporal context for action generation. Our experiments show that ContextVLA consistently improves over single-frame VLAs and achieves the benefits of full multi-frame training but with reduced training and inference times.
Related papers
- VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models [26.542479606920423]
Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks.<n>Despite the success, extending large pretrained VLA models to the action space can induce vision-action misalignment.<n>We propose a training framework that explicitly strengthens visual conditioning in VLA models.
arXiv Detail & Related papers (2026-02-04T20:59:29Z) - dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics.<n>We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z) - Unified Vision-Language-Action Model [86.68814779303429]
We present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences.<n>Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and Simplenv-Bridge.<n>We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
arXiv Detail & Related papers (2025-06-24T17:59:57Z) - CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation [67.1520483301709]
CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage.<n>CronusVLA achieves state-of-the-art performance on SimplerEnv with 70.9% success rate, and 12.7% improvement over OpenVLA on LIBERO.
arXiv Detail & Related papers (2025-06-24T17:30:27Z) - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs)<n>We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens.<n>Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes.<n>Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data.<n>We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
arXiv Detail & Related papers (2024-09-19T07:10:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.