CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
- URL: http://arxiv.org/abs/2506.19816v2
- Date: Thu, 30 Oct 2025 16:38:19 GMT
- Title: CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
- Authors: Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang
- Abstract summary: CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
- Score: 84.51372201195132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
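To make the two-stage design and the feature-chunking step easier to picture, here is a minimal, hypothetical sketch of how per-frame features from a single-frame backbone could be buffered and aggregated into an action chunk. Every name and dimension below (`HistoryAggregator`, `feat_dim`, `chunk_size`, the transformer mixer) is an illustrative assumption rather than the paper's actual architecture.

```python
# Minimal sketch of multi-frame aggregation via feature chunking (illustrative only).
# Assumed, not from the paper: a frozen single-frame backbone that emits one feature
# vector per observation, a fixed history length, and a small head that decodes the
# aggregated feature into a chunk of continuous actions.
from collections import deque

import torch
import torch.nn as nn


class HistoryAggregator(nn.Module):
    """Fuses the last `history_len` per-frame features into one action chunk."""

    def __init__(self, feat_dim=512, history_len=4, action_dim=7, chunk_size=8):
        super().__init__()
        self.history_len = history_len
        self.buffer = deque(maxlen=history_len)      # rolling per-frame feature history
        self.mixer = nn.TransformerEncoder(          # lightweight temporal mixer
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(feat_dim, action_dim * chunk_size)
        self.action_dim, self.chunk_size = action_dim, chunk_size

    @torch.no_grad()
    def step(self, frame_feature):
        """frame_feature: (feat_dim,) feature from the single-frame VLM backbone."""
        self.buffer.append(frame_feature)
        feats = list(self.buffer)
        while len(feats) < self.history_len:          # pad by repeating the oldest frame
            feats.insert(0, feats[0])
        hist = torch.stack(feats).unsqueeze(0)        # (1, history_len, feat_dim)
        mixed = self.mixer(hist)[:, -1]               # take the latest position
        actions = self.action_head(mixed)
        return actions.view(self.chunk_size, self.action_dim)


if __name__ == "__main__":
    agg = HistoryAggregator()
    feature = torch.randn(512)                        # stand-in for one backbone feature
    print(agg.step(feature).shape)                    # torch.Size([8, 7])
```

The point the sketch tries to convey is that the expensive VLM backbone runs only once per frame, while only a lightweight aggregator ever sees the multi-frame history.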
Related papers
- Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement [27.517125673741486]
Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic control. We propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens. We introduce a new benchmark that more effectively evaluates the long-horizon temporal dependency modeling ability of VLAs.
arXiv Detail & Related papers (2026-02-03T20:17:47Z) - FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization [61.10456021136654]
We introduce FASTer, a unified framework for efficient and general robot learning. FASTerVQ encodes action chunks as single-channel images, capturing global-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance.
arXiv Detail & Related papers (2025-12-04T16:21:38Z) - ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context [54.58057019521198]
Leveraging temporal context is crucial for success in partially observable robotic tasks. Prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. We introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations.
arXiv Detail & Related papers (2025-10-05T15:29:57Z) - dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z) - EdgeVLA: Efficient Vision-Language-Action Models [0.4005096060512278]
This paper introduces Edge VLA, a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs). Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency.
arXiv Detail & Related papers (2025-07-18T16:15:09Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [69.54069477520534]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. Their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
arXiv Detail & Related papers (2025-06-15T05:04:17Z) - Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models [30.7855782696894]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. We propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models.
arXiv Detail & Related papers (2025-05-27T13:47:18Z) - Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction [12.064509280163502]
3D occupancy prediction has emerged as a key perception task for autonomous driving. Recent studies focus on integrating information obtained from past observations to improve prediction accuracy. We propose StreamOcc, a framework that aggregates past temporal information in a stream-based manner. Experiments on the Occ3D-nuScenes dataset show that StreamOcc achieves state-of-the-art performance in real-time settings, while reducing memory usage by more than 50% compared to previous methods.
arXiv Detail & Related papers (2025-03-28T02:05:53Z) - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z) - ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average of 0.77% accuracy on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation [33.41226268323332]
Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions in the first-person view.
Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network.
We propose a novel visual-semantic fusion-enhanced, Transformer-GRU-based action anticipation framework.
arXiv Detail & Related papers (2023-07-08T06:49:54Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
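As a rough illustration of the memory-bank idea described above (only the current frame is encoded, then fused with stored historical features), the following sketch uses assumed names, dimensions, and an attention-based fusion that are not taken from the paper.

```python
# Hypothetical sketch of a per-tracklet memory bank (illustrative, not the paper's code).
import torch
import torch.nn as nn


class TrackletMemory(nn.Module):
    def __init__(self, feat_dim=256, capacity=8):
        super().__init__()
        self.capacity = capacity
        self.bank = []                                          # stored historical features
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, current_feat):
        """current_feat: (feat_dim,) feature of the current frame only."""
        history = self.bank if self.bank else [current_feat]
        mem = torch.stack(history).unsqueeze(0)                 # (1, T, feat_dim)
        query = current_feat.view(1, 1, -1)                     # current frame as query
        fused, _ = self.attn(query, mem, mem)                   # cross-attend to history
        self.bank.append(current_feat.detach())                 # update the memory bank
        self.bank = self.bank[-self.capacity:]
        return fused.reshape(-1)                                # (feat_dim,)


if __name__ == "__main__":
    memory = TrackletMemory()
    for _ in range(3):
        print(memory(torch.randn(256)).shape)                   # torch.Size([256])
```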
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.