VLA-RAIL: A Real-Time Asynchronous Inference Linker for VLA Models and Robots
- URL: http://arxiv.org/abs/2512.24673v1
- Date: Wed, 31 Dec 2025 06:59:42 GMT
- Title: VLA-RAIL: A Real-Time Asynchronous Inference Linker for VLA Models and Robots
- Authors: Yongsheng Zhao, Lei Zhao, Baoping Cheng, Gongxin Yao, Xuanzhang Wen, Han Gao,
- Abstract summary: Vision-Language-Action (VLA) models have achieved remarkable breakthroughs in robotics. The strategies for fusing a queue of successive action chunks have a profound impact on the overall performance of VLA models. Existing methods suffer from jitter, stalling, or even pauses in robotic action execution. This paper introduces VLA-RAIL, a novel framework designed to conduct model inference and robot motion control asynchronously.
- Score: 5.308743386891208
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-Language-Action (VLA) models have achieved remarkable breakthroughs in robotics, with the action chunk playing a dominant role in these advances. Given the real-time and continuous nature of robotic motion control, the strategies for fusing a queue of successive action chunks have a profound impact on the overall performance of VLA models. Existing methods suffer from jitter, stalling, or even pauses in robotic action execution, which not only limits the achievable execution speed but also reduces the overall task success rate. This paper introduces VLA-RAIL (a Real-Time Asynchronous Inference Linker), a novel framework that addresses these issues by running model inference and robot motion control asynchronously while guaranteeing smooth, continuous, and high-speed action execution. The core contributions of the paper are twofold: a Trajectory Smoother that filters out noise and jitter within the trajectory of a single action chunk using polynomial fitting, and a Chunk Fuser that seamlessly aligns the currently executing trajectory with the newly arrived chunk, ensuring position, velocity, and acceleration continuity between two successive action chunks. We validate the effectiveness of VLA-RAIL on a benchmark of dynamic simulation tasks and on several real-world manipulation tasks. Experimental results demonstrate that VLA-RAIL significantly reduces motion jitter, increases execution speed, and improves task success rates, positioning it as key infrastructure for the large-scale deployment of VLA models.
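The abstract names the two components but gives no implementation details, so the sketch below is only an illustration of what they could look like. It assumes an action chunk arrives as an (H, D) array of joint positions sampled at a fixed control period; the quintic blend and the finite-difference boundary estimates are assumptions of this sketch, not the authors' published method.

```python
# Minimal sketch of the two components named in the abstract, written against
# assumed interfaces (no code is published here): a chunk is an (H, D) NumPy
# array of joint positions at a fixed control period dt. The quintic blend and
# finite-difference boundary estimates are illustrative choices, not
# necessarily the authors' exact formulation.
import numpy as np

def smooth_chunk(chunk: np.ndarray, degree: int = 5) -> np.ndarray:
    """Trajectory Smoother (assumed form): fit a low-order polynomial per joint
    to suppress noise and jitter, then resample on the original time grid."""
    horizon, dof = chunk.shape
    t = np.linspace(0.0, 1.0, horizon)
    smoothed = np.empty_like(chunk, dtype=float)
    for j in range(dof):
        coeffs = np.polyfit(t, chunk[:, j], deg=min(degree, horizon - 1))
        smoothed[:, j] = np.polyval(coeffs, t)
    return smoothed

def fuse_chunks(executing: np.ndarray, incoming: np.ndarray,
                blend_steps: int = 10, dt: float = 0.02) -> np.ndarray:
    """Chunk Fuser (assumed form): bridge the tail of the currently executing
    trajectory and the newly arrived chunk with a quintic segment, which
    matches position, velocity, and acceleration at both ends (C2 continuity)."""
    # Boundary states estimated by finite differences at the junction.
    p0, p1 = executing[-1], incoming[0]
    v0 = (executing[-1] - executing[-2]) / dt
    v1 = (incoming[1] - incoming[0]) / dt
    a0 = (executing[-1] - 2 * executing[-2] + executing[-3]) / dt ** 2
    a1 = (incoming[2] - 2 * incoming[1] + incoming[0]) / dt ** 2

    T = blend_steps * dt
    s = np.linspace(0.0, 1.0, blend_steps)[:, None]  # normalized blend time
    # Quintic Hermite basis: the unique degree-5 polynomial matching
    # (position, velocity, acceleration) at both endpoints.
    h00 = 1 - 10 * s**3 + 15 * s**4 - 6 * s**5
    h10 = s - 6 * s**3 + 8 * s**4 - 3 * s**5
    h20 = 0.5 * s**2 - 1.5 * s**3 + 1.5 * s**4 - 0.5 * s**5
    h01 = 10 * s**3 - 15 * s**4 + 6 * s**5
    h11 = -4 * s**3 + 7 * s**4 - 3 * s**5
    h21 = 0.5 * s**3 - s**4 + 0.5 * s**5
    blend = (h00 * p0 + h10 * T * v0 + h20 * T**2 * a0 +
             h01 * p1 + h11 * T * v1 + h21 * T**2 * a1)
    # Drop the duplicated junction samples when stitching the pieces together.
    return np.concatenate([executing[:-1], blend, incoming[1:]], axis=0)
```

Under the asynchronous scheme described in the abstract, a control loop could keep consuming the fused trajectory at the robot's servo rate while the model produces the next chunk in a separate process.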
Related papers
- Mean-Flow based One-Step Vision-Language-Action [15.497933767026568]
Flow-Matching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks. However, they are constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. We propose a Mean-Flow based One-Step VLA approach, which resolves the noise-induced issues in the action generation process.
arXiv Detail & Related papers (2026-03-02T05:30:30Z) - Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation [10.09057399213028]
Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS) that organizes the system into a fast pathway for action generation and a slow pathway for rich VLM reasoning.
arXiv Detail & Related papers (2025-12-23T09:28:20Z) - Robotic VLA Benefits from Joint Learning with Motion Image Diffusion [114.60268819583017]
Vision-Language-Action (VLA) models have achieved remarkable progress in robotic manipulation by mapping multimodal observations and instructions directly to actions. We propose joint learning with motion image diffusion, a novel strategy that enhances VLA models with motion reasoning capabilities. Experiments in both simulation and real-world environments demonstrate that joint learning with motion image diffusion improves the success rate of pi-series VLAs to 97.5%.
arXiv Detail & Related papers (2025-12-19T19:07:53Z) - FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization [61.10456021136654]
We introduce FASTer, a unified framework for efficient and general robot learning. FASTerVQ encodes action chunks as single-channel images, capturing global temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance.
arXiv Detail & Related papers (2025-12-04T16:21:38Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it brings significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference [24.248289541718275]
Asynchronous inference offers a promising solution to achieve continuous and low-latency control. We propose VLASH, a general asynchronous inference framework for Vision-Language-Action models. It delivers smooth, accurate, and fast-reacting control without additional overhead or architectural changes.
arXiv Detail & Related papers (2025-11-30T18:59:24Z) - dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z) - Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance [63.33213516925946]
We introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
arXiv Detail & Related papers (2025-09-02T07:51:59Z) - SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [70.72227437717467]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
arXiv Detail & Related papers (2025-06-15T05:04:17Z) - Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding [24.1236728596359]
Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. We propose PD-VLA, the first parallel decoding framework for VLA models integrated with action chunking. Our framework reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations (a minimal sketch of this decoding pattern appears after the list).
arXiv Detail & Related papers (2025-03-04T06:12:08Z)
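The PD-VLA entry above describes recasting autoregressive decoding as a fixed-point problem solved by parallel iterations. The sketch below shows that general decoding pattern in a generic form; the `model` callable, its teacher-forced interface, and the greedy argmax choice are illustrative assumptions rather than details taken from the paper.

```python
# Generic Jacobi-style parallel decoding sketch (not PD-VLA's released code):
# every position of the action-token chunk is re-predicted in parallel from the
# current guess until the guess stops changing, i.e. a fixed point is reached.
import numpy as np

def jacobi_decode(model, prefix_ids: np.ndarray, chunk_len: int) -> np.ndarray:
    """`model` is assumed to map a full token sequence of length N to an
    (N, vocab) array of next-token logits in a single teacher-forced pass."""
    tokens = np.zeros(chunk_len, dtype=np.int64)   # arbitrary initial guess
    for _ in range(chunk_len):                     # worst case = sequential decoding
        logits = model(np.concatenate([prefix_ids, tokens]))
        # Logits at position i predict token i + 1, so the chunk's tokens are
        # read off positions prefix_len - 1 .. prefix_len + chunk_len - 2.
        new_tokens = logits[len(prefix_ids) - 1:-1].argmax(axis=-1)
        if np.array_equal(new_tokens, tokens):     # fixed point: greedy-AR result
            break
        tokens = new_tokens
    return tokens
```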