VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
- URL: http://arxiv.org/abs/2512.01031v1
- Date: Sun, 30 Nov 2025 18:59:24 GMT
- Title: VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
- Authors: Jiaming Tang, Yufei Sun, Yilong Zhao, Shang Yang, Yujun Lin, Zhuoyang Zhang, James Hou, Yao Lu, Zhijian Liu, Song Han
- Abstract summary: Asynchronous inference offers a promising solution to achieve continuous and low-latency control. We propose VLASH, a general asynchronous inference framework for Vision-Language-Action models. It delivers smooth, accurate, and fast reaction control without additional overhead or architectural changes.
- Score: 24.248289541718275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action models (VLAs) are becoming increasingly capable across diverse robotic tasks. However, their real-world deployment remains slow and inefficient: demonstration videos are often sped up by 5-10x to appear smooth, with noticeable action stalls and delayed reactions to environmental changes. Asynchronous inference offers a promising solution to achieve continuous and low-latency control by enabling robots to execute actions and perform inference simultaneously. However, because the robot and environment continue to evolve during inference, a temporal misalignment arises between the prediction and execution intervals. This leads to significant action instability, while existing methods either degrade accuracy or introduce runtime overhead to mitigate it. We propose VLASH, a general asynchronous inference framework for VLAs that delivers smooth, accurate, and fast reaction control without additional overhead or architectural changes. VLASH estimates the future execution-time state by rolling the robot state forward with the previously generated action chunk, thereby bridging the gap between prediction and execution. Experiments show that VLASH achieves up to 2.03x speedup and reduces reaction latency by up to 17.4x compared to synchronous inference while fully preserving the original accuracy. Moreover, it empowers VLAs to handle fast-reaction, high-precision tasks such as playing ping-pong and playing whack-a-mole, where traditional synchronous inference fails. Code is available at https://github.com/mit-han-lab/vlash
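The core idea in the abstract, predicting from the estimated execution-time state by rolling the current state forward through the actions that will execute during inference, can be sketched as follows. This is a minimal illustration of the idea only, not the VLASH implementation; the names (`forward_simulate`, `AsyncController`, the additive state update) are hypothetical stand-ins.

```python
def forward_simulate(state, pending_actions):
    """Roll the state forward through the actions that will execute
    while inference runs (illustrative additive dynamics)."""
    for a in pending_actions:
        state = state + a
    return state

class AsyncController:
    """Asynchronous inference sketch: while the robot executes the tail of
    the previous action chunk, the policy predicts the next chunk from the
    estimated future state rather than the stale observation-time state."""

    def __init__(self, policy, inference_steps=2):
        self.policy = policy
        # Number of actions consumed while one inference call is in flight.
        self.inference_steps = inference_steps

    def step(self, state, current_chunk, cursor):
        # Actions that will execute during the upcoming inference call.
        pending = current_chunk[cursor:cursor + self.inference_steps]
        # Bridge the prediction-execution gap: condition on the state the
        # robot will be in when the new chunk actually starts executing.
        future_state = forward_simulate(state, pending)
        return self.policy(future_state)
```

The key design point is that no extra runtime work is added at inference: the rollout reuses the already-generated action chunk, so the policy call itself is unchanged.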
Related papers
- Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation [95.89924101984566]
We introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories. LCM injects a learned consistency constraint that enforces temporal coherence and trajectory smoothness.
arXiv Detail & Related papers (2026-02-22T15:39:34Z) - AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge [49.66156306240961]
High latency breaks the control loop, rendering powerful models unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-13T21:31:19Z) - TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control [15.534182843429043]
Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency. We propose TIDAL, a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture.
arXiv Detail & Related papers (2026-01-21T12:43:11Z) - VLA-RAIL: A Real-Time Asynchronous Inference Linker for VLA Models and Robots [5.308743386891208]
Vision-Language-Action (VLA) models have achieved remarkable breakthroughs in robotics. The strategies for fusing a queue of successive action chunks have a profound impact on the overall performance of VLA models. Existing methods suffer from jitter, stalling, or even pauses in robotic action execution. This paper introduces VLA-RAIL, a novel framework designed to conduct model inference and robot motion control asynchronously.
arXiv Detail & Related papers (2025-12-31T06:59:42Z) - Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation [10.09057399213028]
Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS) that organizes the system into a fast pathway for action generation and a slow pathway for rich VLM reasoning.
arXiv Detail & Related papers (2025-12-23T09:28:20Z) - Stable Video Infinity: Infinite-Length Video Generation with Error Recycling [76.91310169118408]
We propose Stable Video Infinity (SVI), which is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer's self-generated errors into supervisory prompts. We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art performance.
arXiv Detail & Related papers (2025-10-10T09:45:46Z) - dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z) - SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [70.72227437717467]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
arXiv Detail & Related papers (2025-06-15T05:04:17Z) - Real-Time Execution of Action Chunking Flow Policies [49.1574468325115]
This paper presents a novel inference-time algorithm that enables asynchronous execution of action chunking policies. It is applicable to any diffusion- or flow-based VLA system out of the box, with no re-training. Results show that RTC is fast, performant, and uniquely robust to inference latency.
arXiv Detail & Related papers (2025-06-09T01:01:59Z) - Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction [81.34648970317383]
We present Dispider, a system that disentangles Perception, Decision, and Reaction. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses.
arXiv Detail & Related papers (2025-01-06T18:55:10Z) - One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation [80.71541671907426]
OneStep Diffusion Policy (OneDP) is a novel approach that distills knowledge from pre-trained diffusion policies into a single-step action generator.
OneDP significantly accelerates response times for robotic control tasks.
arXiv Detail & Related papers (2024-10-28T17:54:31Z) - HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers [12.373320641721344]
Large Vision-Language-Action (VLA) models have shown promise in robotic control due to their impressive generalization ability. However, their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency. This paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible frequency-performance trade-off.
arXiv Detail & Related papers (2024-09-12T09:18:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.