AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge
- URL: http://arxiv.org/abs/2602.13476v1
- Date: Fri, 13 Feb 2026 21:31:19 GMT
- Title: AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge
- Authors: Noriaki Hirose, Catherine Glossop, Dhruv Shah, Sergey Levine,
- Abstract summary: High latency breaks the control loop, rendering powerful models unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines.
- Score: 49.66156306240961
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Robotic foundation models achieve strong generalization by leveraging internet-scale vision-language representations, but their massive computational cost creates a fundamental bottleneck: high inference latency. In dynamic environments, this latency breaks the control loop, rendering powerful models unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. Inspired by hierarchical control, AsyncVLA runs a large foundation model on a remote workstation to provide high-level guidance, while a lightweight, onboard Edge Adapter continuously refines actions at high frequency. To bridge the domain gap between these asynchronous streams, we introduce an end-to-end finetuning protocol and a trajectory re-weighting strategy that prioritizes dynamic interactions. We evaluate our approach on real-world vision-based navigation tasks with communication delays up to 6 seconds. AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines, effectively bridging the gap between the semantic intelligence of large models and the reactivity required for edge robotics.
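The abstract describes a two-rate structure: a slow loop that queries the remote foundation model and a fast onboard loop that keeps acting on whatever guidance is newest. The sketch below illustrates only that pattern; the function names, rates, and plan format are assumptions, not AsyncVLA's actual interface.

```python
# Minimal sketch of the asynchronous decoupling described above. All names
# (remote_vla_plan, edge_adapter, the 20 Hz rate, the plan format) are
# illustrative assumptions, not AsyncVLA's actual interface.
import threading
import time

latest_guidance = None            # most recent high-level plan from the remote VLA
guidance_lock = threading.Lock()

def remote_vla_plan(observation):
    """Placeholder for a slow, remote foundation-model query (can take seconds)."""
    time.sleep(2.0)               # simulated network + inference latency
    return {"waypoints": [(1.0, 0.0), (2.0, 0.5)]}

def slow_reasoning_loop(get_observation):
    """Runs off-robot; refreshes the shared guidance whenever a new plan arrives."""
    global latest_guidance
    while True:
        plan = remote_vla_plan(get_observation())
        with guidance_lock:
            latest_guidance = plan

def fast_control_loop(get_observation, send_command, edge_adapter, hz=20.0):
    """Runs onboard at a fixed rate, always using the newest available guidance."""
    period = 1.0 / hz
    while True:
        with guidance_lock:
            guidance = latest_guidance
        action = edge_adapter(get_observation(), guidance)  # reactive local refinement
        send_command(action)
        time.sleep(period)

# The two loops would be launched as separate threads (or processes/hosts), e.g.:
# threading.Thread(target=slow_reasoning_loop, args=(get_obs,), daemon=True).start()
```

The point of the pattern is that the fast loop never blocks on the remote call: during a long communication delay (the paper evaluates delays up to 6 seconds), the onboard adapter keeps refining actions from the last plan it received.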
Related papers
- LiteVLA-Edge: Quantized On-Device Multimodal Control for Embedded Robotics [0.6119773373677944]
We present LiteVLA-Edge, a deployment-oriented VLA pipeline for fully on-device inference on Jetson Orin-class hardware. Our approach combines supervised image-to-action fine-tuning in FP32 with post-training 4-bit GGUF quantization and GPU-accelerated inference. Under our configuration, LiteVLA-Edge achieves a mean end-to-end runtime of 150.5 ms (approximately 6.6 Hz) while operating entirely offline.
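For orientation only, the snippet below sketches a generic post-training 4-bit quantization round trip (symmetric, one scale per output channel). It is not the GGUF block format referenced above; it simply illustrates the kind of weight compression involved and how reconstruction error can be checked.

```python
# Generic illustration of post-training 4-bit weight quantization
# (per-output-channel, symmetric). The GGUF format used by LiteVLA-Edge differs;
# this only shows the basic quantize/dequantize round trip and its error.
import numpy as np

def quantize_int4(weights: np.ndarray):
    """weights: (out_features, in_features) FP32 matrix."""
    # one scale per output channel, mapping the max magnitude to the int4 limit (7)
    scales = np.abs(weights).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(weights / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

w = np.random.randn(256, 512).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```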
arXiv Detail & Related papers (2026-03-03T03:20:52Z)
- OneLive: Dynamically Unified Generative Framework for Live-Streaming Recommendation [49.95897358060393]
We propose OneLive, a dynamically unified generative recommendation framework tailored for the live-streaming scenario. OneLive integrates four key components: (i) a Dynamic Tokenizer that continuously encodes evolving real-time live content fused with behavior signals through residual quantization; (ii) a Time-Aware Gated Attention mechanism that explicitly models temporal dynamics for timely decision making; (iii) an efficient decoder-only generative architecture enhanced with Sequential MTP and QK Norm for stable training and accelerated inference.
arXiv Detail & Related papers (2026-02-09T12:56:39Z)
- TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control [15.534182843429043]
Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency. We propose TIDAL, a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture.
arXiv Detail & Related papers (2026-01-21T12:43:11Z)
- Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation [10.09057399213028]
Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS) that organizes the system into a fast pathway for action generation and a slow pathway for rich VLM reasoning.
arXiv Detail & Related papers (2025-12-23T09:28:20Z)
- ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning [52.86018040861575]
We propose a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens. Experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines.
arXiv Detail & Related papers (2025-12-11T18:59:46Z)
- FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization [61.10456021136654]
We introduce FASTer, a unified framework for efficient and general robot learning. FASTerVQ encodes action chunks as single-channel images, capturing global-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance.
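The summary above gives no architectural details of FASTerVQ, so the snippet below only sketches the nearest-neighbor lookup at the core of any vector-quantized action tokenizer; the codebook size, patch length, and chunk shape are illustrative assumptions.

```python
# Toy vector-quantization of an action chunk: each patch of continuous actions is
# replaced by the id of its nearest codebook entry. FASTerVQ's learned encoder and
# image-like layout are not reproduced here; shapes and sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 16))       # 512 codes, 16-dim each (assumed)

def tokenize_action_chunk(chunk: np.ndarray) -> np.ndarray:
    """chunk: (T, D) array of continuous actions -> 1-D array of token ids."""
    patches = chunk.reshape(-1, codebook.shape[1])          # split into 16-dim patches
    # nearest codebook entry by Euclidean distance
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def detokenize(tokens: np.ndarray, chunk_shape) -> np.ndarray:
    return codebook[tokens].reshape(chunk_shape)

chunk = rng.standard_normal((16, 8))             # 16 timesteps of 8-D actions
tokens = tokenize_action_chunk(chunk)            # 8 discrete tokens for the whole chunk
recon = detokenize(tokens, chunk.shape)          # lossy reconstruction from tokens
```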
arXiv Detail & Related papers (2025-12-04T16:21:38Z)
- dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z)
- SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [70.72227437717467]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. Their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
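SP-VLA's actual scheduling policy and pruning criterion are not given in the summary above; as a rough illustration of the token-pruning half, the snippet below keeps only the highest-scoring fraction of visual tokens, using a placeholder importance score.

```python
# Generic importance-based visual token pruning. The importance score here is a
# stand-in (e.g. attention received from a task/query token); SP-VLA's criterion
# and its model-scheduling component are not sketched.
import torch

def prune_tokens(tokens: torch.Tensor, importance: torch.Tensor, keep_ratio: float = 0.25):
    """
    tokens:     (N, D) visual token embeddings
    importance: (N,)   per-token importance score (placeholder)
    Returns the keep_ratio fraction of tokens with the highest importance.
    """
    k = max(1, int(tokens.shape[0] * keep_ratio))
    keep_idx = importance.topk(k).indices.sort().values   # keep original token order
    return tokens[keep_idx], keep_idx

tokens = torch.randn(196, 768)                 # e.g. a 14x14 ViT patch grid
importance = torch.rand(196)                   # placeholder scores
pruned, idx = prune_tokens(tokens, importance, keep_ratio=0.25)   # 49 tokens survive
```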
arXiv Detail & Related papers (2025-06-15T05:04:17Z)
- HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers [12.373320641721344]
Large Vision-Language-Action (VLA) models have shown promise in robotic control due to their impressive generalization ability. However, their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency. This paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between frequency and performance.
arXiv Detail & Related papers (2024-09-12T09:18:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.