LiteVLA-Edge: Quantized On-Device Multimodal Control for Embedded Robotics
- URL: http://arxiv.org/abs/2603.03380v1
- Date: Tue, 03 Mar 2026 03:20:52 GMT
- Title: LiteVLA-Edge: Quantized On-Device Multimodal Control for Embedded Robotics
- Authors: Justin Williams, Kishor Datta Gupta, Roy George, Mrinmoy Sarkar
- Abstract summary: We present LiteVLA-Edge, a deployment-oriented VLA pipeline for fully on-device inference on Jetson Orin-class hardware. Our approach combines supervised image-to-action fine-tuning in FP32 with post-training 4-bit GGUF quantization and GPU-accelerated inference. Under our configuration, LiteVLA-Edge achieves a mean end-to-end runtime of 150.5 ms (approximately 6.6 Hz) while operating entirely offline.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models provide a unified framework for perception, language conditioning, and action generation, but many existing systems remain difficult to deploy in embedded robotic settings because of their computational requirements and inference latency. In this paper, we present LiteVLA-Edge, a deployment-oriented VLA pipeline for fully on-device inference on Jetson Orin-class hardware. Our approach combines supervised image-to-action fine-tuning in FP32 with post-training 4-bit GGUF quantization and GPU-accelerated inference through the llama.cpp runtime. Under our deployment configuration, LiteVLA-Edge achieves a mean end-to-end latency of 150.5 ms (approximately 6.6 Hz) while operating entirely offline within a ROS 2-integrated perception-reasoning-action pipeline. Rather than introducing a new policy objective, our contribution is a practical systems path for executing compact multimodal control models locally on embedded hardware while preserving modular interfaces between perception, reasoning, and actuation. These results establish timing feasibility for reactive language-conditioned control and provide a reproducible baseline for future task-level evaluation of on-device VLAs in robotics.
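A minimal sketch of the on-device inference loop the abstract describes, using the llama-cpp-python bindings; the GGUF filename, prompt format, and text-only action interface are illustrative assumptions (the ROS 2 wiring is omitted), not the authors' actual implementation:

```python
# Hypothetical latency check for a 4-bit GGUF model served by llama.cpp on GPU.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="litevla-edge-q4_k_m.gguf",  # placeholder checkpoint name
    n_gpu_layers=-1,                        # offload all layers to the Orin GPU
    n_ctx=512,
    verbose=False,
)

def infer_action(scene_caption: str) -> str:
    """One perception -> reasoning -> action step (text-only stand-in)."""
    out = llm(f"Observation: {scene_caption}\nAction:", max_tokens=16, temperature=0.0)
    return out["choices"][0]["text"].strip()

# Mean end-to-end latency; the paper's 150.5 ms corresponds to
# 1000 / 150.5 ~= 6.6 Hz.
t0, n = time.perf_counter(), 20
for _ in range(n):
    infer_action("red cube 0.3 m ahead, slightly left")
mean_ms = (time.perf_counter() - t0) / n * 1000
print(f"mean latency: {mean_ms:.1f} ms ({1000 / mean_ms:.1f} Hz)")
```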
Related papers
- AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge
High latency breaks the control loop, rendering powerful models unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines.
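The decoupling AsyncVLA describes can be pictured as a plain two-rate pattern: a slow thread refreshes a shared semantic goal while a fast loop always acts on the latest one. A minimal sketch with stub policies (not the paper's models):

```python
# Two-rate control sketch: slow reasoning never blocks fast execution.
import threading
import time

latest_goal = {"value": "stop"}
lock = threading.Lock()

def send_motor_command(goal: str) -> None:
    pass  # placeholder for the robot's actuation interface

def slow_reasoner():
    """~1 Hz: expensive VLM-style reasoning updates the shared goal."""
    while True:
        time.sleep(1.0)                  # stands in for model latency
        with lock:
            latest_goal["value"] = "go_to_waypoint"  # placeholder VLM output

def fast_controller(hz: float = 50.0):
    """50 Hz: the reactive loop reads whatever goal is currently available."""
    while True:
        with lock:
            goal = latest_goal["value"]
        send_motor_command(goal)
        time.sleep(1.0 / hz)

threading.Thread(target=slow_reasoner, daemon=True).start()
fast_controller()
```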
arXiv Detail & Related papers (2026-02-13T21:31:19Z) - Vision-Language Models on the Edge for Real-Time Robotic Perception
Edge intelligence within 6G, particularly Open RAN and Multi-access Edge Computing, offers a pathway to address these challenges. This work investigates the deployment of Vision-Language Models on ORAN/MEC infrastructure using the Unitree G1 humanoid robot as an embodied testbed. Our results show that edge deployment preserves near-cloud accuracy while reducing end-to-end latency by 5%.
arXiv Detail & Related papers (2026-01-21T12:09:48Z) - ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge
Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control. Current VLA models operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. We introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms.
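One way to read the pipelining idea: hide memory-bound decoding behind execution by decoding the next action chunk while the current chunk runs. A producer/consumer sketch under that interpretation (not the ActionFlow system itself):

```python
# Overlap chunk decoding (producer) with actuation (consumer) via a small queue.
import queue
import threading
import time

chunks: queue.Queue = queue.Queue(maxsize=2)  # buffer between the two stages

def decoder():
    """Producer: decode an 8-step action chunk (~200 ms simulated)."""
    while True:
        time.sleep(0.2)                  # stands in for autoregressive decoding
        chunks.put([0.0] * 8)            # placeholder action chunk

def executor():
    """Consumer: execute the current chunk while the next one decodes."""
    while True:
        for action in chunks.get():
            time.sleep(0.025)            # placeholder 40 Hz actuation step

threading.Thread(target=decoder, daemon=True).start()
executor()
```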
arXiv Detail & Related papers (2025-12-23T11:29:03Z) - Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation
Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS) to organize the system into a fast pathway for action generation and a slow pathway for rich VLM reasoning.
arXiv Detail & Related papers (2025-12-23T09:28:20Z) - Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?
Recent advances in mobile edge computing have made it possible to offload compute-intensive object detection to edge servers equipped with high-accuracy neural networks. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. We propose LTED-Ada, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection in the single-device setting.
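The track-versus-detect choice can be framed as per-frame action selection over a learned value function. A toy epsilon-greedy sketch with stub Q-values (not the LTED-Ada algorithm):

```python
# Toy per-frame decision between cheap local tracking and costly edge detection.
import random

def q_values(state):
    """Placeholder for a learned Q-network: state -> [Q(track), Q(detect)]."""
    frames_since_detect, tracker_confidence = state
    return [tracker_confidence, 0.5 + 0.05 * frames_since_detect]

def choose_action(state, epsilon: float = 0.1) -> str:
    if random.random() < epsilon:        # exploration during training
        return random.choice(["track", "detect"])
    q = q_values(state)
    return "track" if q[0] >= q[1] else "detect"

print(choose_action((12, 0.4)))  # stale, low-confidence track -> likely "detect"
```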
arXiv Detail & Related papers (2025-11-25T04:54:51Z) - dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
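A diffusion-based policy samples actions by iteratively denoising from Gaussian noise. A toy DDPM-style reverse loop over a 7-DoF action vector (stub noise predictor; not the dVLA model):

```python
# Minimal reverse-diffusion sampling loop for a single action vector.
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(a_t, t, context):
    """Placeholder for the learned noise predictor (vision/language conditioned)."""
    return 0.1 * a_t

a = rng.standard_normal(7)               # start from pure noise
context = None                           # would hold image/language features
for t in reversed(range(T)):
    eps = eps_model(a, t, context)
    a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        a += np.sqrt(betas[t]) * rng.standard_normal(7)
print(a)                                 # denoised action sample
```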
arXiv Detail & Related papers (2025-09-30T02:36:11Z) - EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering
Large language model (LLM) steering has emerged as a promising paradigm for controlling model behavior at inference time. We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM.
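Steering typically means adding a direction vector to a layer's hidden activations at inference time. A minimal PyTorch forward-hook sketch on a toy model (EasySteer itself integrates with vLLM):

```python
# Add a fixed steering vector to one layer's output via a forward hook.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
steer = 0.5 * torch.randn(32)            # stand-in for a learned steering vector

def add_steering(module, inputs, output):
    return output + steer                # returned value replaces the output

handle = model[0].register_forward_hook(add_steering)
steered = model(torch.randn(4, 16))      # steered forward pass
handle.remove()                          # restore unsteered behavior
```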
arXiv Detail & Related papers (2025-09-29T17:59:07Z) - Real-Time Detection and Tracking of Foreign Object Intrusions in Power Systems via Feature-Based Edge Intelligence
This paper presents a novel framework for real-time foreign object intrusion (FOI) detection and tracking in power transmission systems. The framework integrates: (1) a YOLOv7 segmentation model for fast and robust object localization, (2) a ConvNeXt-based feature extractor trained with triplet loss to generate discriminative embeddings, and (3) a feature-assisted IoU tracker. To enable scalable field deployment, the pipeline is optimized for deployment on low-cost edge hardware using mixed-precision inference.
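A feature-assisted IoU tracker scores track-detection pairs by combining box overlap with embedding similarity. A toy scoring function under that reading (not the paper's exact formulation):

```python
# Weighted sum of box IoU and cosine similarity between appearance embeddings.
import numpy as np

def iou(a, b):
    """IoU of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_score(track, det, w: float = 0.5) -> float:
    cos = float(np.dot(track["emb"], det["emb"]) /
                (np.linalg.norm(track["emb"]) * np.linalg.norm(det["emb"]) + 1e-9))
    return w * iou(track["box"], det["box"]) + (1.0 - w) * cos

track = {"box": (10, 10, 50, 50), "emb": np.array([1.0, 0.0])}
det = {"box": (12, 11, 52, 49), "emb": np.array([0.9, 0.1])}
print(match_score(track, det))           # high score -> likely the same object
```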
arXiv Detail & Related papers (2025-09-16T17:17:03Z) - ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
ForceVLA is a novel end-to-end manipulation framework. It treats external force sensing as a first-class modality within VLA systems.
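One illustration of force as a first-class input: route fused vision+force features through a gated mixture of experts. A toy module under that assumption (dimensions and gating are invented, not the ForceVLA design):

```python
# Mixture-of-experts over concatenated vision and force features.
import torch
import torch.nn as nn

class ForceGatedMoE(nn.Module):
    def __init__(self, vis_dim=64, force_dim=6, n_experts=4, out_dim=7):
        super().__init__()
        self.gate = nn.Linear(vis_dim + force_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Linear(vis_dim + force_dim, out_dim) for _ in range(n_experts))

    def forward(self, vis, force):
        x = torch.cat([vis, force], dim=-1)        # fuse both modalities
        w = torch.softmax(self.gate(x), dim=-1)    # force-aware routing weights
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, out, E)
        return (outs * w.unsqueeze(1)).sum(dim=-1) # weighted expert mixture

moe = ForceGatedMoE()
action = moe(torch.randn(2, 64), torch.randn(2, 6))  # (2, 7) action batch
```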
arXiv Detail & Related papers (2025-05-28T09:24:25Z) - OpenVLA: An Open-Source Vision-Language-Action Model
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
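Per the model card on the Hugging Face Hub, OpenVLA can be loaded through transformers; treat the snippet below as a sketch and check the repository for the current interface:

```python
# Loading OpenVLA and predicting a 7-DoF action (follows the model card).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("observation.jpg").convert("RGB")   # current camera frame
prompt = "In: What action should the robot take to pick up the cup?\nOut:"
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# unnorm_key picks the dataset statistics used to un-normalize the action.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```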
arXiv Detail & Related papers (2024-06-13T15:46:55Z)