Vision-Language Models on the Edge for Real-Time Robotic Perception
- URL: http://arxiv.org/abs/2601.14921v1
- Date: Wed, 21 Jan 2026 12:09:48 GMT
- Title: Vision-Language Models on the Edge for Real-Time Robotic Perception
- Authors: Sarat Ahmad, Maryam Hafeez, Syed Ali Raza Zaidi
- Abstract summary: Edge intelligence within 6G, particularly Open RAN and Multi-access Edge Computing, offers a pathway to address these challenges. This work investigates the deployment of Vision-Language Models on ORAN/MEC infrastructure using the Unitree G1 humanoid robot as an embodied testbed. Our results show that edge deployment preserves near-cloud accuracy while reducing end-to-end latency by 5%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) enable multimodal reasoning for robotic perception and interaction, but their deployment in real-world systems remains constrained by latency, limited onboard resources, and privacy risks of cloud offloading. Edge intelligence within 6G, particularly Open RAN and Multi-access Edge Computing (MEC), offers a pathway to address these challenges by bringing computation closer to the data source. This work investigates the deployment of VLMs on ORAN/MEC infrastructure using the Unitree G1 humanoid robot as an embodied testbed. We design a WebRTC-based pipeline that streams multimodal data to an edge node and evaluate LLaMA-3.2-11B-Vision-Instruct deployed at the edge versus in the cloud under real-time conditions. Our results show that edge deployment preserves near-cloud accuracy while reducing end-to-end latency by 5%. We further evaluate Qwen2-VL-2B-Instruct, a compact model optimized for resource-constrained environments, which achieves sub-second responsiveness, cutting latency by more than half but at the cost of accuracy.
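The edge-versus-cloud comparison described in the abstract boils down to timing one image+prompt round trip against each backend. A minimal sketch, with stub backends standing in for the real WebRTC stream and model servers (the delays and responses below are illustrative, not the paper's measurements):

```python
# Hypothetical sketch (not the paper's code): time one image+prompt
# round trip against an edge and a cloud backend. The stub backends
# simulate network/inference delay; a real pipeline would stream
# frames over WebRTC to the deployed VLM instead.
import time

def timed_inference(infer, image, prompt):
    """Return (answer, end-to-end latency in seconds) for one request."""
    start = time.perf_counter()
    answer = infer(image, prompt)
    return answer, time.perf_counter() - start

def edge_infer(image, prompt):
    time.sleep(0.05)   # simulated edge round trip
    return "a person walking"

def cloud_infer(image, prompt):
    time.sleep(0.08)   # simulated cloud round trip (extra WAN hops)
    return "a person walking"

frame, question = b"<jpeg bytes>", "What is in front of the robot?"
ans_edge, t_edge = timed_inference(edge_infer, frame, question)
ans_cloud, t_cloud = timed_inference(cloud_infer, frame, question)
print(f"edge {t_edge:.3f}s vs cloud {t_cloud:.3f}s")
```

The same harness applies to both backends, so the measured gap isolates the transport and serving path rather than the model itself.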
Related papers
- SC-MII: Infrastructure LiDAR-based 3D Object Detection on Edge Devices for Split Computing with Multiple Intermediate Outputs Integration [1.1761374316223123]
3D object detection using LiDAR-based point cloud data and deep neural networks is essential in autonomous driving technology. Deploying state-of-the-art models on edge devices presents challenges due to high computational demands and energy consumption. This paper proposes SC-MII: infrastructure LiDAR-based 3D object detection on edge devices for Split Computing with Multiple Intermediate Outputs Integration.
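Split computing, as summarized above, partitions a network between device and edge and merges several intermediate outputs. A minimal sketch under assumed names and shapes (not the SC-MII implementation):

```python
# Illustrative split-computing sketch: the device runs the head of a
# network, ships intermediate tensors to an edge server that runs the
# tail, and multiple intermediate outputs are fused. All functions
# are cheap stand-ins for real network stages.
import numpy as np

def device_head(points: np.ndarray) -> np.ndarray:
    """On-device stage: cheap feature extraction from raw points."""
    return points.mean(axis=0, keepdims=True)  # stand-in for a conv head

def edge_tail(feature: np.ndarray) -> np.ndarray:
    """Edge stage: the heavy part of the network."""
    return feature * 2.0  # stand-in for the detection tail

def integrate(outputs: list) -> np.ndarray:
    """Fuse multiple intermediate outputs (here: simple averaging)."""
    return np.mean(np.stack(outputs), axis=0)

clouds = [np.ones((100, 3)), 3 * np.ones((100, 3))]  # two LiDAR views
fused = integrate([edge_tail(device_head(pc)) for pc in clouds])
print(fused)  # -> [[4. 4. 4.]]
```

The split point trades device compute against uplink bandwidth: the smaller the intermediate tensor, the cheaper the transfer to the edge.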
arXiv Detail & Related papers (2026-01-12T01:17:01Z)
- Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection? [57.000348519630286]
Recent advances in mobile edge computing have made it possible to offload compute-intensive object detection to edge servers equipped with high-accuracy neural networks. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. We propose LTED-Ada in the single-device setting, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection.
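The track-or-detect decision can be illustrated as a two-action reinforcement-learning problem. The paper uses deep RL; this tabular toy, with made-up rewards encoding an accuracy/latency trade-off, only shows the action space and learning loop:

```python
# Toy Q-learning over two actions: cheap local tracking vs. accurate
# but slower edge detection. Rewards are hypothetical, not from the
# paper; LTED-Ada itself uses a deep RL policy over richer state.
import random
random.seed(0)

ACTIONS = ["local_track", "edge_detect"]
Q = {a: 0.0 for a in ACTIONS}
ALPHA, EPS = 0.1, 0.1

def reward(action: str) -> float:
    # Hypothetical: accuracy minus a latency penalty, plus noise.
    base = {"local_track": 0.6, "edge_detect": 0.8}[action]
    return base + random.uniform(-0.05, 0.05)

for _ in range(2000):
    # Epsilon-greedy action selection, then a running-average update.
    a = random.choice(ACTIONS) if random.random() < EPS else max(Q, key=Q.get)
    Q[a] += ALPHA * (reward(a) - Q[a])

print(max(Q, key=Q.get))  # best action after training
```

With these stand-in rewards the policy learns to prefer edge detection; a different latency penalty would tip the balance the other way.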
arXiv Detail & Related papers (2025-11-25T04:54:51Z)
- Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z)
- Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits [11.398891065175686]
Large Language Models (LLMs) enable various applications on edge devices such as smartphones, wearables, and embodied robots. LLMs can be deployed on-device, offering a cost-effective solution with reduced latency and improved privacy. We propose a fast and cost-effective speculative edge-cloud decoding framework with a large target model on the server and a small draft model on the device.
arXiv Detail & Related papers (2025-05-27T14:55:16Z)
- Generative AI on the Edge: Architecture and Performance Evaluation [0.3999851878220877]
6G's AI-native vision of embedding advanced intelligence in the network requires a systematic evaluation of Generative AI (GenAI) models on edge devices. This research investigates computationally demanding Large Language Model (LLM) inference on a single commodity Raspberry Pi serving as an edge testbed for ORAN.
arXiv Detail & Related papers (2024-11-18T16:09:01Z)
- Streaming Video Analytics On The Edge With Asynchronous Cloud Support [2.7456483236562437]
We propose a novel edge-cloud fusion algorithm that fuses edge and cloud predictions, achieving low latency and high accuracy.
We focus on object detection in videos (applicable in many video analytics scenarios) and show that the fused edge-cloud predictions can outperform the accuracy of edge-only and cloud-only scenarios by as much as 50%.
arXiv Detail & Related papers (2022-10-04T06:22:13Z)
- Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs [74.83613252825754]
"Smart ecosystems" are being formed in which sensing happens concurrently rather than in isolation.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling scheme that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes.
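The early-exit mechanism that such scheduling builds on can be sketched as follows; the per-block confidences here are precomputed stand-ins for running the real network blocks:

```python
# Minimal early-exit sketch: a model with intermediate classifiers
# returns as soon as confidence clears a threshold, so easy inputs
# finish early. Preemptive scheduling (as in the paper) exploits the
# fact that a request can stop at any exit; only the exit mechanism
# is shown here.
def early_exit_infer(confidences, threshold=0.9):
    """confidences[i]: classifier confidence after block i+1
    (stand-in for actually executing the blocks)."""
    for depth, conf in enumerate(confidences, start=1):
        if conf >= threshold:
            return depth, conf                    # exited early
    return len(confidences), confidences[-1]      # ran the full network

print(early_exit_infer([0.55, 0.93, 0.99]))  # easy input: exits at block 2
print(early_exit_infer([0.4, 0.6, 0.7]))     # hard input: runs all 3 blocks
```

Because per-request depth is unknown in advance, an exit-aware scheduler must decide at run time which queued request to preempt or continue.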
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
- EffCNet: An Efficient CondenseNet for Image Classification on NXP BlueBox [0.0]
Edge devices offer limited processing power due to their inexpensive hardware and constrained cooling and computational resources.
We propose a novel deep convolutional neural network architecture called EffCNet for edge devices.
arXiv Detail & Related papers (2021-11-28T21:32:31Z)
- Computational Intelligence and Deep Learning for Next-Generation Edge-Enabled Industrial IoT [51.68933585002123]
We investigate how to deploy computational intelligence and deep learning (DL) in edge-enabled industrial IoT networks.
In this paper, we propose a novel multi-exit-based federated edge learning (ME-FEEL) framework.
In particular, the proposed ME-FEEL can achieve an accuracy gain of up to 32.7% in industrial IoT networks with severely limited resources.
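The edge-aggregation step underlying such federated frameworks can be sketched as weighted federated averaging; the plain-list weights and client sample counts below are illustrative, and the multi-exit machinery of ME-FEEL is omitted:

```python
# Hedged sketch of federated averaging: each device trains locally,
# and the edge server averages parameters weighted by local sample
# counts. Weights are flat lists purely for illustration.
def fed_avg(client_weights, sizes):
    """Aggregate client parameter vectors, weighted by data volume."""
    total = sum(sizes)
    dims = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, sizes)) / total
        for d in range(dims)
    ]

# Two clients with different data volumes:
global_w = fed_avg([[1.0, 2.0], [3.0, 4.0]], sizes=[10, 30])
print(global_w)  # -> [2.5, 3.5]
```

Weighting by sample count keeps the global model from being skewed toward clients that hold little data.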
arXiv Detail & Related papers (2021-10-28T08:14:57Z)
- SABER: Data-Driven Motion Planner for Autonomously Navigating Heterogeneous Robots [112.2491765424719]
We present an end-to-end online motion planning framework that uses a data-driven approach to navigate a heterogeneous robot team towards a global goal.
We use stochastic model predictive control (SMPC) to calculate control inputs that satisfy robot dynamics, and consider uncertainty during obstacle avoidance with chance constraints.
Recurrent neural networks are used to provide a quick estimate of future state uncertainty considered in the SMPC finite-time horizon solution.
A Deep Q-learning agent is employed to serve as a high-level path planner, providing the SMPC with target positions that move the robots towards a desired global goal.
arXiv Detail & Related papers (2020-07-28T07:34:30Z)
- Risk-Averse MPC via Visual-Inertial Input and Recurrent Networks for Online Collision Avoidance [95.86944752753564]
We propose an online path planning architecture that extends the model predictive control (MPC) formulation to consider future location uncertainties.
Our algorithm combines an object detection pipeline with a recurrent neural network (RNN) which infers the covariance of state estimates.
The robustness of our methods is validated on complex quadruped robot dynamics and can be generally applied to most robotic platforms.
arXiv Detail & Related papers (2020-07-28T07:34:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.