ViTMAlis: Towards Latency-Critical Mobile Video Analytics with Vision Transformers
- URL: http://arxiv.org/abs/2601.21362v1
- Date: Thu, 29 Jan 2026 07:43:12 GMT
- Title: ViTMAlis: Towards Latency-Critical Mobile Video Analytics with Vision Transformers
- Authors: Miao Zhang, Guanzhen Wu, Hao Fang, Yifei Zhu, Fangxin Wang, Ruixiao Zhang, Jiangchuan Liu
- Abstract summary: We introduce ViTMAlis, a device-to-edge offloading framework for vision transformers (ViTs). ViTMAlis reduces end-to-end offloading latency while improving user-perceived rendering accuracy. We implement a fully functional prototype of ViTMAlis on commodity mobile and edge devices.
- Score: 28.741078014867323
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Edge-assisted mobile video analytics (MVA) applications are increasingly shifting from using vision models based on convolutional neural networks (CNNs) to those built on vision transformers (ViTs) to leverage their superior global context modeling and generalization capabilities. However, deploying these advanced models in latency-critical MVA scenarios presents significant challenges. Unlike traditional CNN-based offloading paradigms where network transmission is the primary bottleneck, ViT-based systems are constrained by substantial inference delays, particularly for dense prediction tasks where the need for high-resolution inputs exacerbates the inherent quadratic computational complexity of ViTs. To address these challenges, we propose a dynamic mixed-resolution inference strategy tailored for ViT-backboned dense prediction models, enabling flexible runtime trade-offs between speed and accuracy. Building on this, we introduce ViTMAlis, a ViT-native device-to-edge offloading framework that dynamically adapts to network conditions and video content to jointly reduce transmission and inference delays. We implement a fully functional prototype of ViTMAlis on commodity mobile and edge devices. Extensive experiments demonstrate that, compared to state-of-the-art accuracy-centric, content-aware, and latency-adaptive baselines, ViTMAlis significantly reduces end-to-end offloading latency while improving user-perceived rendering accuracy, providing a practical foundation for next-generation mobile intelligence.
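For intuition on the abstract's point that high-resolution inputs exacerbate the quadratic attention cost, the sketch below estimates offloading latency for a few candidate resolutions and picks the largest one that fits a latency budget under a given uplink bandwidth. It is a minimal illustration, not the ViTMAlis implementation; all function names, constants, and the toy latency model are assumptions.

```python
# Minimal sketch (NOT the ViTMAlis implementation): all function names,
# constants, and the toy latency model are illustrative assumptions.
def vit_token_count(resolution: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square input (ignoring the class token)."""
    return (resolution // patch_size) ** 2

def attention_flops(n_tokens: int, dim: int = 768) -> float:
    """Rough per-layer self-attention cost: quadratic in the token count."""
    return 2.0 * n_tokens ** 2 * dim  # QK^T plus attention-weighted V

def estimate_latency_ms(resolution: int, bandwidth_mbps: float,
                        flops_per_ms: float = 5e9, n_layers: int = 12) -> float:
    """Toy end-to-end estimate: frame upload time + edge inference time."""
    frame_bits = resolution * resolution * 3 * 8 * 0.1  # assume ~10:1 compression
    upload_ms = frame_bits / (bandwidth_mbps * 1e3)     # Mbps -> bits per ms
    infer_ms = n_layers * attention_flops(vit_token_count(resolution)) / flops_per_ms
    return upload_ms + infer_ms

def choose_resolution(budget_ms: float, bandwidth_mbps: float,
                      candidates=(1024, 768, 512, 384, 224)) -> int:
    """Pick the largest candidate resolution whose estimate fits the budget."""
    for r in candidates:
        if estimate_latency_ms(r, bandwidth_mbps) <= budget_ms:
            return r
    return candidates[-1]

if __name__ == "__main__":
    for bw in (5.0, 20.0, 100.0):  # uplink bandwidth in Mbps
        print(f"{bw} Mbps -> {choose_resolution(100.0, bw)} px")
```

Under these assumed numbers the selector drops to lower resolutions as bandwidth shrinks, which is the kind of runtime speed-accuracy trade-off the abstract describes.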
Related papers
- VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation [61.82502719679122]
We introduce VLNVerse, a benchmark for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. We propose a novel unified multi-task model capable of addressing all tasks within the benchmark.
arXiv Detail & Related papers (2025-12-22T04:27:26Z) - OneTrack-M: A multitask approach to transformer-based MOT models [0.0]
Multi-Object Tracking (MOT) is a critical problem in computer vision. OneTrack-M is a transformer-based MOT model designed to enhance tracking computational efficiency and accuracy.
arXiv Detail & Related papers (2025-02-06T20:02:06Z) - Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - LinFormer: A Linear-based Lightweight Transformer Architecture For Time-Aware MIMO Channel Prediction [39.12741712294741]
6th generation (6G) mobile networks bring new challenges in supporting high-mobility communications.
We present LinFormer, an innovative channel prediction framework based on a scalable, all-linear, encoder-only Transformer model.
Our approach achieves a substantial reduction in computational complexity while maintaining high prediction accuracy, making it more suitable for deployment in cost-effective base stations (BS).
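As a rough illustration of the "all-linear, encoder-only" idea, the block below mixes information across a fixed-length sequence with plain linear layers instead of softmax attention. This is a generic sketch, not LinFormer's actual architecture; the module name, shapes, and layer sizes are assumptions.

```python
# Illustrative sketch only: NOT the LinFormer architecture from the paper.
# Module name, shapes, and layer sizes are assumptions for demonstration.
import torch
import torch.nn as nn

class LinearTokenMixer(nn.Module):
    """Mixes a fixed-length sequence with plain linear layers (no attention)."""
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        self.mix_tokens = nn.Linear(seq_len, seq_len)  # mixes along the time axis, no softmax
        self.mix_channels = nn.Linear(dim, dim)        # mixes along the feature axis
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        y = self.mix_tokens(x.transpose(1, 2)).transpose(1, 2)
        return x + self.mix_channels(self.norm(x + y))

seq = torch.randn(8, 64, 128)                 # (batch, history length, feature dim)
print(LinearTokenMixer(64, 128)(seq).shape)   # torch.Size([8, 64, 128])
```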
arXiv Detail & Related papers (2024-10-28T13:04:23Z) - Optimizing Vision Transformers with Data-Free Knowledge Transfer [8.323741354066474]
Vision transformers (ViTs) have excelled in various computer vision tasks due to their superior ability to capture long-distance dependencies.
We propose compressing large ViT models using Knowledge Distillation (KD), implemented in a data-free manner to circumvent limitations related to data availability.
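A minimal, generic distillation sketch follows (not the paper's data-free procedure; the function name and temperature value are assumptions): the student matches temperature-softened teacher logits via a KL loss, and a data-free variant would replace real inputs with synthesized ones.

```python
# Generic knowledge-distillation sketch, NOT the paper's data-free procedure.
# Function name and the temperature value are assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float = 4.0):
    """Temperature-scaled KL divergence between softened teacher and student outputs."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: random tensors stand in for teacher/student ViT logits; in a
# data-free setting the inputs producing them would be synthesized, not real.
teacher_out = torch.randn(16, 1000)
student_out = torch.randn(16, 1000, requires_grad=True)
loss = kd_loss(student_out, teacher_out)
loss.backward()
print(float(loss))
```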
arXiv Detail & Related papers (2024-08-12T07:03:35Z) - Rethinking Urban Mobility Prediction: A Super-Multivariate Time Series Forecasting Approach [71.67506068703314]
Long-term urban mobility predictions play a crucial role in the effective management of urban facilities and services.
Traditionally, urban mobility data has been structured as videos, treating longitude and latitude as fundamental pixels.
In our research, we introduce a fresh perspective on urban mobility prediction.
Instead of oversimplifying urban mobility data as traditional video data, we regard it as a complex time series.
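A tiny illustration of that re-framing, under assumed shapes (this is only the data view, not the paper's model): flatten each spatial mobility frame into one observation of a high-dimensional multivariate series.

```python
# Assumed shapes only; this is not the paper's model, just the data re-framing.
import numpy as np

T, H, W = 48, 32, 32                      # 48 time steps over a 32x32 city grid
mobility_video = np.random.rand(T, H, W)  # "video" view: one frame per time step

# Super-multivariate view: each grid cell becomes one series, 1024 in total.
mobility_series = mobility_video.reshape(T, H * W)
print(mobility_series.shape)              # (48, 1024)

# A forecaster then consumes an observation window across all cells at once.
history = mobility_series[-12:]           # (12, 1024) input window
```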
arXiv Detail & Related papers (2023-12-04T07:39:05Z) - Soft Error Reliability Analysis of Vision Transformers [14.132398744731635]
Vision Transformers (ViTs) that leverage self-attention mechanism have shown superior performance on many classical vision tasks.
Existing work on ViTs mainly optimizes performance and accuracy, but ViT reliability issues induced by soft errors have generally been overlooked.
In this work, we study the reliability of ViTs and investigate the vulnerability from different architecture granularities.
arXiv Detail & Related papers (2023-02-21T06:17:40Z) - Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z) - EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z) - When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resulting ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
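The loss-geometry angle can be sketched with a sharpness-aware style update: perturb the weights toward higher loss, then descend using the gradient taken at the perturbed point to prefer flatter minima. This is a generic illustration; the hyperparameters, helper name, and toy model below are assumptions, not the paper's exact training recipe.

```python
# Generic sharpness-aware style update; hyperparameters, helper name, and the
# toy model are assumptions, not the paper's exact training recipe.
import torch

def sharpness_aware_step(model, loss_fn, inputs, targets, optimizer, rho: float = 0.05):
    # First pass: gradient at the current weights.
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    grads = [p.grad.clone() for p in model.parameters()]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    with torch.no_grad():  # climb toward the nearby worst-case weights
        eps = [rho * g / (grad_norm + 1e-12) for g in grads]
        for p, e in zip(model.parameters(), eps):
            p.add_(e)
    optimizer.zero_grad()
    # Second pass: gradient at the perturbed weights.
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():  # undo the perturbation before stepping
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    optimizer.step()       # descend with the sharpness-aware gradient
    optimizer.zero_grad()
    return loss.item()

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
print(sharpness_aware_step(model, torch.nn.functional.cross_entropy, x, y, opt))
```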
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.