Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics
- URL: http://arxiv.org/abs/2404.09245v2
- Date: Thu, 26 Sep 2024 01:25:22 GMT
- Title: Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics
- Authors: Haosong Peng, Wei Feng, Hao Li, Yufeng Zhan, Ren Jin, Yuanqing Xia
- Abstract summary: We introduce Arena, an end-to-end edge-assisted video inference acceleration system based on the Vision Transformer (ViT).
Our findings reveal that Arena can boost inference speeds by up to 1.58\(\times\) and 1.82\(\times\) on average while consuming only 47% and 31% of the bandwidth, respectively, all with high inference accuracy.
- Score: 18.042752812489276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architectures (e.g., CNNs, RNNs), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption, but show inferior performance in adverse environments. Recently, visual foundation models based on transformers have shown great performance in adverse environments due to their strong generalization capability. However, they require a large amount of computation power, which limits their applications in real-time intelligent video analytics. In this paper, we find that visual foundation models like the Vision Transformer (ViT) also have a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We leverage ViT's amenability to acceleration through token pruning by offloading and feeding only Patches-of-Interest to the downstream models. Additionally, we design an adaptive keyframe inference switching algorithm tailored to different videos, capable of adapting to the current video content to jointly optimize accuracy and bandwidth. Through extensive experiments, our findings reveal that Arena can boost inference speeds by up to 1.58\(\times\) and 1.82\(\times\) on average while consuming only 47\% and 31\% of the bandwidth, respectively, all with high inference accuracy.
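The abstract names two mechanisms concrete enough to sketch: Patch-of-Interest (PoI) selection with ViT token pruning, and adaptive keyframe switching. Below is a minimal PyTorch sketch under our own assumptions (16×16 patches, a mean-absolute-difference change score, a fixed change-ratio switching rule); the function names and thresholds are ours, not Arena's published algorithm.

```python
import torch

def select_patches_of_interest(frame, keyframe, patch=16, thresh=0.05):
    """Score each non-overlapping patch by its mean absolute change vs. the
    keyframe and keep those above a threshold (hypothetical criterion).
    frame, keyframe: C x H x W tensors with values in [0, 1]."""
    diff = (frame - keyframe).abs().mean(dim=0)              # H x W change map
    tiles = diff.unfold(0, patch, patch).unfold(1, patch, patch)
    scores = tiles.mean(dim=(-1, -2)).flatten()              # one score per patch
    return (scores > thresh).nonzero(as_tuple=True)[0]       # PoI indices

def prune_tokens(tokens, poi_idx):
    """Token pruning: keep the CLS token plus only the PoI patch tokens.
    tokens: B x (1 + num_patches) x D output of the ViT patch embedding."""
    keep = torch.cat([poi_idx.new_zeros(1), poi_idx + 1])    # index 0 = CLS
    return tokens[:, keep, :]

def should_refresh_keyframe(poi_idx, num_patches, max_ratio=0.5):
    """Heuristic stand-in for the adaptive switching decision: when too much
    of the frame has changed, re-run full inference on this frame and
    promote it to be the new keyframe."""
    return poi_idx.numel() / num_patches > max_ratio
```

Since self-attention cost grows quadratically with token count, pruning to the PoI shrinks both the offloaded payload (bandwidth) and the compute in every downstream transformer layer, which is consistent with the joint speedup and bandwidth savings the abstract reports.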
Related papers
- Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache).
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content.
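The summary is specific enough to sketch the caching idea; the residual-change test and the way motion tightens the reuse threshold below are our reconstruction, not AdaCache's actual schedule.

```python
import torch

class AdaptiveBlockCache:
    """Toy version of adaptive caching for a DiT block: reuse the block's
    cached output across diffusion steps while its input barely changes;
    a motion score shrinks the reuse threshold (MoReg-like), so fast-moving
    content is recomputed more often."""
    def __init__(self, block, base_thresh=0.05):
        self.block, self.base_thresh = block, base_thresh
        self.prev_in = self.prev_out = None

    def __call__(self, x, motion_score=0.0):
        thresh = self.base_thresh / (1.0 + motion_score)     # more motion, less reuse
        if self.prev_in is not None:
            change = (x - self.prev_in).norm() / self.prev_in.norm()
            if change < thresh:
                return self.prev_out                         # reuse cached output
        self.prev_in, self.prev_out = x, self.block(x)       # recompute and cache
        return self.prev_out
```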
arXiv Detail & Related papers (2024-11-04T18:59:44Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
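For context on the paradigm being critiqued, here is a minimal sketch of shallow temporal fusion: frames pass through a frozen image-text encoder independently and a single temporal attention layer fuses them. The class, its names, and the single-layer choice are our illustration.

```python
import torch.nn as nn

class ShallowTemporalFusion(nn.Module):
    """Encode frames independently with an image model, then fuse with one
    temporal attention layer. image_encoder: any frozen per-frame encoder
    mapping (N x C x H x W) -> (N x D); dim = D."""
    def __init__(self, image_encoder, dim, heads=8):
        super().__init__()
        self.encoder = image_encoder
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames):                       # frames: B x T x C x H x W
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))   # per-frame features
        feats = feats.view(b, t, -1)                 # B x T x D
        fused, _ = self.temporal(feats, feats, feats)
        return fused.mean(dim=1)                     # one video embedding
```

Limitation (2) is visible directly in the sketch: all T frames must be encoded and held in memory at once, which caps the usable clip length.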
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception jointly evaluates latency and accuracy with a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
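The single-metric idea can be illustrated: at each ground-truth timestamp, score whichever prediction the detector has already finished, so any latency directly lowers the reported accuracy. The pairing helper below is our illustration, not the benchmark's reference code.

```python
def streaming_pairs(pred_times, gt_times):
    """For each ground-truth timestamp, return the index of the latest
    prediction completed no later than that instant (-1 if none yet).
    Both lists are assumed sorted in ascending order."""
    pairs, j = [], -1
    for t in gt_times:
        while j + 1 < len(pred_times) and pred_times[j + 1] <= t:
            j += 1
        pairs.append(j)
    return pairs

# A slow but accurate detector is penalized: its fresh predictions arrive
# too late and stale ones are scored against newer ground truth.
print(streaming_pairs([0.10, 0.45, 0.80], [0.0, 0.2, 0.4, 0.6, 0.8]))
# -> [-1, 0, 0, 1, 2]
```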
arXiv Detail & Related papers (2022-03-23T11:33:27Z)
- Deformable VisTR: Spatio-temporal deformable attention for video instance segmentation [79.76273774737555]
The video instance segmentation (VIS) task requires segmenting, classifying, and tracking object instances over all frames in a clip.
Recently, VisTR has been proposed as an end-to-end transformer-based VIS framework, demonstrating state-of-the-art performance.
We propose Deformable VisTR, leveraging a spatio-temporal deformable attention module that only attends to a small fixed set of key spatio-temporal sampling points.
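A simplified, token-index sketch of attending to a small fixed set of sampling points (the real module samples continuous spatio-temporal coordinates with bilinear interpolation and is differentiable in the offsets; our rounded version is not):

```python
import torch
import torch.nn as nn

class DeformablePointAttention(nn.Module):
    """Each query attends to only k predicted sampling points instead of all
    n tokens, so cost is O(n*k) rather than O(n^2). Toy stand-in for the
    deformable attention idea, not the Deformable VisTR module itself."""
    def __init__(self, dim, k=4):
        super().__init__()
        self.offsets = nn.Linear(dim, k)   # where each query looks
        self.weights = nn.Linear(dim, k)   # how much it attends there
        self.value = nn.Linear(dim, dim)
        self.k = k

    def forward(self, x):                  # x: B x N x D
        b, n, _ = x.shape
        base = torch.arange(n, device=x.device).view(1, n, 1)
        loc = (base + self.offsets(x).round().long()).clamp(0, n - 1)
        v = self.value(x)
        idx = loc.reshape(b, -1).unsqueeze(-1).expand(-1, -1, v.size(-1))
        sampled = torch.gather(v, 1, idx).view(b, n, self.k, -1)
        w = self.weights(x).softmax(dim=-1).unsqueeze(-1)
        return (w * sampled).sum(dim=2)    # B x N x D
```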
arXiv Detail & Related papers (2022-03-12T02:27:14Z)
- Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, making them invariant to spatio-temporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
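A sketch of the matching objective as we read it: features of a reduced local view should agree with those of the global view of the same video, in a BYOL/SimSiam-style stop-gradient formulation (our choice; the paper's exact loss may differ).

```python
import torch.nn.functional as F

def view_matching_loss(encoder, global_view, local_view):
    """Pull the local view's features toward the (detached) global view's
    features so the representation becomes invariant to the spatial and
    temporal differences between views."""
    z_g = encoder(global_view).detach()    # stop-gradient target
    z_l = encoder(local_view)
    return -F.cosine_similarity(z_l, z_g, dim=-1).mean()
```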
arXiv Detail & Related papers (2021-12-02T18:59:02Z)
- VidTr: Video Transformer Without Convolutions [32.710988574799735]
We introduce the Video Transformer (VidTr) with separable attention for spatio-temporal video classification.
VidTr is able to aggregate spatio-temporal information via stacked attentions and provides better performance with higher efficiency.
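Separable attention can be sketched directly: attend over time, then over space, replacing one O((T·S)^2) pass with two much cheaper ones. The module below is our generic illustration, not VidTr's exact block.

```python
import torch.nn as nn

class SeparableAttention(nn.Module):
    """Factorized spatio-temporal attention: O(T^2*S + S^2*T) instead of
    O((T*S)^2) for full joint attention over all video tokens."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                  # x: B x T x S x D
        b, t, s, d = x.shape
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)    # attend over time
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
        xs = x.reshape(b * t, s, d)                        # attend over space
        xs, _ = self.space_attn(xs, xs, xs)
        return xs.reshape(b, t, s, d)
```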
arXiv Detail & Related papers (2021-04-23T17:59:01Z)
- Towards Unsupervised Fine-Tuning for Edge Video Analytics [1.1091582432763736]
We propose a method for improving the accuracy of edge models without any extra compute cost by means of automatic model specialization.
Results show that our method can automatically improve the accuracy of pre-trained models by an average of 21%.
arXiv Detail & Related papers (2021-04-14T12:57:40Z)
- ApproxDet: Content and Contention-Aware Approximate Object Detection for Mobiles [19.41234144545467]
We introduce ApproxDet, an adaptive video object detection framework for mobile devices to meet accuracy-latency requirements.
We evaluate ApproxDet on a large benchmark video dataset and compare quantitatively to AdaScale and YOLOv3.
We find that ApproxDet is able to adapt to a wide variety of contention and content characteristics and outshines all baselines.
arXiv Detail & Related papers (2020-10-21T04:11:05Z)
- Real-Time Video Inference on Edge Devices via Adaptive Model Streaming [9.101956442584251]
Real-time video inference on edge devices like mobile phones and drones is challenging due to the high cost of Deep Neural Networks.
We present Adaptive Model Streaming (AMS), a new approach to improving performance of efficient lightweight models for video inference on edge devices.
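The summary suggests a simple control loop: the edge runs a lightweight model while a server periodically distills a heavyweight model's outputs on recent frames into it and streams fresh weights back. The single-process toy below is our reading of that idea, not the AMS system.

```python
import torch
import torch.nn.functional as F

def ams_loop(light_model, heavy_model, frames, every=64, lr=1e-4):
    """Toy adaptive-model-streaming loop: periodic server-side distillation
    of heavy_model into light_model; in the real system the updated weights
    would be streamed back to the edge device over the network."""
    opt = torch.optim.SGD(light_model.parameters(), lr=lr)
    for i, frame in enumerate(frames):
        _ = light_model(frame)                   # real-time edge inference
        if i % every == 0:                       # periodic adaptation step
            with torch.no_grad():
                target = heavy_model(frame)      # teacher output as label
            loss = F.mse_loss(light_model(frame), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return light_model
```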
arXiv Detail & Related papers (2020-06-11T17:25:44Z)
- Scene-Adaptive Video Frame Interpolation via Meta-Learning [54.87696619177496]
We propose to adapt the model to each video by making use of additional information that is readily available at test time.
We obtain significant performance gains with only a single gradient update without any additional parameters.
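The single-gradient-update idea can be sketched: the test video supervises its own adaptation by holding out a middle frame from a triplet, taking one gradient step, and then interpolating. The interface model(f0, f2) -> f1 is our assumption.

```python
import torch
import torch.nn.functional as F

def adapt_then_interpolate(model, f0, f1, f2, lr=1e-4):
    """One-step test-time adaptation: fit the model to this video's motion
    using a held-out middle frame (f1), then interpolate. f0, f1, f2 are
    consecutive frames of shape B x C x H x W."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = F.l1_loss(model(f0, f2), f1)          # self-supervision from the video
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():                        # single SGD update, no new params
        for p, g in zip(params, grads):
            p.sub_(lr * g)
    return model(f0, f2)                         # interpolation after adaptation
```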
arXiv Detail & Related papers (2020-04-02T02:46:44Z)