Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics
- URL: http://arxiv.org/abs/2404.09245v1
- Date: Sun, 14 Apr 2024 13:14:13 GMT
- Title: Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics
- Authors: Haosong Peng, Wei Feng, Hao Li, Yufeng Zhan, Qihua Zhou, Yuanqing Xia
- Abstract summary: We introduce Arena, an end-to-end edge-assisted video inference acceleration system based on the Vision Transformer (ViT).
Our findings reveal that Arena can boost inference speeds by up to $1.58\times$ and $1.82\times$ on average while consuming only 54% and 34% of the bandwidth, respectively, all with high inference accuracy.
- Score: 19.874783636389065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architectures (e.g., CNNs, RNNs), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption, but show inferior performance in adverse environments. Recently, visual foundation models based on transformers have shown great performance in adverse environments due to their strong generalization capability. However, they require a large amount of computation, which limits their application in real-time intelligent video analytics. In this paper, we find that visual foundation models like the Vision Transformer (ViT) also admit a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We leverage ViT's ability to be accelerated through token pruning, offloading and feeding only Patches-of-Interest (PoIs) to the downstream models. Additionally, we employ probability-based patch sampling, a simple but efficient mechanism for determining PoIs from the probable locations of objects in subsequent frames. Through extensive evaluations on public datasets, our findings reveal that Arena can boost inference speeds by up to $1.58\times$ and $1.82\times$ on average while consuming only 54% and 34% of the bandwidth, respectively, all with high inference accuracy.
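To make the PoI mechanism concrete, here is a minimal sketch of probability-based patch selection followed by token pruning for a ViT encoder, assuming a per-patch object-probability map propagated from earlier frames. The map construction, threshold, tensor shapes, and function names are illustrative assumptions, not Arena's actual implementation.

```python
import torch

def select_pois(prob_map: torch.Tensor, threshold: float = 0.3) -> torch.Tensor:
    """Pick Patch-of-Interest indices from a per-patch probability map.

    prob_map: (H_p, W_p) probability that an object occupies each patch,
    e.g. propagated from the previous frame's detections (an assumption here).
    """
    return (prob_map.flatten() >= threshold).nonzero(as_tuple=True)[0]

def prune_and_encode(patch_tokens, pos_embed, poi_idx, encoder):
    """Keep only PoI tokens (with their positional embeddings) and run the ViT.

    patch_tokens: (1, N, D) patch embeddings of the current frame
    pos_embed:    (1, N, D) positional embeddings
    poi_idx:      (K,) indices returned by select_pois
    encoder:      any transformer encoder taking (1, K, D) tokens
    """
    tokens = patch_tokens[:, poi_idx, :] + pos_embed[:, poi_idx, :]
    return encoder(tokens)  # the downstream model sees only K << N tokens

# Toy usage: a 14x14 patch grid where a coarse prior marks a region of interest.
prob_map = torch.zeros(14, 14)
prob_map[4:9, 6:11] = 0.8  # hypothetical object prior from earlier frames
poi_idx = select_pois(prob_map)
patch_tokens = torch.randn(1, 14 * 14, 768)
pos_embed = torch.randn(1, 14 * 14, 768)
layer = torch.nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
features = prune_and_encode(patch_tokens, pos_embed, poi_idx, encoder)
print(features.shape)  # torch.Size([1, 25, 768]) -- only PoI tokens were encoded
```

In the edge-assisted setting, only the selected patches need to be offloaded, which is where the reported bandwidth savings would come from; the sketch shows only the token-pruning side of the compute path.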
Related papers
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
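The two-stage pipeline summarized in the TAPIR entry above (independent per-frame matching followed by local refinement) can be sketched as a nearest-neighbour toy version over precomputed dense features. This is not TAPIR's learned model; the feature shapes, window radius, and function names are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def match_stage(query_feat, frame_feats):
    """Stage 1: independent per-frame matching.

    query_feat:  (C,) feature at the query point
    frame_feats: (T, C, H, W) dense features for each frame
    Returns a coarse (T, 2) track of (x, y) grid positions via a per-frame
    argmax over cosine-similarity heatmaps.
    """
    T, C, H, W = frame_feats.shape
    q = F.normalize(query_feat, dim=0).view(1, C, 1, 1)
    heat = (F.normalize(frame_feats, dim=1) * q).sum(dim=1)  # (T, H, W)
    flat = heat.view(T, -1).argmax(dim=1)
    ys = torch.div(flat, W, rounding_mode="floor")
    xs = flat % W
    return torch.stack([xs, ys], dim=1).float()

def refine_stage(track, frame_feats, query_feat, radius=2):
    """Stage 2: nudge each estimate using local correlations around it
    (a single nearest-neighbour pass standing in for the learned refiner)."""
    T, C, H, W = frame_feats.shape
    q = F.normalize(query_feat, dim=0)
    refined = track.clone()
    for t in range(T):
        x, y = int(track[t, 0]), int(track[t, 1])
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        local = F.normalize(frame_feats[t, :, y0:y1, x0:x1], dim=0)
        corr = (local * q.view(C, 1, 1)).sum(dim=0)  # (h, w) window correlations
        dy, dx = divmod(int(corr.argmax()), corr.shape[1])
        refined[t] = torch.tensor([x0 + dx, y0 + dy], dtype=torch.float)
    return refined

# Toy usage on random features.
frame_feats = torch.randn(8, 64, 32, 32)
query_feat = frame_feats[0, :, 16, 16]
track = refine_stage(match_stage(query_feat, frame_feats), frame_feats, query_feat)
print(track.shape)  # torch.Size([8, 2])
```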
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
- Streaming Video Analytics On The Edge With Asynchronous Cloud Support [2.7456483236562437]
We propose a novel edge-cloud fusion algorithm that fuses edge and cloud predictions, achieving low latency and high accuracy.
We focus on object detection in videos (applicable in many video analytics scenarios) and show that the fused edge-cloud predictions can outperform the accuracy of edge-only and cloud-only scenarios by as much as 50%.
arXiv Detail & Related papers (2022-10-04T06:22:13Z)
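As a rough illustration of the edge-cloud fusion idea summarized above, the following IoU-based merge keeps fresh edge boxes (boosting their scores when a delayed cloud detection agrees) and adds cloud-only boxes for extra recall. The merge rule, thresholds, and detection format are assumptions, not the paper's actual fusion algorithm, which must also handle the temporal misalignment of asynchronous cloud results.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def fuse(edge_dets, cloud_dets, iou_thresh=0.5, cloud_bonus=0.1):
    """Fuse fresh edge detections with delayed cloud detections.

    Each detection is (box, score, label). Where the two agree, the edge box is
    kept with a boosted score; unmatched cloud boxes are added for extra recall.
    """
    fused, matched = [], set()
    for e_box, e_score, e_label in edge_dets:
        score = e_score
        for j, (c_box, c_score, c_label) in enumerate(cloud_dets):
            if j not in matched and c_label == e_label and iou(e_box, c_box) >= iou_thresh:
                matched.add(j)
                score = min(1.0, e_score + cloud_bonus * c_score)
                break
        fused.append((e_box, score, e_label))
    fused += [d for j, d in enumerate(cloud_dets) if j not in matched]
    return fused

# Toy usage: one agreeing box and one object only the cloud model found.
edge = [((10, 10, 50, 60), 0.55, "car")]
cloud = [((12, 11, 52, 62), 0.90, "car"), ((100, 80, 140, 120), 0.85, "person")]
print(fuse(edge, cloud))
```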
- Efficient Heterogeneous Video Segmentation at the Edge [2.4378845585726903]
We introduce an efficient video segmentation system for resource-limited edge devices leveraging heterogeneous compute.
Specifically, we design network models by searching across multiple dimensions of specifications for the neural architectures.
We analyze and optimize the heterogeneous data flows in our systems across the CPU, the GPU and the NPU.
arXiv Detail & Related papers (2022-08-24T17:01:09Z)
- Patch-based Object-centric Transformers for Efficient Video Generation [71.55412580325743]
We present Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture.
We build upon prior work in video prediction via an autoregressive transformer over the discrete latent space of compressed videos.
Due to the better compressibility of object-centric representations, we can improve training efficiency by allowing the model to access only object information over longer temporal horizons.
arXiv Detail & Related papers (2022-06-08T16:29:59Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception is proposed to jointly evaluate latency and accuracy with a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
arXiv Detail & Related papers (2022-03-23T11:33:27Z)
- Towards Unsupervised Fine-Tuning for Edge Video Analytics [1.1091582432763736]
We propose a method for improving the accuracy of edge models without any extra compute cost by means of automatic model specialization.
Results show that our method can automatically improve the accuracy of pre-trained models by an average of 21%.
arXiv Detail & Related papers (2021-04-14T12:57:40Z)
- ApproxDet: Content and Contention-Aware Approximate Object Detection for Mobiles [19.41234144545467]
We introduce ApproxDet, an adaptive video object detection framework for mobile devices to meet accuracy-latency requirements.
We evaluate ApproxDet on a large benchmark video dataset and compare quantitatively to AdaScale and YOLOv3.
We find that ApproxDet is able to adapt to a wide variety of contention and content characteristics and outshines all baselines.
arXiv Detail & Related papers (2020-10-21T04:11:05Z)
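The accuracy-latency adaptation idea in the ApproxDet entry above can be caricatured as choosing, at each scheduling step, the most accurate detection configuration whose contention-adjusted latency still fits the budget. The branch definitions and selection rule below are hypothetical; the actual system additionally conditions on video content characteristics.

```python
from dataclasses import dataclass

@dataclass
class Branch:
    name: str
    est_latency_ms: float  # profiled on-device latency for this configuration
    est_accuracy: float    # profiled accuracy proxy (e.g., validation mAP)

def pick_branch(branches, latency_budget_ms, contention_factor=1.0):
    """Pick the most accurate configuration whose contention-adjusted latency
    fits the budget; fall back to the fastest branch if none fits."""
    feasible = [b for b in branches
                if b.est_latency_ms * contention_factor <= latency_budget_ms]
    if not feasible:
        return min(branches, key=lambda b: b.est_latency_ms)
    return max(feasible, key=lambda b: b.est_accuracy)

# Toy usage: detection-every-frame vs. cheaper detect-then-track variants.
branches = [
    Branch("det_640_every_frame", 95.0, 0.62),
    Branch("det_480_track_every_2", 55.0, 0.57),
    Branch("det_320_track_every_4", 30.0, 0.49),
]
print(pick_branch(branches, latency_budget_ms=60.0, contention_factor=1.3).name)
```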
This list is automatically generated from the titles and abstracts of the papers on this site.