Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics
- URL: http://arxiv.org/abs/2404.09245v1
- Date: Sun, 14 Apr 2024 13:14:13 GMT
- Title: Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics
- Authors: Haosong Peng, Wei Feng, Hao Li, Yufeng Zhan, Qihua Zhou, Yuanqing Xia
- Abstract summary: We introduce Arena, an end-to-end edge-assisted video inference acceleration system based on the Vision Transformer (ViT).
Our findings reveal that Arena can boost inference speeds by up to $1.58\times$ and $1.82\times$ on average while consuming only 54% and 34% of the bandwidth, respectively, all with high inference accuracy.
- Score: 19.874783636389065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architectures (e.g., CNNs, RNNs), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption, but show inferior performance in adverse environments. Recently, visual foundation models based on transformers have shown great performance in adverse environments due to their strong generalization capability. However, they require a large amount of computation, which limits their application in real-time intelligent video analytics. In this paper, we find that visual foundation models like the Vision Transformer (ViT) also admit a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We leverage ViT's ability to be accelerated through token pruning, offloading and feeding only Patches-of-Interest (PoIs) to the downstream models. Additionally, we employ probability-based patch sampling, a simple but efficient mechanism for determining PoIs from the probable locations of objects in subsequent frames. Through extensive evaluations on public datasets, our findings reveal that Arena can boost inference speeds by up to $1.58\times$ and $1.82\times$ on average while consuming only 54% and 34% of the bandwidth, respectively, all with high inference accuracy.
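To make the PoI mechanism concrete, here is a minimal sketch of probability-based patch selection followed by token pruning for a ViT encoder, assuming a per-patch object-probability map propagated from earlier frames. The map construction, threshold, tensor shapes, and function names are illustrative assumptions, not Arena's actual implementation.

```python
import torch

def select_pois(prob_map: torch.Tensor, threshold: float = 0.3) -> torch.Tensor:
    """Pick Patch-of-Interest indices from a per-patch probability map.

    prob_map: (H_p, W_p) probability that an object occupies each patch,
    e.g. propagated from the previous frame's detections (an assumption here).
    """
    return (prob_map.flatten() >= threshold).nonzero(as_tuple=True)[0]

def prune_and_encode(patch_tokens, pos_embed, poi_idx, encoder):
    """Keep only PoI tokens (with their positional embeddings) and run the ViT.

    patch_tokens: (1, N, D) patch embeddings of the current frame
    pos_embed:    (1, N, D) positional embeddings
    poi_idx:      (K,) indices returned by select_pois
    encoder:      any transformer encoder taking (1, K, D) tokens
    """
    tokens = patch_tokens[:, poi_idx, :] + pos_embed[:, poi_idx, :]
    return encoder(tokens)  # the downstream model sees only K << N tokens

# Toy usage: a 14x14 patch grid where a coarse prior marks a region of interest.
prob_map = torch.zeros(14, 14)
prob_map[4:9, 6:11] = 0.8  # hypothetical object prior from earlier frames
poi_idx = select_pois(prob_map)
patch_tokens = torch.randn(1, 14 * 14, 768)
pos_embed = torch.randn(1, 14 * 14, 768)
layer = torch.nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
features = prune_and_encode(patch_tokens, pos_embed, poi_idx, encoder)
print(features.shape)  # torch.Size([1, 25, 768]) -- only PoI tokens were encoded
```

In the edge-assisted setting, only the selected patches need to be offloaded, which is where the reported bandwidth savings would come from; the sketch shows only the token-pruning side of the compute path.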
Related papers
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
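The two-stage pipeline summarized in the TAPIR entry above (independent per-frame matching followed by local refinement) can be sketched as a nearest-neighbour toy version over precomputed dense features. This is not TAPIR's learned model; the feature shapes, window radius, and function names are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def match_stage(query_feat, frame_feats):
    """Stage 1: independent per-frame matching.

    query_feat:  (C,) feature at the query point
    frame_feats: (T, C, H, W) dense features for each frame
    Returns a coarse (T, 2) track of (x, y) grid positions via a per-frame
    argmax over cosine-similarity heatmaps.
    """
    T, C, H, W = frame_feats.shape
    q = F.normalize(query_feat, dim=0).view(1, C, 1, 1)
    heat = (F.normalize(frame_feats, dim=1) * q).sum(dim=1)  # (T, H, W)
    flat = heat.view(T, -1).argmax(dim=1)
    ys = torch.div(flat, W, rounding_mode="floor")
    xs = flat % W
    return torch.stack([xs, ys], dim=1).float()

def refine_stage(track, frame_feats, query_feat, radius=2):
    """Stage 2: nudge each estimate using local correlations around it
    (a single nearest-neighbour pass standing in for the learned refiner)."""
    T, C, H, W = frame_feats.shape
    q = F.normalize(query_feat, dim=0)
    refined = track.clone()
    for t in range(T):
        x, y = int(track[t, 0]), int(track[t, 1])
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        local = F.normalize(frame_feats[t, :, y0:y1, x0:x1], dim=0)
        corr = (local * q.view(C, 1, 1)).sum(dim=0)  # (h, w) window correlations
        dy, dx = divmod(int(corr.argmax()), corr.shape[1])
        refined[t] = torch.tensor([x0 + dx, y0 + dy], dtype=torch.float)
    return refined

# Toy usage on random features.
frame_feats = torch.randn(8, 64, 32, 32)
query_feat = frame_feats[0, :, 16, 16]
track = refine_stage(match_stage(query_feat, frame_feats), frame_feats, query_feat)
print(track.shape)  # torch.Size([8, 2])
```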
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
- Streaming Video Analytics On The Edge With Asynchronous Cloud Support [2.7456483236562437]
We propose a novel edge-cloud fusion algorithm that fuses edge and cloud predictions, achieving low latency and high accuracy.
We focus on object detection in videos (applicable in many video analytics scenarios) and show that the fused edge-cloud predictions can outperform the accuracy of edge-only and cloud-only scenarios by as much as 50%.
arXiv Detail & Related papers (2022-10-04T06:22:13Z)
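As a rough illustration of the edge-cloud fusion idea summarized above, the following IoU-based merge keeps fresh edge boxes (boosting their scores when a delayed cloud detection agrees) and adds cloud-only boxes for extra recall. The merge rule, thresholds, and detection format are assumptions, not the paper's actual fusion algorithm, which must also handle the temporal misalignment of asynchronous cloud results.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def fuse(edge_dets, cloud_dets, iou_thresh=0.5, cloud_bonus=0.1):
    """Fuse fresh edge detections with delayed cloud detections.

    Each detection is (box, score, label). Where the two agree, the edge box is
    kept with a boosted score; unmatched cloud boxes are added for extra recall.
    """
    fused, matched = [], set()
    for e_box, e_score, e_label in edge_dets:
        score = e_score
        for j, (c_box, c_score, c_label) in enumerate(cloud_dets):
            if j not in matched and c_label == e_label and iou(e_box, c_box) >= iou_thresh:
                matched.add(j)
                score = min(1.0, e_score + cloud_bonus * c_score)
                break
        fused.append((e_box, score, e_label))
    fused += [d for j, d in enumerate(cloud_dets) if j not in matched]
    return fused

# Toy usage: one agreeing box and one object only the cloud model found.
edge = [((10, 10, 50, 60), 0.55, "car")]
cloud = [((12, 11, 52, 62), 0.90, "car"), ((100, 80, 140, 120), 0.85, "person")]
print(fuse(edge, cloud))
```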
- Efficient Heterogeneous Video Segmentation at the Edge [2.4378845585726903]
We introduce an efficient video segmentation system for resource-limited edge devices leveraging heterogeneous compute.
Specifically, we design network models by searching across multiple dimensions of specifications for the neural architectures.
We analyze and optimize the heterogeneous data flows in our systems across the CPU, the GPU and the NPU.
arXiv Detail & Related papers (2022-08-24T17:01:09Z)
- Patch-based Object-centric Transformers for Efficient Video Generation [71.55412580325743]
We present Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture.
We build upon prior work in video prediction via an autoregressive transformer over the discrete latent space of compressed videos.
Due to the better compressibility of object-centric representations, we can improve training efficiency by allowing the model to access only object information over longer temporal horizons.
arXiv Detail & Related papers (2022-06-08T16:29:59Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception is proposed to jointly evaluate latency and accuracy with a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
arXiv Detail & Related papers (2022-03-23T11:33:27Z)
- Towards Unsupervised Fine-Tuning for Edge Video Analytics [1.1091582432763736]
We propose a method for improving the accuracy of edge models without any extra compute cost by means of automatic model specialization.
Results show that our method can automatically improve the accuracy of pre-trained models by an average of 21%.
arXiv Detail & Related papers (2021-04-14T12:57:40Z)
- ApproxDet: Content and Contention-Aware Approximate Object Detection for Mobiles [19.41234144545467]
We introduce ApproxDet, an adaptive video object detection framework for mobile devices to meet accuracy-latency requirements.
We evaluate ApproxDet on a large benchmark video dataset and compare quantitatively to AdaScale and YOLOv3.
We find that ApproxDet is able to adapt to a wide variety of contention and content characteristics and outshines all baselines.
arXiv Detail & Related papers (2020-10-21T04:11:05Z)
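The accuracy-latency adaptation idea in the ApproxDet entry above can be caricatured as choosing, at each scheduling step, the most accurate detection configuration whose contention-adjusted latency still fits the budget. The branch definitions and selection rule below are hypothetical; the actual system additionally conditions on video content characteristics.

```python
from dataclasses import dataclass

@dataclass
class Branch:
    name: str
    est_latency_ms: float  # profiled on-device latency for this configuration
    est_accuracy: float    # profiled accuracy proxy (e.g., validation mAP)

def pick_branch(branches, latency_budget_ms, contention_factor=1.0):
    """Pick the most accurate configuration whose contention-adjusted latency
    fits the budget; fall back to the fastest branch if none fits."""
    feasible = [b for b in branches
                if b.est_latency_ms * contention_factor <= latency_budget_ms]
    if not feasible:
        return min(branches, key=lambda b: b.est_latency_ms)
    return max(feasible, key=lambda b: b.est_accuracy)

# Toy usage: detection-every-frame vs. cheaper detect-then-track variants.
branches = [
    Branch("det_640_every_frame", 95.0, 0.62),
    Branch("det_480_track_every_2", 55.0, 0.57),
    Branch("det_320_track_every_4", 30.0, 0.49),
]
print(pick_branch(branches, latency_budget_ms=60.0, contention_factor=1.3).name)
```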
This list is automatically generated from the titles and abstracts of the papers on this site.