ACDnet: An action detection network for real-time edge computing based
on flow-guided feature approximation and memory aggregation
- URL: http://arxiv.org/abs/2102.13493v1
- Date: Fri, 26 Feb 2021 14:06:31 GMT
- Title: ACDnet: An action detection network for real-time edge computing based
on flow-guided feature approximation and memory aggregation
- Authors: Yu Liu, Fan Yang and Dominique Ginhac
- Abstract summary: ACDnet is a compact action detection network targeting real-time edge computing.
It exploits the temporal coherence between successive video frames to approximate CNN features rather than naively extracting them.
It can robustly achieve detection well above real-time (75 FPS)
- Score: 8.013823319651395
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Interpreting human actions requires understanding the spatial and temporal
context of the scenes. State-of-the-art action detectors based on Convolutional
Neural Network (CNN) have demonstrated remarkable results by adopting
two-stream or 3D CNN architectures. However, these methods typically operate in
a non-real-time, offline fashion due to the system complexity required to reason about
spatio-temporal information. Consequently, their high computational cost is not
compliant with emerging real-world scenarios such as service robots or public
surveillance where detection needs to take place at resource-limited edge
devices. In this paper, we propose ACDnet, a compact action detection network
targeting real-time edge computing which addresses both efficiency and
accuracy. It intelligently exploits the temporal coherence between successive
video frames to approximate their CNN features rather than naively extracting
them. It also integrates memory feature aggregation from past video frames to
enhance current detection stability, implicitly modeling long temporal cues
over time. Experiments conducted on the public benchmark datasets UCF-24 and
JHMDB-21 demonstrate that ACDnet, when integrated with the SSD detector, can
robustly achieve detection well above real-time (75 FPS). At the same time, it
retains reasonable accuracy (70.92 and 49.53 frame mAP) compared to other
top-performing methods using far heavier configurations. Codes will be
available at https://github.com/dginhac/ACDnet.
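The abstract describes ACDnet's two ingredients (flow-guided feature approximation and memory aggregation) only at a high level. The sketch below, in PyTorch, shows how such a pipeline is commonly realized: backbone features computed on a sparse key frame are warped to the current frame with a lightweight flow field, and a running memory feature is blended in. The function names, the fixed aggregation weight, and the bilinear warping are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional

import torch
import torch.nn.functional as F


def warp_features(feat_key: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp key-frame features to the current frame using a flow field.

    feat_key: (N, C, H, W) features extracted on the last key frame.
    flow:     (N, 2, H, W) displacement (in feature-map pixels) from the
              current frame back to the key frame, e.g. from a light flow net.
    """
    n, _, h, w = feat_key.shape
    # Build a sampling grid, shift it by the flow, and normalize to [-1, 1].
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat_key.device, dtype=feat_key.dtype),
        torch.arange(w, device=feat_key.device, dtype=feat_key.dtype),
        indexing="ij",
    )
    grid_x = (xs + flow[:, 0]) * 2.0 / max(w - 1, 1) - 1.0  # (N, H, W)
    grid_y = (ys + flow[:, 1]) * 2.0 / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)             # (N, H, W, 2)
    return F.grid_sample(feat_key, grid, align_corners=True)


def aggregate_memory(feat_cur: torch.Tensor,
                     memory: Optional[torch.Tensor],
                     alpha: float = 0.5) -> torch.Tensor:
    """Blend the current (approximated) features with a memory feature
    carried over from past frames; alpha is an illustrative fixed weight."""
    if memory is None:
        return feat_cur
    return alpha * feat_cur + (1.0 - alpha) * memory
```

In a scheme like this, the heavy backbone runs only on key frames; on intermediate frames only the flow estimator, the warping, and the detection head (SSD in ACDnet's case) are executed, which is what makes throughput well above real-time plausible.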
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
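The TCCT-Net entry above mentions a "TC" stream that turns behavioral feature signals into 2D tensors via the Continuous Wavelet Transform. As a rough illustration (not the authors' code), a 1D signal can be converted into a scales-by-time scalogram with PyWavelets; the wavelet choice, the number of scales, and the magnitude representation are assumptions.

```python
import numpy as np
import pywt


def signal_to_scalogram(signal: np.ndarray,
                        num_scales: int = 64,
                        wavelet: str = "morl") -> np.ndarray:
    """Turn a 1D behavioral feature signal into a 2D time-frequency tensor
    (scales x time) via the Continuous Wavelet Transform."""
    scales = np.arange(1, num_scales + 1)
    coeffs, _freqs = pywt.cwt(signal, scales, wavelet)
    # Magnitude scalogram: a 2D "image" a convolutional stream can consume.
    return np.abs(coeffs).astype(np.float32)


# Example: a 10 s signal sampled at 30 Hz becomes a (64, 300) tensor.
scalogram = signal_to_scalogram(np.random.randn(300))
print(scalogram.shape)  # (64, 300)
```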
- Local Compressed Video Stream Learning for Generic Event Boundary Detection [25.37983456118522]
Event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks.
Existing methods typically require video frames to be decoded before being fed into the network.
We propose a novel, fully end-to-end event boundary detection method that leverages rich information in the compressed domain.
arXiv Detail & Related papers (2023-09-27T06:49:40Z)
- Spatiotemporal Attention-based Semantic Compression for Real-time Video Recognition [117.98023585449808]
We propose a spatiotemporal attention-based autoencoder (STAE) architecture to evaluate the importance of frames and of pixels within each frame.
We develop a lightweight decoder that leverages a combined 3D-2D CNN to reconstruct missing information.
Experimental results show that ViT_STAE can compress the video dataset HMDB51 by 104x with only 5% accuracy loss.
arXiv Detail & Related papers (2023-05-22T07:47:27Z)
- DroneAttention: Sparse Weighted Temporal Attention for Drone-Camera Based Activity Recognition [2.705905918316948]
Human activity recognition (HAR) using drone-mounted cameras has attracted considerable interest from the computer vision research community in recent years.
We propose a novel Sparse Weighted Temporal Attention (SWTA) module to utilize sparsely sampled video frames for obtaining global weighted temporal attention.
The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets.
arXiv Detail & Related papers (2022-12-07T00:33:40Z)
- Spatio-Temporal-based Context Fusion for Video Anomaly Detection [1.7710335706046505]
Video anomaly detection aims to discover abnormal events in videos, whose principal subjects are target objects such as people and vehicles.
Most existing methods only focus on the temporal context, ignoring the role of the spatial context in anomaly detection.
This paper proposes a video anomaly detection algorithm based on target spatio-temporal context fusion.
arXiv Detail & Related papers (2022-10-18T04:07:10Z)
- Distortion-Aware Network Pruning and Feature Reuse for Real-time Video Segmentation [49.17930380106643]
We propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks.
Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins.
We then perform partial computation of the backbone network on the regions of the current frame that capture temporal differences between the current and previous frames (a minimal sketch of this reuse-and-recompute idea follows this entry).
arXiv Detail & Related papers (2022-06-20T07:20:02Z)
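As referenced in the entry above, the sketch below copies previous-frame features in spatial bins where the frame barely changed and takes freshly computed features elsewhere. It is deliberately simplified: it runs the backbone densely and masks afterwards, whereas the paper restricts computation to the changed regions; the threshold, pooling, and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn


def reuse_or_recompute(backbone: nn.Module,
                       frame_cur: torch.Tensor,   # (N, 3, H, W)
                       frame_prev: torch.Tensor,  # (N, 3, H, W)
                       feat_prev: torch.Tensor,   # (N, C, h, w) previous features
                       threshold: float = 0.05) -> torch.Tensor:
    """Reuse previous-frame features in bins that barely changed and keep
    newly computed features only where the frames differ."""
    n, c, h, w = feat_prev.shape
    # Per-bin mean absolute frame difference, pooled to the feature resolution.
    diff = (frame_cur - frame_prev).abs().mean(dim=1, keepdim=True)  # (N, 1, H, W)
    diff = F.adaptive_avg_pool2d(diff, (h, w))                       # (N, 1, h, w)
    changed = (diff > threshold).float()                             # 1 where the bin changed
    feat_new = backbone(frame_cur)                                   # (N, C, h, w)
    return changed * feat_new + (1.0 - changed) * feat_prev
```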
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iteration.
Based on a latency- and accuracy-aware reward design, such a scheme can adapt well to complex environments such as dynamic wireless channels and arbitrary processing loads, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Parallel Detection for Efficient Video Analytics at the Edge [5.547133811014004]
Deep Neural Network (DNN)-trained object detectors are widely deployed in mission-critical systems for real-time video analytics at the edge.
A common performance requirement in mission-critical edge services is the near real-time latency of online object detection on edge devices.
This paper addresses these problems by exploiting multi-model multi-device detection parallelism for fast object detection in edge systems.
arXiv Detail & Related papers (2021-07-27T02:50:46Z)
- Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing convolution operation at each gate of ConvLSTM with a depthwise separable convolution.
Our model improves accuracy on the larger and more challenging RWF-2000 dataset by more than a 2% margin (a gate-level sketch of SepConvLSTM follows this entry).
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
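The entry above is concrete about the core modification: each gate convolution of a ConvLSTM is replaced by a depthwise-separable convolution. A minimal cell sketch follows, assuming a standard ConvLSTM formulation without peepholes or normalization; the authors' exact cell may differ.

```python
import torch
from torch import nn


class SepConvLSTMCell(nn.Module):
    """ConvLSTM cell whose gate convolutions are depthwise-separable:
    a depthwise k x k conv followed by a 1x1 pointwise conv."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        ch = in_ch + hid_ch
        self.gates = nn.Sequential(
            # Depthwise: one k x k filter per input channel.
            nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch),
            # Pointwise: 1x1 conv mixing channels into the 4 gates (i, f, o, g).
            nn.Conv2d(ch, 4 * hid_ch, 1),
        )

    def forward(self, x, state):
        h, c = state  # hidden and cell states, each (N, hid_ch, H, W)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```

A standard gate convolution over C channels with a k x k kernel costs on the order of C^2 k^2 multiply-accumulates per position, whereas the depthwise-plus-pointwise pair costs roughly C k^2 + C^2, which is where the efficiency of SepConvLSTM comes from.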
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance compared to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
- Depthwise Non-local Module for Fast Salient Object Detection Using a Single Thread [136.2224792151324]
We propose a new deep learning algorithm for fast salient object detection.
The proposed algorithm achieves competitive accuracy and high inference efficiency simultaneously with a single CPU thread.
arXiv Detail & Related papers (2020-01-22T15:23:48Z)
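The last entry names a depthwise non-local module but gives no internals. For reference, the sketch below implements the standard embedded-Gaussian non-local block that such modules build on; the depthwise factorization itself is the paper's contribution and is not reproduced here.

```python
import torch
from torch import nn


class NonLocalBlock(nn.Module):
    """Standard embedded-Gaussian non-local block (Wang et al., 2018):
    every spatial position attends to every other position."""

    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        inner = channels // reduction
        self.theta = nn.Conv2d(channels, inner, 1)  # query
        self.phi = nn.Conv2d(channels, inner, 1)    # key
        self.g = nn.Conv2d(channels, inner, 1)      # value
        self.out = nn.Conv2d(inner, channels, 1)    # project back

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)        # (N, HW, C')
        k = self.phi(x).flatten(2)                           # (N, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)             # (N, HW, C')
        attn = torch.softmax(q @ k, dim=-1)                  # (N, HW, HW)
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)  # (N, C', H, W)
        return x + self.out(y)                               # residual connection
```

The dense block above has a cost quadratic in the number of spatial positions (the HW x HW attention map), which is exactly what factorized or depthwise variants aim to avoid when targeting a single CPU thread.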