A Spatio-Temporal Attentive Network for Video-Based Crowd Counting
- URL: http://arxiv.org/abs/2208.11339v1
- Date: Wed, 24 Aug 2022 07:40:34 GMT
- Title: A Spatio-Temporal Attentive Network for Video-Based Crowd Counting
- Authors: Marco Avvenuti, Marco Bongiovanni, Luca Ciampi, Fabrizio Falchi,
Claudio Gennaro, Nicola Messina
- Abstract summary: Current computer vision techniques rely on deep learning-based algorithms that estimate pedestrian densities in still, individual images.
By taking advantage of the temporal correlation between consecutive frames, we lowered state-of-the-art count error by 5% and localization error by 7.5% on the widely-used FDST benchmark.
- Score: 5.556665316806146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic people counting from images has recently drawn attention for urban
monitoring in modern Smart Cities due to the ubiquity of surveillance camera
networks. Current computer vision techniques rely on deep learning-based
algorithms that estimate pedestrian densities in still, individual images. Only
a handful of works take advantage of temporal consistency in video sequences. In
this work, we propose a spatio-temporal attentive neural network to estimate
the number of pedestrians from surveillance videos. By taking advantage of the
temporal correlation between consecutive frames, we lowered state-of-the-art
count error by 5% and localization error by 7.5% on the widely-used FDST
benchmark.
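To make the general idea concrete, here is a minimal PyTorch sketch of a spatio-temporal attentive counter: a per-frame encoder, self-attention across consecutive frames at every spatial location, and a density-map head whose sum gives the count. Every layer choice below is an assumption for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TemporalAttentiveCounter(nn.Module):
    """Illustrative sketch only, not the paper's model: per-frame CNN
    features, attention over the T consecutive frames at each spatial
    location, then a 1x1 head regressing a density map."""

    def __init__(self, channels=64):
        super().__init__()
        # Lightweight spatial encoder (stand-in for a real backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.head = nn.Conv2d(channels, 1, 1)       # density-map head

    def forward(self, clip):                        # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1))    # (B*T, C, H, W)
        c, h, w = feats.shape[1:]
        # Each spatial location attends over its own T time steps.
        seq = feats.reshape(b, t, c, h * w).permute(0, 3, 1, 2)
        seq = seq.reshape(b * h * w, t, c)          # (B*HW, T, C)
        fused, _ = self.attn(seq, seq, seq)
        last = fused[:, -1].reshape(b, h * w, c).permute(0, 2, 1)
        density = self.head(last.reshape(b, c, h, w)).relu()
        return density.sum(dim=(1, 2, 3)), density  # per-clip count, map

counts, dmap = TemporalAttentiveCounter()(torch.randn(2, 4, 3, 64, 64))
```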
Related papers
- Violence detection in videos using deep recurrent and convolutional neural networks [0.0]
We propose a deep learning architecture for violence detection that combines recurrent neural networks (RNNs) and 2-dimensional convolutional neural networks (2D CNNs).
In addition to video frames, we use optical flow computed from the captured sequences.
The proposed approaches match state-of-the-art techniques and sometimes surpass them.
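As an aside, the RNN-plus-2D-CNN pattern this entry describes can be sketched in a few lines of PyTorch; the layer sizes and the 5-channel RGB+flow input below are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    """Generic sketch: a shared 2D-CNN encoder per frame, an LSTM over
    time, a binary violence logit. Assumes each frame stacks RGB with
    two optical-flow channels (5 channels total)."""

    def __init__(self, in_ch=5, feat=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (N, 64)
            nn.Linear(64, feat), nn.ReLU(),
        )
        self.rnn = nn.LSTM(feat, feat, batch_first=True)
        self.fc = nn.Linear(feat, 1)                 # violence logit

    def forward(self, clip):                         # (B, T, C, H, W)
        b, t = clip.shape[:2]
        f = self.cnn(clip.flatten(0, 1)).view(b, t, -1)  # per-frame features
        out, _ = self.rnn(f)                         # temporal modelling
        return self.fc(out[:, -1])                   # last hidden state

logit = CnnLstmClassifier()(torch.randn(2, 8, 5, 112, 112))
```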
arXiv Detail & Related papers (2024-09-11T19:21:51Z)
- DroneAttention: Sparse Weighted Temporal Attention for Drone-Camera Based Activity Recognition [2.705905918316948]
Human activity recognition (HAR) using drone-mounted cameras has attracted considerable interest from the computer vision research community in recent years.
We propose a novel Sparse Weighted Temporal Attention (SWTA) module that uses sparsely sampled video frames to obtain global weighted temporal attention.
The proposed model achieves an accuracy of 72.76%, 92.56%, and 78.86% on the respective datasets.
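A rough sketch of the sparse weighted temporal attention idea, with all specifics (uniform sampling, a linear scoring layer) assumed for illustration:

```python
import torch
import torch.nn as nn

class SparseWeightedTemporalAttention(nn.Module):
    """Loose sketch of the SWTA idea as summarised above: sample a sparse
    subset of frames, score each one, and pool their features with
    softmax weights. All details here are assumptions."""

    def __init__(self, feat_dim=256, num_samples=4):
        super().__init__()
        self.num_samples = num_samples
        self.score = nn.Linear(feat_dim, 1)      # per-frame attention score

    def forward(self, frame_feats):              # (B, T, D) per-frame features
        t = frame_feats.shape[1]
        # Uniform sparse sampling of num_samples indices over T frames.
        idx = torch.linspace(0, t - 1, self.num_samples).long()
        sampled = frame_feats[:, idx]                  # (B, S, D)
        w = torch.softmax(self.score(sampled), dim=1)  # (B, S, 1)
        return (w * sampled).sum(dim=1)                # (B, D) global feature

pooled = SparseWeightedTemporalAttention()(torch.randn(2, 32, 256))
```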
arXiv Detail & Related papers (2022-12-07T00:33:40Z)
- Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
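The frequency-based disentanglement can be pictured as a plain FFT low/high-pass split; the paper's differentiable formulation is more involved, so treat the radial cutoff mask below as an assumption:

```python
import torch

def frequency_split(frames, cutoff=0.1):
    """Sketch only: split frames into low- and high-frequency parts with a
    radial FFT mask. Not the paper's differentiable formulation."""
    f = torch.fft.fftshift(torch.fft.fft2(frames), dim=(-2, -1))
    h, w = frames.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, h),
                            torch.linspace(-0.5, 0.5, w), indexing="ij")
    low_mask = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(f.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(f * low_mask, dim=(-2, -1))).real
    return low, frames - low     # residual carries the high frequencies

low, high = frequency_split(torch.randn(2, 3, 64, 64))
```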
arXiv Detail & Related papers (2022-09-15T22:16:52Z)
- Real Time Action Recognition from Video Footage [0.5219568203653523]
Video surveillance cameras have added a new dimension to crime detection.
This research focuses on integrating state-of-the-art Deep Learning methods to ensure a robust pipeline for autonomous surveillance for detecting violent activities.
arXiv Detail & Related papers (2021-12-13T07:27:41Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
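One common way to realise such a co-attention fusion is mutual sigmoid gating between the two streams; the sketch below assumes this pattern and matching channel counts, which may differ from the paper:

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Sketch of a common co-attention pattern: each stream gates the
    other with a sigmoid attention map before fusion. The paper's exact
    formulation may differ."""

    def __init__(self, ch=64):
        super().__init__()
        self.att_low = nn.Conv2d(ch, 1, 1)    # attention for the low-level stream
        self.att_high = nn.Conv2d(ch, 1, 1)   # attention for the high-level stream
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, low, high):             # both (B, C, H, W)
        low_g = low * torch.sigmoid(self.att_low(high))    # high guides low
        high_g = high * torch.sigmoid(self.att_high(low))  # low guides high
        return self.fuse(torch.cat([low_g, high_g], dim=1))

fused = CoAttentionFusion()(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```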
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Motion-guided Non-local Spatial-Temporal Network for Video Crowd Counting [2.3732259124656903]
We study video crowd counting, i.e., estimating the number of objects in every frame of a video sequence.
We propose Monet, a motion-guided non-local spatial-temporal network for video crowd counting.
Our approach achieves substantially better performance in terms of MAE and MSE as compared with other state-of-the-art approaches.
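The non-local building block itself is standard (Wang et al., 2018) and can be sketched as below; the motion-guided part of Monet adds flow cues on top of it and is not reproduced here:

```python
import torch
import torch.nn as nn

class NonLocalBlock2D(nn.Module):
    """Standard embedded-Gaussian non-local block: every position attends
    to every other position of the feature map, with a residual output."""

    def __init__(self, ch=64):
        super().__init__()
        inner = ch // 2
        self.theta = nn.Conv2d(ch, inner, 1)
        self.phi = nn.Conv2d(ch, inner, 1)
        self.g = nn.Conv2d(ch, inner, 1)
        self.out = nn.Conv2d(inner, ch, 1)

    def forward(self, x):                              # (B, C, H, W)
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C/2)
        k = self.phi(x).flatten(2)                     # (B, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C/2)
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection

y = NonLocalBlock2D()(torch.randn(1, 64, 16, 16))
```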
arXiv Detail & Related papers (2021-04-28T18:05:13Z)
- Video Action Recognition Using spatio-temporal optical flow video frames [0.0]
There are many problems associated with recognizing human actions in videos.
This paper focuses on spatial and temporal pattern recognition for the classification of videos using deep neural networks.
The final recognition accuracy was about 94%.
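For reference, the optical-flow frames such pipelines consume are typically dense flow fields; a minimal OpenCV example follows, with synthetic frames and assumed Farneback parameters:

```python
import cv2
import numpy as np

# Dense Farneback optical flow between two consecutive grayscale frames,
# the usual temporal input for this kind of classifier. Synthetic frames
# stand in for real video here.
prev = np.random.randint(0, 255, (240, 320), np.uint8)
curr = np.random.randint(0, 255, (240, 320), np.uint8)
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, pyr_scale=0.5,
                                    levels=3, winsize=15, iterations=3,
                                    poly_n=5, poly_sigma=1.2, flags=0)
# flow has shape (H, W, 2): per-pixel (dx, dy) to feed a CNN stream.
```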
arXiv Detail & Related papers (2021-02-05T19:46:49Z)
- DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion [67.64047158294062]
We propose an online multi-view depth prediction approach on posed video streams.
The scene geometry information computed in the previous time steps is propagated to the current time step.
We outperform the existing state-of-the-art multi-view stereo methods on most of the evaluated metrics.
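The recurrent propagation of past scene information can be pictured as a ConvGRU-style cell carried along the video; the cell below is an assumed simplification, not the DeepVideoMVS fusion module:

```python
import torch
import torch.nn as nn

class RecurrentFusionCell(nn.Module):
    """ConvGRU-like sketch: a hidden state carries geometry information
    from previous time steps into the current features."""

    def __init__(self, ch=32):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)  # update/reset
        self.cand = nn.Conv2d(2 * ch, ch, 3, padding=1)       # candidate state

    def forward(self, feat, hidden):          # both (B, C, H, W)
        z, r = torch.sigmoid(self.gates(torch.cat([feat, hidden], 1))).chunk(2, 1)
        cand = torch.tanh(self.cand(torch.cat([feat, r * hidden], 1)))
        return (1 - z) * hidden + z * cand    # fused hidden state

cell = RecurrentFusionCell()
h = torch.zeros(1, 32, 30, 40)
for _ in range(3):                            # propagate across time steps
    h = cell(torch.randn(1, 32, 30, 40), h)
```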
arXiv Detail & Related papers (2020-12-03T18:54:03Z)
- Counting People by Estimating People Flows [135.85747920798897]
We advocate estimating people flows across image locations between consecutive images instead of directly regressing the densities.
It significantly boosts performance without requiring a more complex architecture.
We also show that leveraging people conservation constraints in both a spatial and temporal manner makes it possible to train a deep crowd counting model.
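The flow formulation can be made concrete with a small sketch: model per-pixel flows to the nine possible target cells (stay plus eight neighbours) and recover both frames' densities from them, which is what makes a conservation constraint expressible. The offsets and wrap-around boundary handling below are simplifying assumptions:

```python
import torch

# Nine possible moves between consecutive frames: stay + 8 neighbours.
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def densities_from_flows(flows):
    """flows: (B, 9, H, W); flows[:, k, y, x] = people moving from (y, x)
    at time t to (y+dy, x+dx) at time t+1. Returns the implied densities
    at t (outgoing sums) and t+1 (incoming sums). Wrap-around borders are
    a simplification."""
    dens_t = flows.sum(dim=1)                 # everyone must go somewhere
    dens_t1 = torch.zeros_like(dens_t)
    for k, (dy, dx) in enumerate(OFFSETS):
        # Shift channel k by its offset so each flow lands on its target.
        dens_t1 += torch.roll(flows[:, k], shifts=(dy, dx), dims=(-2, -1))
    return dens_t, dens_t1

flows = torch.rand(2, 9, 32, 32)
d_t, d_t1 = densities_from_flows(flows)
# People conservation: the total count is preserved across the two maps.
assert torch.allclose(d_t.sum(dim=(1, 2)), d_t1.sum(dim=(1, 2)))
```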
arXiv Detail & Related papers (2020-12-01T12:59:24Z)
- A Prospective Study on Sequence-Driven Temporal Sampling and Ego-Motion Compensation for Action Recognition in the EPIC-Kitchens Dataset [68.8204255655161]
Action recognition is one of the most challenging research fields in computer vision.
Sequences recorded under ego-motion have become particularly relevant.
The proposed method copes with this by estimating the ego-motion, i.e., the camera motion.
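One standard way to estimate such camera motion, shown purely as an illustration (sparse Lucas-Kanade tracking plus a RANSAC homography; not the paper's method):

```python
import cv2
import numpy as np

def estimate_camera_motion(prev_gray, curr_gray):
    """Track corners with Lucas-Kanade optical flow and fit a homography;
    warping with its inverse compensates the ego-motion. Illustrative
    technique only."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=8)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good = status.ravel() == 1
    H, _ = cv2.findHomography(pts[good], nxt[good], cv2.RANSAC, 3.0)
    return H

prev = np.random.randint(0, 255, (240, 320), np.uint8)  # synthetic frame
curr = np.roll(prev, 2, axis=1)                         # fake 2-pixel pan
H = estimate_camera_motion(prev, curr)
stabilized = cv2.warpPerspective(curr, np.linalg.inv(H), (320, 240))
```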
arXiv Detail & Related papers (2020-08-26T14:44:45Z)
- TimeConvNets: A Deep Time Windowed Convolution Neural Network Design for Real-time Video Facial Expression Recognition [93.0013343535411]
This study explores a novel deep time windowed convolutional neural network design (TimeConvNets) for the purpose of real-time video facial expression recognition.
We show that TimeConvNets can better capture the transient nuances of facial expressions and boost classification accuracy while maintaining a low inference time.
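The time-windowed idea can be approximated by stacking a short window of frames into the channel dimension of a 2D CNN, as sketched below; this is an assumed simplification, not the published TimeConvNets design:

```python
import torch
import torch.nn as nn

class TimeWindowedNet(nn.Module):
    """Sketch: a window of T frames becomes 3*T input channels, so a single
    2D-CNN pass sees the transient dynamics of an expression."""

    def __init__(self, window=3, num_classes=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * window, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_classes),      # expression logits
        )

    def forward(self, frames):               # (B, T, 3, H, W)
        return self.net(frames.flatten(1, 2))  # fold time into channels

logits = TimeWindowedNet()(torch.randn(2, 3, 3, 96, 96))
```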
arXiv Detail & Related papers (2020-03-03T20:58:52Z)