Spatio-Temporal-based Context Fusion for Video Anomaly Detection
- URL: http://arxiv.org/abs/2210.09572v1
- Date: Tue, 18 Oct 2022 04:07:10 GMT
- Title: Spatio-Temporal-based Context Fusion for Video Anomaly Detection
- Authors: Chao Hu, Weibin Qiu, Weijie Wu and Liqiang Zhu
- Abstract summary: Video anomaly detection aims to discover abnormal events in videos, where the principal objects of interest are targets such as people and vehicles.
Most existing methods only focus on the temporal context, ignoring the role of the spatial context in anomaly detection.
This paper proposes a video anomaly detection algorithm based on target spatio-temporal context fusion.
- Score: 1.7710335706046505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video anomaly detection aims to discover abnormal events in videos, and the
principal objects are target objects such as people and vehicles. Each target
in the video data has rich spatio-temporal context information. Most existing
methods only focus on the temporal context, ignoring the role of the spatial
context in anomaly detection. The spatial context information represents the
relationship between the detection target and surrounding targets, which is of
considerable value for anomaly detection. To this end, a video anomaly detection
algorithm based on target spatio-temporal context fusion is proposed. Firstly,
the target in the video frame is extracted through the target detection network
to reduce background interference. Then the optical flow map of two adjacent
frames is calculated to obtain motion features. Multiple targets in the video
frame are then used simultaneously to construct the spatial context, and the
target appearance and motion features are re-encoded. Finally, these features
are reconstructed by the spatio-temporal dual-stream network, and the
reconstruction error is used to represent the anomaly score. The algorithm achieves frame-level AUCs
of 98.5% and 86.3% on the UCSDped2 and Avenue datasets, respectively. On the
UCSDped2 dataset, the spatio-temporal dual-stream network improves the frame-level AUC by
5.1% and 0.3%, respectively, compared to the temporal-only and spatial-only stream
networks. After using spatial context encoding, the frame-level AUC is enhanced
by 1%, which verifies the method's effectiveness.
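As a concrete illustration of the scoring step described in the abstract, the minimal PyTorch sketch below reconstructs per-target appearance (RGB crop) and motion (optical flow crop) inputs with a two-branch autoencoder and uses the mean reconstruction error as the frame-level anomaly score. This is not the authors' implementation: the network sizes, crop resolution, and all class and variable names are illustrative assumptions, and the object detector, optical flow estimator, and spatial-context re-encoding step are omitted.

```python
import torch
import torch.nn as nn


class StreamAutoEncoder(nn.Module):
    """Small convolutional autoencoder used for one stream (appearance or motion)."""

    def __init__(self, in_ch: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, in_ch, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


class DualStreamAnomalyScorer(nn.Module):
    """Appearance stream takes RGB target crops; motion stream takes 2-channel optical flow crops."""

    def __init__(self):
        super().__init__()
        self.appearance = StreamAutoEncoder(in_ch=3)
        self.motion = StreamAutoEncoder(in_ch=2)

    def forward(self, rgb_crops, flow_crops):
        rec_rgb = self.appearance(rgb_crops)
        rec_flow = self.motion(flow_crops)
        # Per-target mean squared reconstruction error for each stream.
        err_rgb = (rec_rgb - rgb_crops).pow(2).mean(dim=(1, 2, 3))
        err_flow = (rec_flow - flow_crops).pow(2).mean(dim=(1, 2, 3))
        # Average over all targets in the frame; a larger value means a more anomalous frame.
        return (err_rgb + err_flow).mean()


if __name__ == "__main__":
    model = DualStreamAnomalyScorer().eval()
    # Pretend a frame contains 4 detected targets, each cropped and resized to 64x64.
    rgb_crops = torch.rand(4, 3, 64, 64)    # appearance input (crops from the object detector)
    flow_crops = torch.rand(4, 2, 64, 64)   # motion input (optical flow between adjacent frames)
    with torch.no_grad():
        frame_score = model(rgb_crops, flow_crops)
    print(f"frame-level anomaly score: {frame_score.item():.4f}")
```

In evaluation, per-frame scores collected this way would be compared against frame-level ground-truth labels with ROC AUC (e.g. sklearn.metrics.roc_auc_score) to obtain frame-level AUC numbers like those reported above.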
Related papers
- Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z) - Dual Memory Aggregation Network for Event-Based Object Detection with
Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
arXiv Detail & Related papers (2023-03-17T12:12:41Z) - Learning Appearance-motion Normality for Video Anomaly Detection [11.658792932975652]
We propose a spatial-temporal memories augmented two-stream auto-encoder framework.
It learns the appearance normality and motion normality independently and explores the correlations via adversarial learning.
Our framework outperforms the state-of-the-art methods, achieving AUCs of 98.1% and 89.8% on UCSD Ped2 and CUHK Avenue datasets.
arXiv Detail & Related papers (2022-07-27T08:30:19Z) - Explore Spatio-temporal Aggregation for Insubstantial Object Detection:
Benchmark Dataset and Baseline [16.59161777626215]
We endeavor on a rarely explored task named Insubstantial Object Detection (IOD), which aims to localize objects with the following characteristics.
We construct an IOD-Video dataset comprised of 600 videos (141,017 frames) covering various distances, sizes, visibility, and scenes captured by different spectral ranges.
In addition, we develop a spatio-temporal aggregation framework for IOD, in which different backbones are deployed and a spatio-temporal aggregation loss (STAloss) is elaborately designed to leverage the consistency along the time axis.
arXiv Detail & Related papers (2022-06-23T02:39:09Z) - Video Salient Object Detection via Contrastive Features and Attention
Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - DS-Net: Dynamic Spatiotemporal Network for Video Salient Object
Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance than state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z) - Co-Saliency Spatio-Temporal Interaction Network for Person
Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies on such features and spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)