Learning Appearance-motion Normality for Video Anomaly Detection
- URL: http://arxiv.org/abs/2207.13361v1
- Date: Wed, 27 Jul 2022 08:30:19 GMT
- Title: Learning Appearance-motion Normality for Video Anomaly Detection
- Authors: Yang Liu, Jing Liu, Mengyang Zhao, Dingkang Yang, Xiaoguang Zhu, Liang Song
- Abstract summary: We propose a two-stream auto-encoder framework augmented with spatial-temporal memories.
It learns appearance normality and motion normality independently and explores their correlations via adversarial learning.
Our framework outperforms the state-of-the-art methods, achieving AUCs of 98.1% and 89.8% on the UCSD Ped2 and CUHK Avenue datasets.
- Score: 11.658792932975652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video anomaly detection is a challenging task in the computer vision
community. Most single task-based methods do not consider the independence of
unique spatial and temporal patterns, while two-stream structures lack
exploration of the correlations between them. In this paper, we propose a
two-stream auto-encoder framework augmented with spatial-temporal memories,
which learns appearance normality and motion normality independently and
explores their correlations via adversarial learning. Specifically, we first
design two proxy tasks to train the two-stream structure to extract appearance
and motion features in isolation. Then, the prototypical features are recorded
in the corresponding spatial and temporal memory pools. Finally, the
encoding-decoding network performs adversarial learning against a
discriminator to explore the correlations between spatial and temporal
patterns. Experimental results show that our framework outperforms
state-of-the-art methods, achieving AUCs of 98.1% and 89.8% on the UCSD Ped2
and CUHK Avenue datasets.
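
To make the pipeline above concrete, the following is a minimal PyTorch sketch
of a two-stream, memory-augmented auto-encoder trained adversarially against a
discriminator. It is an illustration under stated assumptions rather than the
authors' implementation: the memory addressing uses a common cosine-similarity
attention scheme (MemAE-style), and the module names (MemoryPool, StreamAE,
Discriminator), layer sizes, and losses are hypothetical placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MemoryPool(nn.Module):
        # Records prototypical normal features; each query is re-expressed as
        # an attention-weighted combination of memory items (cosine addressing,
        # an assumed scheme; the paper's exact mechanism may differ).
        def __init__(self, num_items, dim):
            super().__init__()
            self.items = nn.Parameter(torch.randn(num_items, dim))

        def forward(self, query):
            # query: (N, dim) -> attention over memory items -> (N, dim)
            attn = F.softmax(
                F.normalize(query, dim=-1) @ F.normalize(self.items, dim=-1).t(),
                dim=-1)
            return attn @ self.items

    class StreamAE(nn.Module):
        # One stream (appearance on RGB frames, or motion on optical flow):
        # encoder -> memory pool -> decoder.
        def __init__(self, in_ch, dim=128, mem_items=100):
            super().__init__()
            self.enc = nn.Sequential(
                nn.Conv2d(in_ch, 64, 4, 2, 1), nn.ReLU(),
                nn.Conv2d(64, dim, 4, 2, 1), nn.ReLU())
            self.mem = MemoryPool(mem_items, dim)
            self.dec = nn.Sequential(
                nn.ConvTranspose2d(dim, 64, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(64, in_ch, 4, 2, 1), nn.Sigmoid())

        def forward(self, x):
            z = self.enc(x)                              # (B, dim, H/4, W/4)
            b, c, h, w = z.shape
            flat = z.permute(0, 2, 3, 1).reshape(-1, c)  # one query per location
            z_hat = self.mem(flat).reshape(b, h, w, c).permute(0, 3, 1, 2)
            return self.dec(z_hat)

    class Discriminator(nn.Module):
        # Judges whether an (appearance, motion) pair is a real frame/flow pair
        # or a reconstructed one, pushing the two streams to respect their
        # correlations.
        def __init__(self, in_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 1, 4, 2, 1))

        def forward(self, x):
            return self.net(x).mean(dim=(1, 2, 3))       # one logit per sample

    # One illustrative training step on dummy data (shapes are placeholders).
    app_ae, mot_ae = StreamAE(in_ch=3), StreamAE(in_ch=2)  # RGB frame, 2-ch flow
    disc = Discriminator(in_ch=5)                          # 3 + 2 channels
    bce = nn.BCEWithLogitsLoss()

    frame = torch.rand(8, 3, 64, 64)  # stand-in for a batch of frames
    flow = torch.rand(8, 2, 64, 64)   # stand-in for a batch of optical flow

    frame_hat, flow_hat = app_ae(frame), mot_ae(flow)
    real_pair = torch.cat([frame, flow], dim=1)
    fake_pair = torch.cat([frame_hat, flow_hat], dim=1)

    # Generator side: reconstruct each modality (the two proxy tasks) and
    # fool the discriminator with the joint reconstruction.
    g_loss = (F.mse_loss(frame_hat, frame) + F.mse_loss(flow_hat, flow)
              + bce(disc(fake_pair), torch.ones(8)))
    # Discriminator side: separate real pairs from reconstructed pairs.
    d_loss = (bce(disc(real_pair), torch.ones(8))
              + bce(disc(fake_pair.detach()), torch.zeros(8)))

At test time, frameworks of this kind typically score each frame by its
reconstruction error; because the memory pools record only prototypes of
normal appearance and motion, anomalous inputs reconstruct poorly and receive
high anomaly scores.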
Related papers
- Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653] (2024-03-28)
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
- Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition [10.403751563214113] (2023-12-23)
STD-CL is a framework to obtain discriminative and semantically distinct representations from skeleton sequences.
STD-CL achieves solid improvements on the NTU60, NTU120, and NW-UCLA benchmarks.
- Holistic Representation Learning for Multitask Trajectory Anomaly Detection [65.72942351514956] (2023-11-03)
We propose a holistic representation of skeleton trajectories to learn expected motions across segments at different times.
We encode temporally occluded trajectories, jointly learn latent representations of the occluded segments, and reconstruct trajectories based on expected motions across different temporal segments.
- Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation [64.85974098314344] (2023-09-23)
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video.
Inherently, object pairs and their relationships exhibit spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images.
We propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates prior spatial-temporal knowledge into the multi-head cross-attention mechanism.
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505] (2023-04-27)
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
- Exploiting Spatial-temporal Correlations for Video Anomaly Detection [7.336831373786849] (2022-11-02)
Video anomaly detection (VAD) remains a challenging task in the pattern recognition community due to the ambiguity and diversity of abnormal events.
We introduce a discriminator to perform adversarial learning with the ST-LSTM to enhance the learning capability.
Our method achieves competitive performance compared to state-of-the-art methods, with AUCs of 96.7%, 87.8%, and 73.1% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively.
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701] (2021-04-15)
We propose a novel framework to pursue discriminative and robust representations by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
- Multiple Object Tracking with Correlation Learning [16.959379957515974] (2021-04-08)
We propose to exploit the local correlation module to model the topological relationship between targets and their surrounding environment.
Specifically, we establish dense correspondences of each spatial location and its context, and explicitly constrain the correlation volumes through self-supervised learning.
Our approach demonstrates the effectiveness of correlation learning with superior performance, obtaining a state-of-the-art MOTA of 76.5% and IDF1 of 73.6% on MOT17.
- Temporal Memory Relation Network for Workflow Recognition from Surgical Video [53.20825496640025] (2021-03-30)
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
- Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning [6.523119805288132] (2020-11-23)
We present a novel technique for self-supervised video representation learning by: (a) decoupling the learning objective into two contrastive subtasks respectively emphasizing spatial and temporal features, and (b) performing it hierarchically to encourage multi-scale understanding.
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455] (2020-04-10)
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet exploit the spatial and temporal long-range context interdependencies of such features and their spatial-temporal correlations.
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.