Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly
Detectors
- URL: http://arxiv.org/abs/2306.12041v2
- Date: Sat, 9 Mar 2024 20:43:33 GMT
- Title: Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly
Detectors
- Authors: Nicolae-Catalin Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu,
Marius Popescu, Fahad Shahbaz Khan, Mubarak Shah
- Abstract summary: We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level.
We introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects.
We generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames and the corresponding pixel-level anomaly maps.
- Score: 117.61449210940955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an efficient abnormal event detection model based on a lightweight
masked auto-encoder (AE) applied at the video frame level. The novelty of the
proposed model is threefold. First, we introduce an approach to weight tokens
based on motion gradients, thus shifting the focus from the static background
scene to the foreground objects. Second, we integrate a teacher decoder and a
student decoder into our architecture, leveraging the discrepancy between the
outputs given by the two decoders to improve anomaly detection. Third, we
generate synthetic abnormal events to augment the training videos, and task the
masked AE model to jointly reconstruct the original frames (without anomalies)
and the corresponding pixel-level anomaly maps. Our design leads to an
efficient and effective model, as demonstrated by the extensive experiments
carried out on four benchmarks: Avenue, ShanghaiTech, UBnormal and UCSD Ped2.
The empirical results show that our model achieves an excellent trade-off
between speed and accuracy, obtaining competitive AUC scores, while processing
1655 FPS. Hence, our model is between 8 and 70 times faster than competing
methods. We also conduct an ablation study to justify our design. Our code is
freely available at: https://github.com/ristea/aed-mae.
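To make the abstract's first two ideas concrete, the sketch below shows, in plain PyTorch, how motion-gradient token weighting and a teacher-student decoder discrepancy could be wired together. It is a minimal illustration, not the released aed-mae code: the class and function names, the tiny two-layer encoder, the omission of token masking, and the loss weighting are all assumptions made for brevity.

```python
# Hypothetical sketch of motion-gradient token weighting and a
# teacher/student decoder discrepancy score (not the official aed-mae code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def motion_gradient_weights(prev_frame, frame, patch_size=16):
    """Weight each patch token by its mean absolute temporal gradient,
    shifting the reconstruction focus toward moving foreground objects."""
    grad = (frame - prev_frame).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    per_patch = F.avg_pool2d(grad, kernel_size=patch_size)       # (B, 1, H/ps, W/ps)
    weights = per_patch.flatten(1)                                # (B, num_tokens)
    return weights / (weights.sum(dim=1, keepdim=True) + 1e-6)


class TinyDecoder(nn.Module):
    def __init__(self, dim, patch_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, patch_dim))

    def forward(self, tokens):
        return self.net(tokens)


class TeacherStudentAE(nn.Module):
    """Shared lightweight encoder with two decoders; the student mimics the
    teacher on normal data, so their outputs diverge on anomalous frames."""

    def __init__(self, dim=128, patch_dim=16 * 16 * 3):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.teacher = TinyDecoder(dim, patch_dim)
        self.student = TinyDecoder(dim, patch_dim)

    def forward(self, patches):                 # patches: (B, num_tokens, patch_dim)
        latent = self.encoder(self.embed(patches))
        return self.teacher(latent), self.student(latent)


def training_loss(model, patches, token_weights):
    t_out, s_out = model(patches)
    rec = ((t_out - patches) ** 2).mean(dim=-1)              # per-token reconstruction
    distill = ((s_out - t_out.detach()) ** 2).mean(dim=-1)   # self-distillation term
    return ((rec + distill) * token_weights).sum(dim=1).mean()


def anomaly_score(model, patches):
    """Frame-level score: reconstruction error plus teacher-student discrepancy."""
    with torch.no_grad():
        t_out, s_out = model(patches)
        rec = ((t_out - patches) ** 2).mean(dim=-1)
        disc = ((s_out - t_out) ** 2).mean(dim=-1)
        return (rec + disc).amax(dim=1)                       # max over tokens
```

At inference time, the per-token reconstruction error and the teacher-student disagreement are combined into a frame-level score; the paper's third component (synthetic anomalies and reconstruction of pixel-level anomaly maps) is omitted from this sketch.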
Related papers
- 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder -- we refer to this as 4D modeling.
To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning.
In addition, we propose three novel one-pass beam search algorithms by combining three decoders.
arXiv Detail & Related papers (2024-06-05T05:18:20Z)
- Asymmetric Masked Distillation for Pre-Training Small Foundation Models [52.56257450614992]
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding.
This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks.
We propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding.
arXiv Detail & Related papers (2023-11-06T14:44:34Z)
- Multi-level Memory-augmented Appearance-Motion Correspondence Framework for Video Anomaly Detection [1.9511777443446219]
We propose a multi-level memory-augmented appearance-motion correspondence framework.
The latent correspondence between appearance and motion is explored via appearance-motion semantics alignment and semantics replacement training.
Our framework outperforms the state-of-the-art methods, achieving AUCs of 99.6%, 93.8%, and 76.3% on UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets.
arXiv Detail & Related papers (2023-03-09T08:43:06Z)
- Lightning Fast Video Anomaly Detection via Adversarial Knowledge Distillation [106.42167050921718]
We propose a very fast frame-level model for anomaly detection in video.
It learns to detect anomalies by distilling knowledge from multiple highly accurate object-level teacher models.
Our model achieves the best trade-off between speed and accuracy, due to its previously unheard-of speed of 1480 FPS.
arXiv Detail & Related papers (2022-11-28T17:50:19Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to match the accuracy of models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- Is Space-Time Attention All You Need for Video Understanding? [50.78676438502343]
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
"TimeSformer" adapts the standard Transformer architecture to video by enabling feature learning from a sequence of frame-level patches.
TimeSformer achieves state-of-the-art results on several major action recognition benchmarks.
arXiv Detail & Related papers (2021-02-09T19:49:33Z)
- Encoding Syntactic Knowledge in Transformer Encoder for Intent Detection and Slot Filling [6.234581622120001]
We propose a novel Transformer encoder-based architecture with syntactical knowledge encoded for intent detection and slot filling.
We encode syntactic knowledge into the Transformer encoder by jointly training it to predict syntactic parse ancestors and part-of-speech of each token via multi-task learning.
arXiv Detail & Related papers (2020-12-21T21:25:11Z)
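As a rough illustration of the multi-task training described in this last entry (main intent and slot heads plus auxiliary heads predicting part-of-speech tags and parse-tree ancestors), here is a small PyTorch sketch. The layer sizes, head names, and loss weight are assumptions for illustration, not details taken from the paper.

```python
# Illustrative multi-task heads on a shared Transformer encoder
# (assumed structure; not the paper's actual architecture).
import torch
import torch.nn as nn


class SyntaxAwareEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, num_intents=20,
                 num_slots=50, num_pos_tags=17, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Main task heads.
        self.intent_head = nn.Linear(dim, num_intents)   # utterance-level intent
        self.slot_head = nn.Linear(dim, num_slots)       # token-level slot labels
        # Auxiliary syntactic heads used for multi-task training.
        self.pos_head = nn.Linear(dim, num_pos_tags)     # part-of-speech per token
        self.ancestor_head = nn.Linear(dim, max_len)     # position of parse-tree ancestor

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))          # (B, T, dim)
        return {
            "intent": self.intent_head(h[:, 0]),         # first token summarizes the utterance
            "slots": self.slot_head(h),
            "pos": self.pos_head(h),
            "ancestor": self.ancestor_head(h),
        }


def multitask_loss(outputs, targets, aux_weight=0.3):
    ce = nn.CrossEntropyLoss()
    main = ce(outputs["intent"], targets["intent"]) + \
           ce(outputs["slots"].flatten(0, 1), targets["slots"].flatten())
    aux = ce(outputs["pos"].flatten(0, 1), targets["pos"].flatten()) + \
          ce(outputs["ancestor"].flatten(0, 1), targets["ancestor"].flatten())
    return main + aux_weight * aux
```

In a setup like this, the auxiliary syntactic heads only steer training; at test time, just the intent and slot predictions are used.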