Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly
Detectors
- URL: http://arxiv.org/abs/2306.12041v2
- Date: Sat, 9 Mar 2024 20:43:33 GMT
- Title: Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly
Detectors
- Authors: Nicolae-Catalin Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu,
Marius Popescu, Fahad Shahbaz Khan, Mubarak Shah
- Abstract summary: We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level.
We introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects.
We generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames and the corresponding pixel-level anomaly maps.
- Score: 117.61449210940955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an efficient abnormal event detection model based on a lightweight
masked auto-encoder (AE) applied at the video frame level. The novelty of the
proposed model is threefold. First, we introduce an approach to weight tokens
based on motion gradients, thus shifting the focus from the static background
scene to the foreground objects. Second, we integrate a teacher decoder and a
student decoder into our architecture, leveraging the discrepancy between the
outputs given by the two decoders to improve anomaly detection. Third, we
generate synthetic abnormal events to augment the training videos, and task the
masked AE model to jointly reconstruct the original frames (without anomalies)
and the corresponding pixel-level anomaly maps. Our design leads to an
efficient and effective model, as demonstrated by the extensive experiments
carried out on four benchmarks: Avenue, ShanghaiTech, UBnormal and UCSD Ped2.
The empirical results show that our model achieves an excellent trade-off
between speed and accuracy, obtaining competitive AUC scores, while processing
1655 FPS. Hence, our model is between 8 and 70 times faster than competing
methods. We also conduct an ablation study to justify our design. Our code is
freely available at: https://github.com/ristea/aed-mae.
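To make the abstract's first two ideas concrete, the sketch below shows, in plain PyTorch, how motion-gradient token weighting and a teacher-student decoder discrepancy could be wired together. It is a minimal illustration, not the released aed-mae code: the class and function names, the tiny two-layer encoder, the omission of token masking, and the loss weighting are all assumptions made for brevity.

```python
# Hypothetical sketch of motion-gradient token weighting and a
# teacher/student decoder discrepancy score (not the official aed-mae code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def motion_gradient_weights(prev_frame, frame, patch_size=16):
    """Weight each patch token by its mean absolute temporal gradient,
    shifting the reconstruction focus toward moving foreground objects."""
    grad = (frame - prev_frame).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    per_patch = F.avg_pool2d(grad, kernel_size=patch_size)       # (B, 1, H/ps, W/ps)
    weights = per_patch.flatten(1)                                # (B, num_tokens)
    return weights / (weights.sum(dim=1, keepdim=True) + 1e-6)


class TinyDecoder(nn.Module):
    def __init__(self, dim, patch_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, patch_dim))

    def forward(self, tokens):
        return self.net(tokens)


class TeacherStudentAE(nn.Module):
    """Shared lightweight encoder with two decoders; the student mimics the
    teacher on normal data, so their outputs diverge on anomalous frames."""

    def __init__(self, dim=128, patch_dim=16 * 16 * 3):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.teacher = TinyDecoder(dim, patch_dim)
        self.student = TinyDecoder(dim, patch_dim)

    def forward(self, patches):                 # patches: (B, num_tokens, patch_dim)
        latent = self.encoder(self.embed(patches))
        return self.teacher(latent), self.student(latent)


def training_loss(model, patches, token_weights):
    t_out, s_out = model(patches)
    rec = ((t_out - patches) ** 2).mean(dim=-1)              # per-token reconstruction
    distill = ((s_out - t_out.detach()) ** 2).mean(dim=-1)   # self-distillation term
    return ((rec + distill) * token_weights).sum(dim=1).mean()


def anomaly_score(model, patches):
    """Frame-level score: reconstruction error plus teacher-student discrepancy."""
    with torch.no_grad():
        t_out, s_out = model(patches)
        rec = ((t_out - patches) ** 2).mean(dim=-1)
        disc = ((s_out - t_out) ** 2).mean(dim=-1)
        return (rec + disc).amax(dim=1)                       # max over tokens
```

At inference time, the per-token reconstruction error and the teacher-student disagreement are combined into a frame-level score; the paper's third component (synthetic anomalies and reconstruction of pixel-level anomaly maps) is omitted from this sketch.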
Related papers
- 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder -- we refer to this as 4D modeling.
To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning.
In addition, we propose three novel one-pass beam search algorithms by combining three decoders.
arXiv Detail & Related papers (2024-06-05T05:18:20Z)
- Asymmetric Masked Distillation for Pre-Training Small Foundation Models [52.56257450614992]
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding.
This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks.
We propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding.
arXiv Detail & Related papers (2023-11-06T14:44:34Z)
- Multi-level Memory-augmented Appearance-Motion Correspondence Framework for Video Anomaly Detection [1.9511777443446219]
We propose a multi-level memory-augmented appearance-motion correspondence framework.
The latent correspondence between appearance and motion is explored via appearance-motion semantics alignment and semantics replacement training.
Our framework outperforms the state-of-the-art methods, achieving AUCs of 99.6%, 93.8%, and 76.3% on UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets.
arXiv Detail & Related papers (2023-03-09T08:43:06Z)
- Lightning Fast Video Anomaly Detection via Adversarial Knowledge Distillation [106.42167050921718]
We propose a very fast frame-level model for anomaly detection in video.
It learns to detect anomalies by distilling knowledge from multiple highly accurate object-level teacher models.
Our model achieves the best trade-off between speed and accuracy, due to its previously unheard-of speed of 1480 FPS.
arXiv Detail & Related papers (2022-11-28T17:50:19Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to match the accuracy of models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- Is Space-Time Attention All You Need for Video Understanding? [50.78676438502343]
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
"TimeSformer" adapts the standard Transformer architecture to video by enabling feature learning from a sequence of frame-level patches.
TimeSformer achieves state-of-the-art results on several major action recognition benchmarks.
arXiv Detail & Related papers (2021-02-09T19:49:33Z)
- Encoding Syntactic Knowledge in Transformer Encoder for Intent Detection and Slot Filling [6.234581622120001]
We propose a novel Transformer encoder-based architecture with syntactical knowledge encoded for intent detection and slot filling.
We encode syntactic knowledge into the Transformer encoder by jointly training it to predict syntactic parse ancestors and part-of-speech of each token via multi-task learning.
arXiv Detail & Related papers (2020-12-21T21:25:11Z)
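As a rough illustration of the multi-task training described in this last entry (main intent and slot heads plus auxiliary heads predicting part-of-speech tags and parse-tree ancestors), here is a small PyTorch sketch. The layer sizes, head names, and loss weight are assumptions for illustration, not details taken from the paper.

```python
# Illustrative multi-task heads on a shared Transformer encoder
# (assumed structure; not the paper's actual architecture).
import torch
import torch.nn as nn


class SyntaxAwareEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, num_intents=20,
                 num_slots=50, num_pos_tags=17, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Main task heads.
        self.intent_head = nn.Linear(dim, num_intents)   # utterance-level intent
        self.slot_head = nn.Linear(dim, num_slots)       # token-level slot labels
        # Auxiliary syntactic heads used for multi-task training.
        self.pos_head = nn.Linear(dim, num_pos_tags)     # part-of-speech per token
        self.ancestor_head = nn.Linear(dim, max_len)     # position of parse-tree ancestor

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))          # (B, T, dim)
        return {
            "intent": self.intent_head(h[:, 0]),         # first token summarizes the utterance
            "slots": self.slot_head(h),
            "pos": self.pos_head(h),
            "ancestor": self.ancestor_head(h),
        }


def multitask_loss(outputs, targets, aux_weight=0.3):
    ce = nn.CrossEntropyLoss()
    main = ce(outputs["intent"], targets["intent"]) + \
           ce(outputs["slots"].flatten(0, 1), targets["slots"].flatten())
    aux = ce(outputs["pos"].flatten(0, 1), targets["pos"].flatten()) + \
          ce(outputs["ancestor"].flatten(0, 1), targets["ancestor"].flatten())
    return main + aux_weight * aux
```

In a setup like this, the auxiliary syntactic heads only steer training; at test time, just the intent and slot predictions are used.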