SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video
Anomaly Detection
- URL: http://arxiv.org/abs/2207.08003v1
- Date: Sat, 16 Jul 2022 19:25:41 GMT
- Title: SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video
Anomaly Detection
- Authors: Antonio Barbalau, Radu Tudor Ionescu, Mariana-Iuliana Georgescu, Jacob
Dueholm, Bharathkumar Ramachandra, Kamal Nasrollahi, Fahad Shahbaz Khan,
Thomas B. Moeslund, Mubarak Shah
- Abstract summary: We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A self-supervised multi-task learning (SSMTL) framework for video anomaly
detection was recently introduced in the literature. Due to its highly accurate
results, the method has attracted the attention of many researchers. In this work,
we revisit the self-supervised multi-task learning framework, proposing several
updates to the original method. First, we study alternative detection methods,
e.g. detecting high-motion regions via optical flow or background subtraction,
since we believe the currently used pre-trained YOLOv3 detector is suboptimal:
objects in motion and objects from unknown classes are never
detected. Second, we modernize the 3D convolutional backbone by introducing
multi-head self-attention modules, inspired by the recent success of vision
transformers. As such, we introduce, as alternatives, both 2D and 3D convolutional
vision transformer (CvT) blocks. Third, in our attempt to further improve the
model, we study additional self-supervised learning tasks, such as predicting
segmentation maps through knowledge distillation, solving jigsaw puzzles,
estimating body pose through knowledge distillation, predicting masked regions
(inpainting), and adversarial learning with pseudo-anomalies. We conduct
experiments to assess the performance impact of the introduced changes. Upon
finding more promising configurations of the framework, dubbed SSMTL++v1 and
SSMTL++v2, we extend our preliminary experiments to more data sets,
demonstrating that our performance gains are consistent across all data sets.
In most cases, our results on Avenue, ShanghaiTech and UBnormal raise the
state-of-the-art performance to a new level.
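The multi-head self-attention that the abstract describes inserting into the 3D convolutional backbone can be sketched roughly as follows. This is a minimal illustrative toy, not the authors' implementation: the token layout (flattening a spatio-temporal feature map into tokens), the feature sizes, and the random projection weights standing in for learned parameters are all assumptions.

```python
# Toy multi-head self-attention over flattened spatio-temporal tokens,
# illustrating the kind of attention block that can be inserted into a
# 3D convolutional backbone. Shapes and weights are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, num_heads, rng):
    """tokens: (seq_len, dim) array; returns a (seq_len, dim) array."""
    seq_len, dim = tokens.shape
    assert dim % num_heads == 0
    head_dim = dim // num_heads
    # Random projections stand in for learned query/key/value/output weights.
    wq, wk, wv, wo = (rng.standard_normal((dim, dim)) / np.sqrt(dim)
                      for _ in range(4))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    # Split into heads: (num_heads, seq_len, head_dim).
    split = lambda m: m.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    # Scaled dot-product attention per head, then merge heads back.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim), axis=-1)
    out = (attn @ v).transpose(1, 0, 2).reshape(seq_len, dim)
    return out @ wo

rng = np.random.default_rng(0)
# A small 3D feature map (T=4 frames, H=W=8, C=32) flattened to tokens.
feat = rng.standard_normal((4, 8, 8, 32))
tokens = feat.reshape(-1, 32)  # 256 spatio-temporal tokens
out = multi_head_self_attention(tokens, num_heads=4, rng=rng)
print(out.shape)  # (256, 32): same shape, so the result can be folded back
```

Because the output keeps the token shape, such a block can be re-folded into the (T, H, W, C) layout and dropped between convolutional stages, which is the general idea behind convolutional vision transformer (CvT) blocks.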
Related papers
- ODM3D: Alleviating Foreground Sparsity for Semi-Supervised Monocular 3D
Object Detection [15.204935788297226]
ODM3D framework entails cross-modal knowledge distillation at various levels to inject LiDAR-domain knowledge into a monocular detector during training.
By identifying foreground sparsity as the main culprit behind existing methods' suboptimal training, we exploit the precise localisation information embedded in LiDAR points.
Our method ranks 1st in both KITTI validation and test benchmarks, significantly surpassing all existing monocular methods, supervised or semi-supervised.
arXiv Detail & Related papers (2023-10-28T07:12:09Z)
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation [130.08432609780374]
In 3D action recognition, there exists rich complementary information between skeleton modalities.
We propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs.
Our approach outperforms existing self-supervised methods and sets a series of new records.
arXiv Detail & Related papers (2022-08-26T06:06:09Z)
- Supervising Remote Sensing Change Detection Models with 3D Surface Semantics [1.8782750537161614]
We propose Contrastive Surface-Image Pretraining (CSIP) for joint learning using optical RGB and above ground level (AGL) map pairs.
We then evaluate these pretrained models on several building segmentation and change detection datasets to show that our method does, in fact, extract features relevant to downstream applications.
arXiv Detail & Related papers (2022-02-26T23:35:43Z)
- The Devil is in the Task: Exploiting Reciprocal Appearance-Localization Features for Monocular 3D Object Detection [62.1185839286255]
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving.
We introduce a Dynamic Feature Reflecting Network, named DFR-Net.
We rank 1st among all the monocular 3D object detectors in the KITTI test set.
arXiv Detail & Related papers (2021-12-28T07:31:18Z)
- Static-Dynamic Co-Teaching for Class-Incremental 3D Object Detection [71.18882803642526]
Deep learning approaches have shown remarkable performance in the 3D object detection task.
They suffer from a catastrophic performance drop when incrementally learning new classes without revisiting the old data.
This "catastrophic forgetting" phenomenon impedes the deployment of 3D object detection approaches in real-world scenarios.
We present the first solution - SDCoT, a novel static-dynamic co-teaching method.
arXiv Detail & Related papers (2021-12-14T09:03:41Z)
- Anomaly Detection in Video via Self-Supervised and Multi-Task Learning [113.81927544121625]
Anomaly detection in video is a challenging computer vision problem.
In this paper, we approach anomalous event detection in video through self-supervised and multi-task learning at the object level.
arXiv Detail & Related papers (2020-11-15T10:21:28Z)
- MS$^2$L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition [36.74293548921099]
We integrate motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features from different aspects.
Our experiments on the NW-UCLA, NTU RGB+D, and PKUMMD datasets show remarkable performance for action recognition.
arXiv Detail & Related papers (2020-10-12T11:09:44Z)
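The jigsaw-puzzle pretext task mentioned in the abstract above (and used by MS$^2$L) can be sketched as follows. This is a minimal illustrative assumption: the piece granularity (whole frames here), the permutation-dictionary size, and all names are hypothetical, not taken from either paper.

```python
# Minimal sketch of a jigsaw-puzzle pretext task: shuffle the pieces of a
# sample by one of a fixed set of permutations and use the permutation
# index as a free classification label (no manual annotation required).
import itertools
import random

def make_jigsaw_sample(pieces, permutations, rng):
    """Return (shuffled_pieces, label) for a randomly chosen permutation."""
    label = rng.randrange(len(permutations))
    perm = permutations[label]
    # shuffled[j] takes the piece that originally sat at position perm[j].
    return [pieces[i] for i in perm], label

# A small fixed dictionary of 8 permutations of 4 pieces.
permutations = list(itertools.permutations(range(4)))[:8]
rng = random.Random(0)
pieces = ["frame0", "frame1", "frame2", "frame3"]
shuffled, label = make_jigsaw_sample(pieces, permutations, rng)

# Knowing the label lets us undo the shuffle; predicting it from the
# shuffled input is the self-supervised objective.
restored = [None] * len(pieces)
for j, src in enumerate(permutations[label]):
    restored[src] = shuffled[j]
assert restored == pieces
```

In practice the pieces would be spatio-temporal patches and the classifier a small head on top of the shared backbone, but the labeling mechanism is exactly this permutation index.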
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.