Long-Short Temporal Co-Teaching for Weakly Supervised Video Anomaly
Detection
- URL: http://arxiv.org/abs/2303.18044v2
- Date: Tue, 4 Apr 2023 07:05:46 GMT
- Title: Long-Short Temporal Co-Teaching for Weakly Supervised Video Anomaly
Detection
- Authors: Shengyang Sun, Xiaojin Gong
- Abstract summary: Weakly supervised anomaly detection (WS-VAD) is a challenging problem that aims to learn VAD models only with video-level annotations.
Our proposed method better handles anomalies of varying durations as well as subtle anomalies.
- Score: 14.721615285883423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised video anomaly detection (WS-VAD) is a challenging problem
that aims to learn VAD models only with video-level annotations. In this work,
we propose a Long-Short Temporal Co-teaching (LSTC) method to address the
WS-VAD problem. It constructs two tubelet-based spatio-temporal transformer
networks to learn from short- and long-term video clips respectively. Each
network is trained with respect to a multiple instance learning (MIL)-based
ranking loss, together with a cross-entropy loss when clip-level pseudo labels
are available. A co-teaching strategy is adopted to train the two networks.
That is, clip-level pseudo labels generated from each network are used to
supervise the other at the next training round, and the two networks are
learned alternately and iteratively. Our proposed method better handles
anomalies of varying durations as well as subtle anomalies.
Extensive experiments on three public datasets demonstrate that our method
outperforms state-of-the-art WS-VAD methods.
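A minimal sketch of the alternating co-teaching loop described in the abstract, assuming PyTorch; the model objects, the pseudo-label threshold `thresh`, and the unweighted loss sum are hypothetical stand-ins, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def mil_ranking_loss(scores_abn, scores_nrm, margin=1.0):
    # MIL ranking: the top clip score of an anomalous video should
    # exceed the top clip score of a normal video by a margin.
    return F.relu(margin - scores_abn.max() + scores_nrm.max())

def co_teaching_round(student, teacher, loader, optimizer, thresh=0.5):
    """One training round: the frozen `teacher` supplies clip-level
    pseudo labels that supervise `student`; roles swap next round."""
    teacher.eval()
    student.train()
    for clips_abn, clips_nrm in loader:   # video-level labels only
        s_abn = student(clips_abn)        # clip scores in (0, 1), assumed
        s_nrm = student(clips_nrm)        # to come from a sigmoid head
        loss = mil_ranking_loss(s_abn, s_nrm)
        with torch.no_grad():             # clip-level pseudo labels
            pseudo = (teacher(clips_abn) > thresh).float()
        loss = loss + F.binary_cross_entropy(s_abn, pseudo)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In the next round the two networks swap roles, so each one is supervised by the pseudo labels its peer generated.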
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
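As a rough illustration of the Sinkhorn-guided clustering idea above (a sketch only; SIGMA's actual formulation may differ), Sinkhorn-Knopp iterations can turn tube-feature/cluster similarities into soft assignments that spread mass evenly across the clusters:

```python
import torch

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Balanced soft assignment of N tube features to K clusters.
    `scores` is an (N, K) similarity matrix; rows and columns are
    alternately normalized so every cluster receives ~N/K mass."""
    q = torch.exp(scores / eps)
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True)  # equalize cluster mass
        q = q / q.sum(dim=1, keepdim=True)  # each feature sums to 1
    return q
```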
- Weakly Supervised Two-Stage Training Scheme for Deep Video Fight Detection Model [0.0]
Fight detection in videos is an emerging deep learning application with today's prevalence of surveillance systems and streaming media.
Previous work has largely relied on action recognition techniques to tackle this problem.
We design the fight detection model as a composition of an action-aware feature extractor and an anomaly score generator.
arXiv Detail & Related papers (2022-09-23T08:29:16Z)
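The two-part composition described above might be sketched as follows; the module names, feature dimension, and scorer head are assumptions, not the paper's architecture:

```python
import torch.nn as nn

class FightDetector(nn.Module):
    """Composition of an action-aware feature extractor and an
    anomaly score generator (a sketch, not the authors' model)."""
    def __init__(self, feature_extractor, feat_dim=512):
        super().__init__()
        self.backbone = feature_extractor  # e.g. a pretrained action-recognition net
        self.scorer = nn.Sequential(       # clip-level anomaly score in (0, 1)
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, clip):
        return self.scorer(self.backbone(clip))
```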
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
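A generic way to bolt multi-head self-attention onto 3D convolutional features, as the summary above describes; shapes and hyperparameters here are assumed, not taken from SSMTL++:

```python
import torch.nn as nn

class AttentionNeck(nn.Module):
    """Flatten a 3D-conv feature map (B, C, T, H, W) into tokens and
    apply multi-head self-attention (a generic sketch)."""
    def __init__(self, channels=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, T*H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)  # self-attention
        return out.transpose(1, 2).reshape(b, c, t, h, w)
```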
- Large Scale Time-Series Representation Learning via Simultaneous Low and High Frequency Feature Bootstrapping [7.0064929761691745]
We propose a non-contrastive self-supervised learning approach that efficiently captures low- and high-frequency time-varying features.
Our method takes raw time series data as input and creates two different augmented views for two branches of the model.
To demonstrate the robustness of our model, we performed extensive experiments and ablation studies on five real-world time-series datasets.
arXiv Detail & Related papers (2022-04-24T14:39:47Z)
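One plausible construction of the two frequency-specific views, assuming a moving average and its residual separate low and high frequencies; the paper's exact augmentations may differ:

```python
import torch
import torch.nn.functional as F

def two_frequency_views(x, kernel=9):
    """Split a raw series (B, 1, T) into a low-frequency view (moving
    average) and a high-frequency view (residual), one per branch."""
    weight = torch.full((1, 1, kernel), 1.0 / kernel)
    low = F.conv1d(x, weight, padding=kernel // 2)  # low-pass smoothing
    high = x - low                                  # high-pass residual
    return low, high
```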
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
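The pseudo-action construction and the alignment objective can be sketched as below; the tensor layout and the cosine-based agreement term are illustrative assumptions:

```python
import torch.nn.functional as F

def paste_pseudo_action(source, target, start, length, pos):
    """Cut `length` consecutive clips from `source` and paste them at
    temporal position `pos` of a copy of `target`; tensors are stacks
    of clips, e.g. (T, C, H, W)."""
    synthetic = target.clone()
    synthetic[pos:pos + length] = source[start:start + length]
    return synthetic

def alignment_loss(feats_a, feats_b):
    # maximize agreement between the two pasted regions' features
    return 1.0 - F.cosine_similarity(feats_a, feats_b, dim=-1).mean()
```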
- Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition [19.220288614585147]
We address the problem of capturing temporal information for video classification in 2D networks, without increasing computational cost.
We propose a novel sampling strategy, where we re-order the channels of the input video, to capture short-term frame-to-frame changes.
Our sampling strategies do not require training from scratch and do not increase the computational cost of training and testing.
arXiv Detail & Related papers (2022-01-25T15:24:37Z)
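One concrete instance of such channel re-ordering (illustrative; the paper proposes several strategies) builds each output frame from the color channels of three consecutive input frames, so short-term motion shows up as channel misalignment at no extra compute:

```python
import torch

def channel_sample(video):
    """Re-order channels so each output frame mixes channels from
    three consecutive frames of `video` (T, 3, H, W): R from t-1,
    G from t, B from t+1."""
    r = video[:-2, 0]   # red from previous frames
    g = video[1:-1, 1]  # green from current frames
    b = video[2:, 2]    # blue from next frames
    return torch.stack([r, g, b], dim=1)  # (T-2, 3, H, W)
```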
- Video Abnormal Event Detection by Learning to Complete Visual Cloze Tests [50.1446994599891]
Video abnormal event detection (VAD) is a vital semi-supervised task that requires learning with only roughly labeled normal videos.
We propose a novel approach named visual cloze completion (VCC), which performs VAD by learning to complete "visual cloze tests" (VCTs).
We show that VCC achieves state-of-the-art VAD performance.
arXiv Detail & Related papers (2021-08-05T04:05:36Z)
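A "visual cloze test" can be imagined as blanking part of a temporal context and learning to complete it; the sketch below is an illustrative reading, not the paper's exact design:

```python
def make_cloze_test(clip, blank_idx):
    """Blank out one frame of `clip` (T, C, H, W); a model must then
    complete it from the remaining context, e.g. with an MSE loss
    against `target` (illustrative sketch)."""
    context = clip.clone()
    target = clip[blank_idx].clone()
    context[blank_idx] = 0.0  # the "blank" in the cloze test
    return context, target
```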
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
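A generic consistency objective between positive samples (e.g., two clips of the same video played at different speeds) might look like this sketch; it is not ASCNet's exact loss:

```python
import torch.nn.functional as F

def consistency_loss(z1, z2):
    """Pull the representations of two positive samples together;
    for unit vectors this equals their mean squared distance."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    return (2 - 2 * (z1 * z2).sum(dim=-1)).mean()
```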
- Weakly-supervised Video Anomaly Detection with Contrastive Learning of Long and Short-range Temporal Features [26.474395581531194]
We propose a novel method named Multi-scale Temporal Network trained with top-K Contrastive Multiple Instance Learning (MTN-KMIL).
Our method outperforms several state-of-the-art methods by a large margin on three benchmark data sets.
arXiv Detail & Related papers (2021-01-25T12:04:00Z)
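The top-K MIL ranking idea can be sketched as ranking the mean of the k highest clip scores of an anomalous video above that of a normal video; `k` and `margin` are hypothetical defaults, and this is not MTN-KMIL's exact loss:

```python
import torch.nn.functional as F

def topk_mil_ranking(scores_abn, scores_nrm, k=3, margin=1.0):
    """Hinge loss pushing the mean of the top-k clip scores of an
    anomalous video above that of a normal video by a margin."""
    top_abn = scores_abn.topk(k).values.mean()
    top_nrm = scores_nrm.topk(k).values.mean()
    return F.relu(margin - top_abn + top_nrm)
```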
- Recurrent Multi-view Alignment Network for Unsupervised Surface Registration [79.72086524370819]
Learning non-rigid registration in an end-to-end manner is challenging due to the inherent high degrees of freedom and the lack of labeled training data.
We propose to represent the non-rigid transformation with a point-wise combination of several rigid transformations.
We also introduce a differentiable loss function that measures the 3D shape similarity on the projected multi-view 2D depth images.
arXiv Detail & Related papers (2020-11-24T14:22:42Z)
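Representing a non-rigid deformation as a point-wise combination of rigid transformations can be sketched as below; in the paper the per-point weights would be predicted by a network, and everything here is illustrative:

```python
import torch

def blend_rigid(points, rotations, translations, weights):
    """Deform `points` (N, 3) by a point-wise convex combination of K
    rigid transforms: p' = sum_k w[n, k] * (R_k @ p + t_k).
    `rotations` is (K, 3, 3), `translations` (K, 3), `weights` (N, K)."""
    # apply every rigid transform to every point: (K, N, 3)
    moved = torch.einsum('kij,nj->kni', rotations, points) + translations[:, None, :]
    # blend per point with normalized weights
    w = weights / weights.sum(dim=1, keepdim=True)
    return torch.einsum('nk,kni->ni', w, moved)
```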