JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos
- URL: http://arxiv.org/abs/2405.02961v1
- Date: Sun, 5 May 2024 15:01:00 GMT
- Title: JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos
- Authors: Pietro Nardelli, Danilo Comminiello
- Abstract summary: We introduce JOSENet, a novel self-supervised framework for violence detection in surveillance videos.
JOSENet receives two spatiotemporal video streams, i.e., RGB frames and optical flows, and involves a new regularized self-supervised learning approach for videos.
It provides improved performance compared to self-supervised state-of-the-art methods, while requiring one-fourth of the number of frames per video segment and a reduced frame rate.
- Score: 4.94659999696881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the ever-increasing availability of video surveillance cameras and the growing need for crime prevention, the violence detection task is attracting greater attention from the research community. With respect to other action recognition tasks, violence detection in surveillance videos shows additional issues, such as the presence of a significant variety of real fight scenes. Unfortunately, available datasets seem to be very small compared with other action recognition datasets. Moreover, in surveillance applications, people in the scenes always differ for each video and the background of the footage differs for each camera. Also, violent actions in real-life surveillance videos must be detected quickly to prevent unwanted consequences, thus models would definitely benefit from a reduction in memory usage and computational costs. Such problems make classical action recognition methods difficult to be adopted. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model receives two spatiotemporal video streams, i.e., RGB frames and optical flows, and involves a new regularized self-supervised learning approach for videos. JOSENet provides improved performance compared to self-supervised state-of-the-art methods, while requiring one-fourth of the number of frames per video segment and a reduced frame rate. The source code and the instructions to reproduce our experiments are available at https://github.com/ispamm/JOSENet.
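The two-stream design described in the abstract can be sketched at a high level: the RGB stream and the optical-flow stream are each encoded into an embedding, and the two embeddings are fused before a binary violent/non-violent decision. The following NumPy sketch is purely illustrative; the toy linear "encoders", shapes, and names below are stand-ins, not JOSENet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy clip: 16 RGB frames and 16 optical-flow fields, flattened per stream.
T, H, W = 16, 8, 8
rgb_clip = rng.standard_normal((T, H, W, 3)).reshape(-1)   # RGB stream input
flow_clip = rng.standard_normal((T, H, W, 2)).reshape(-1)  # optical-flow stream input

EMB = 32  # per-stream embedding size (illustrative)

# Stand-in "encoders": random linear projections followed by ReLU.
W_rgb = rng.standard_normal((EMB, rgb_clip.size)) * 0.01
W_flow = rng.standard_normal((EMB, flow_clip.size)) * 0.01

def encode(x, W):
    # Linear projection + ReLU, standing in for a real video network.
    return np.maximum(W @ x, 0.0)

z_rgb = encode(rgb_clip, W_rgb)
z_flow = encode(flow_clip, W_flow)

# Joint embedding: concatenate the two stream embeddings, then apply a
# binary (violent / non-violent) linear head with a sigmoid.
z_joint = np.concatenate([z_rgb, z_flow])          # shape (2 * EMB,)
w_head = rng.standard_normal(z_joint.size) * 0.01
p_violent = 1.0 / (1.0 + np.exp(-(w_head @ z_joint)))

print(z_joint.shape, float(p_violent))
```

In a real system the linear projections would be deep spatiotemporal networks and the fusion/head would be trained end to end; the sketch only shows how two modality-specific embeddings feed one joint decision.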
Related papers
- Video Vision Transformers for Violence Detection [0.0]
The proposed solution uses a novel end-to-end deep learning-based video vision transformer (ViViT) that can proficiently discern fights, hostile movements, and violent events in video sequences.
The evaluated results can be subsequently sent to local concerned authority, and the captured video can be analyzed.
arXiv Detail & Related papers (2022-09-08T04:44:01Z) - Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z) - Real Time Action Recognition from Video Footage [0.5219568203653523]
Video surveillance cameras have added a new dimension to crime detection.
This research focuses on integrating state-of-the-art Deep Learning methods to ensure a robust pipeline for autonomous surveillance for detecting violent activities.
arXiv Detail & Related papers (2021-12-13T07:27:41Z) - JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting [53.28477676794658]
Unsupervised motion retargeting in videos has seen substantial advancements through the use of deep neural networks.
We introduce JOKR - a JOint Keypoint Representation that handles both the source and target videos, without requiring any object prior or data collection.
We evaluate our method both qualitatively and quantitatively, and demonstrate that our method handles various cross-domain scenarios, such as different animals, different flowers, and humans.
arXiv Detail & Related papers (2021-06-17T17:32:32Z) - ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
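The contrastive loss mentioned in this summary, with clips as instances discriminated from each other, is commonly formulated as an InfoNCE-style objective: pull an anchor clip's embedding toward a positive (e.g., another clip of the same video) and push it away from negatives. The sketch below is a generic illustration of that objective, not ASCNet's specific loss; the embedding size, temperature, and sample construction are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x):
    # Project embeddings onto the unit sphere, as is typical for
    # cosine-similarity contrastive losses.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy embeddings: one anchor clip, one positive (a perturbed view of the
# anchor), and a batch of negatives (clips from other videos).
anchor = l2_normalize(rng.standard_normal(128))
positive = l2_normalize(anchor + 0.1 * rng.standard_normal(128))
negatives = l2_normalize(rng.standard_normal((8, 128)))

tau = 0.07  # temperature (a common default, illustrative)
logits = np.concatenate([[anchor @ positive], negatives @ anchor]) / tau

# InfoNCE: cross-entropy with the positive treated as the correct class
# (index 0 in the logits vector).
loss = -logits[0] + np.log(np.sum(np.exp(logits)))
print(float(loss))
```

Because the positive is built as a small perturbation of the anchor, its similarity dominates the negatives and the loss stays small; with random positives it would be much larger.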
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - Anomaly Recognition from surveillance videos using 3D Convolutional Neural Networks [0.0]
Anomalous activity recognition deals with identifying the patterns and events that vary from the normal stream.
This study provides a simple, yet effective approach for learning features using deep 3-dimensional convolutional networks (3D ConvNets) trained on the University of Central Florida (UCF) Crime video dataset.
arXiv Detail & Related papers (2021-01-04T16:32:48Z) - Enhanced Few-shot Learning for Intrusion Detection in Railway Video Surveillance [16.220077781635748]
An enhanced model-agnostic meta-learner is trained using both the original video frames and segmented masks of track area extracted from the video.
Numerical results show that the enhanced meta-learner successfully adapts to unseen scenes with only a few newly collected video frame samples.
arXiv Detail & Related papers (2020-11-09T08:59:15Z) - Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method by frame prediction with proper design.
Our proposed method obtains the frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
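The core idea behind frame-prediction anomaly detection, as summarized above, is that a model trained to predict the next frame of normal footage will produce large prediction errors on anomalous frames. The toy sketch below uses a naive linear extrapolation as the "predictor" and mean squared error as the anomaly score; the predictor, the injected outlier, and all shapes are illustrative assumptions, not the paper's multi-path method.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy grayscale video: smoothly brightening frames with one abrupt outlier.
frames = [np.full((4, 4), float(t)) for t in range(10)]
frames[6] = rng.standard_normal((4, 4)) * 50.0  # injected "anomaly"

def predict_next(history):
    # Naive stand-in predictor: linear extrapolation from the last two frames.
    return 2 * history[-1] - history[-2]

scores = []
for t in range(2, len(frames)):
    pred = predict_next(frames[:t])
    # Anomaly score: mean squared prediction error for frame t.
    scores.append(float(np.mean((frames[t] - pred) ** 2)))

# Scores spike at and just after the injected outlier, since both the
# anomalous frame itself and predictions made from it are wrong.
print(scores)
```

Real systems replace the extrapolation with a trained generative or predictive network and threshold the per-frame error (often converted to a frame-level score, as with the AUROC reported above).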
arXiv Detail & Related papers (2020-11-05T11:34:12Z) - TinyVIRAT: Low-resolution Video Action Recognition [70.37277191524755]
In real-world surveillance environments, the actions in videos are captured at a wide range of resolutions.
We introduce a benchmark dataset, TinyVIRAT, which contains natural low-resolution activities.
We propose a novel method for recognizing tiny actions in videos which utilizes a progressive generative approach.
arXiv Detail & Related papers (2020-07-14T21:09:18Z) - Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos [72.50607929306058]
We propose a real-time online system to perform activity detection on untrimmed security videos.
The proposed method consists of three stages: tubelet extraction, activity classification and online tubelet merging.
We demonstrate the effectiveness of the proposed approach in terms of both speed (100 fps) and performance, achieving state-of-the-art results.
arXiv Detail & Related papers (2020-04-23T22:20:10Z) - Vision-based Fight Detection from Surveillance Cameras [6.982738885923204]
This paper explores LSTM-based approaches to solve the fight scene classification problem.
A new dataset is collected, consisting of fight scenes from surveillance camera videos available on YouTube.
It is observed that the proposed approach, which integrates the Xception model, Bi-LSTM, and attention, improves the state-of-the-art accuracy for fight scene classification.
arXiv Detail & Related papers (2020-02-11T12:56:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.