JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos
- URL: http://arxiv.org/abs/2405.02961v2
- Date: Sat, 3 Aug 2024 18:49:02 GMT
- Title: JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos
- Authors: Pietro Nardelli, Danilo Comminiello
- Abstract summary: Violence detection in surveillance videos presents additional issues, such as the wide variety of real fight scenes.
We introduce JOSENet, a self-supervised framework that provides outstanding performance for violence detection in surveillance videos.
- Score: 4.94659999696881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing proliferation of video surveillance cameras and the escalating demand for crime prevention have intensified interest in the task of violence detection within the research community. Compared to other action recognition tasks, violence detection in surveillance videos presents additional issues, such as the wide variety of real fight scenes. Unfortunately, existing datasets for violence detection are relatively small in comparison to those for other action recognition tasks. Moreover, surveillance footage often features different individuals in each video and varying backgrounds for each camera. In addition, fast detection of violent actions in real-life surveillance videos is crucial to prevent adverse outcomes, thus necessitating models that are optimized for reduced memory usage and computational costs. These challenges complicate the application of traditional action recognition methods. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model processes two spatiotemporal video streams, namely RGB frames and optical flows, and incorporates a new regularized self-supervised learning approach for videos. JOSENet demonstrates improved performance compared to state-of-the-art methods, while utilizing only one-fourth of the frames per video segment and operating at a reduced frame rate. The source code is available at https://github.com/ispamm/JOSENet.
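The abstract describes two input streams (RGB frames and optical flows) and a reduced number of frames per segment. The following is a minimal sketch of how such inputs might be prepared. It is not the authors' code: the function names, the clip shape, and the choice of keeping every fourth frame are assumptions made for illustration, and the motion stream is a simple frame-difference proxy rather than true optical flow.

```python
import numpy as np


def subsample_frames(clip: np.ndarray, keep_every: int = 4) -> np.ndarray:
    """Keep every `keep_every`-th frame of a (T, H, W, C) clip.

    Assumption: a uniform stride is one simple way to use one-fourth
    of the frames; the paper may select frames differently.
    """
    return clip[::keep_every]


def motion_proxy(frames: np.ndarray) -> np.ndarray:
    """Absolute grayscale difference between consecutive frames.

    This is a crude stand-in for optical flow, NOT an optical flow
    estimate. Returns an array of shape (T - 1, H, W).
    """
    gray = frames.astype(np.float32).mean(axis=-1)  # (T, H, W)
    return np.abs(np.diff(gray, axis=0))            # (T - 1, H, W)


def two_stream_inputs(clip: np.ndarray, keep_every: int = 4):
    """Return temporally aligned (rgb, motion) arrays for a two-stream model."""
    rgb = subsample_frames(clip, keep_every)
    motion = motion_proxy(rgb)
    # Drop the first RGB frame so both streams have the same length.
    return rgb[1:], motion


# Hypothetical 64-frame, 32x32 RGB clip filled with random pixels.
rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(64, 32, 32, 3), dtype=np.uint8)
rgb, motion = two_stream_inputs(clip)
print(rgb.shape, motion.shape)
```

In a real pipeline the motion stream would come from an optical flow estimator, and each stream would be fed to its own network branch. The repository linked in the abstract is the place to check the actual preprocessing.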
Related papers
- Video Vision Transformers for Violence Detection [0.0]
The proposed solution uses a novel end-to-end deep learning-based video vision transformer (ViViT) that can proficiently discern fights, hostile movements, and violent events in video sequences.
The evaluated results can subsequently be sent to the relevant local authority, and the captured video can be analyzed.
arXiv Detail & Related papers (2022-09-08T04:44:01Z) - Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z) - Detecting Violence in Video Based on Deep Features Fusion Technique [0.30458514384586394]
This work proposed a novel method to detect violence using a fusion technique of two convolutional neural networks (CNNs).
The performance of the proposed method is evaluated using three standard benchmark datasets in terms of detection accuracy.
arXiv Detail & Related papers (2022-04-15T12:51:20Z) - Real Time Action Recognition from Video Footage [0.5219568203653523]
Video surveillance cameras have added a new dimension to crime detection.
This research focuses on integrating state-of-the-art Deep Learning methods to ensure a robust pipeline for autonomous surveillance for detecting violent activities.
arXiv Detail & Related papers (2021-12-13T07:27:41Z) - JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting [53.28477676794658]
Unsupervised motion retargeting in videos has seen substantial advancements through the use of deep neural networks.
We introduce JOKR - a JOint Keypoint Representation that handles both the source and target videos, without requiring any object prior or data collection.
We evaluate our method both qualitatively and quantitatively, and demonstrate that our method handles various cross-domain scenarios, such as different animals, different flowers, and humans.
arXiv Detail & Related papers (2021-06-17T17:32:32Z) - Enhanced Few-shot Learning for Intrusion Detection in Railway Video Surveillance [16.220077781635748]
An enhanced model-agnostic meta-learner is trained using both the original video frames and segmented masks of track area extracted from the video.
Numerical results show that the enhanced meta-learner successfully adapts to unseen scenes with only a few newly collected video frame samples.
arXiv Detail & Related papers (2020-11-09T08:59:15Z) - Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method by frame prediction with proper design.
Our proposed method obtains the frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
arXiv Detail & Related papers (2020-11-05T11:34:12Z) - TinyVIRAT: Low-resolution Video Action Recognition [70.37277191524755]
In real-world surveillance environments, the actions in videos are captured at a wide range of resolutions.
We introduce a benchmark dataset, TinyVIRAT, which contains natural low-resolution activities.
We propose a novel method for recognizing tiny actions in videos which utilizes a progressive generative approach.
arXiv Detail & Related papers (2020-07-14T21:09:18Z) - Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos [72.50607929306058]
We propose a real-time online system to perform activity detection on untrimmed security videos.
The proposed method consists of three stages: tubelet extraction, activity classification and online tubelet merging.
We demonstrate the effectiveness of the proposed approach in terms of speed (100 fps) and performance with state-of-the-art results.
arXiv Detail & Related papers (2020-04-23T22:20:10Z) - Multi-Modal Video Forensic Platform for Investigating Post-Terrorist Attack Scenarios [55.82693757287532]
Large scale Video Analytic Platforms (VAP) assist law enforcement agencies (LEA) in identifying suspects and securing evidence.
We present a video analytic platform that integrates visual and audio analytic modules and fuses information from surveillance cameras and video uploads from eyewitnesses.
arXiv Detail & Related papers (2020-04-02T14:29:27Z) - Vision-based Fight Detection from Surveillance Cameras [6.982738885923204]
This paper explores LSTM-based approaches to solve the fight scene classification problem.
A new dataset is collected, which consists of fight scenes from surveillance camera videos available on YouTube.
It is observed that the proposed approach, which integrates the Xception model, Bi-LSTM, and attention, improves the state-of-the-art accuracy for fight scene classification.
arXiv Detail & Related papers (2020-02-11T12:56:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.