JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos
- URL: http://arxiv.org/abs/2405.02961v1
- Date: Sun, 5 May 2024 15:01:00 GMT
- Title: JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos
- Authors: Pietro Nardelli, Danilo Comminiello
- Abstract summary: We introduce JOSENet, a novel self-supervised framework for violence detection in surveillance videos.
JOSENet receives two spatiotemporal video streams, i.e., RGB frames and optical flows, and involves a new regularized self-supervised learning approach for videos.
It provides improved performance compared to self-supervised state-of-the-art methods, while requiring one-fourth of the number of frames per video segment and a reduced frame rate.
- Score: 4.94659999696881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the ever-increasing availability of video surveillance cameras and the growing need for crime prevention, the violence detection task is attracting greater attention from the research community. With respect to other action recognition tasks, violence detection in surveillance videos shows additional issues, such as the presence of a significant variety of real fight scenes. Unfortunately, available datasets seem to be very small compared with other action recognition datasets. Moreover, in surveillance applications, people in the scenes always differ for each video and the background of the footage differs for each camera. Also, violent actions in real-life surveillance videos must be detected quickly to prevent unwanted consequences, thus models would definitely benefit from a reduction in memory usage and computational costs. Such problems make classical action recognition methods difficult to be adopted. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model receives two spatiotemporal video streams, i.e., RGB frames and optical flows, and involves a new regularized self-supervised learning approach for videos. JOSENet provides improved performance compared to self-supervised state-of-the-art methods, while requiring one-fourth of the number of frames per video segment and a reduced frame rate. The source code and the instructions to reproduce our experiments are available at https://github.com/ispamm/JOSENet.
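The two-stream design described in the abstract can be sketched at a high level: the RGB stream and the optical-flow stream are each encoded into an embedding, and the two embeddings are fused before a binary violent/non-violent decision. The following NumPy sketch is purely illustrative; the toy linear "encoders", shapes, and names below are stand-ins, not JOSENet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy clip: 16 RGB frames and 16 optical-flow fields, flattened per stream.
T, H, W = 16, 8, 8
rgb_clip = rng.standard_normal((T, H, W, 3)).reshape(-1)   # RGB stream input
flow_clip = rng.standard_normal((T, H, W, 2)).reshape(-1)  # optical-flow stream input

EMB = 32  # per-stream embedding size (illustrative)

# Stand-in "encoders": random linear projections followed by ReLU.
W_rgb = rng.standard_normal((EMB, rgb_clip.size)) * 0.01
W_flow = rng.standard_normal((EMB, flow_clip.size)) * 0.01

def encode(x, W):
    # Linear projection + ReLU, standing in for a real video network.
    return np.maximum(W @ x, 0.0)

z_rgb = encode(rgb_clip, W_rgb)
z_flow = encode(flow_clip, W_flow)

# Joint embedding: concatenate the two stream embeddings, then apply a
# binary (violent / non-violent) linear head with a sigmoid.
z_joint = np.concatenate([z_rgb, z_flow])          # shape (2 * EMB,)
w_head = rng.standard_normal(z_joint.size) * 0.01
p_violent = 1.0 / (1.0 + np.exp(-(w_head @ z_joint)))

print(z_joint.shape, float(p_violent))
```

In a real system the linear projections would be deep spatiotemporal networks and the fusion/head would be trained end to end; the sketch only shows how two modality-specific embeddings feed one joint decision.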
Related papers
- Video Vision Transformers for Violence Detection [0.0]
The proposed solution uses a novel end-to-end deep learning-based video vision transformer (ViViT) that can proficiently discern fights, hostile movements, and violent events in video sequences.
The evaluated results can be subsequently sent to local concerned authority, and the captured video can be analyzed.
arXiv Detail & Related papers (2022-09-08T04:44:01Z) - Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z) - Real Time Action Recognition from Video Footage [0.5219568203653523]
Video surveillance cameras have added a new dimension to crime detection.
This research focuses on integrating state-of-the-art Deep Learning methods to ensure a robust pipeline for autonomous surveillance for detecting violent activities.
arXiv Detail & Related papers (2021-12-13T07:27:41Z) - JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting [53.28477676794658]
Unsupervised motion retargeting in videos has seen substantial advancements through the use of deep neural networks.
We introduce JOKR - a JOint Keypoint Representation that handles both the source and target videos, without requiring any object prior or data collection.
We evaluate our method both qualitatively and quantitatively, and demonstrate that our method handles various cross-domain scenarios, such as different animals, different flowers, and humans.
arXiv Detail & Related papers (2021-06-17T17:32:32Z) - ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
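The contrastive loss mentioned in this summary, with clips as instances discriminated from each other, is commonly formulated as an InfoNCE-style objective: pull an anchor clip's embedding toward a positive (e.g., another clip of the same video) and push it away from negatives. The sketch below is a generic illustration of that objective, not ASCNet's specific loss; the embedding size, temperature, and sample construction are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x):
    # Project embeddings onto the unit sphere, as is typical for
    # cosine-similarity contrastive losses.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy embeddings: one anchor clip, one positive (a perturbed view of the
# anchor), and a batch of negatives (clips from other videos).
anchor = l2_normalize(rng.standard_normal(128))
positive = l2_normalize(anchor + 0.1 * rng.standard_normal(128))
negatives = l2_normalize(rng.standard_normal((8, 128)))

tau = 0.07  # temperature (a common default, illustrative)
logits = np.concatenate([[anchor @ positive], negatives @ anchor]) / tau

# InfoNCE: cross-entropy with the positive treated as the correct class
# (index 0 in the logits vector).
loss = -logits[0] + np.log(np.sum(np.exp(logits)))
print(float(loss))
```

Because the positive is built as a small perturbation of the anchor, its similarity dominates the negatives and the loss stays small; with random positives it would be much larger.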
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - Anomaly Recognition from surveillance videos using 3D Convolutional Neural Networks [0.0]
Anomalous activity recognition deals with identifying the patterns and events that vary from the normal stream.
This study provides a simple, yet effective approach for learning features using deep 3-dimensional convolutional networks (3D ConvNets) trained on the University of Central Florida (UCF) Crime video dataset.
arXiv Detail & Related papers (2021-01-04T16:32:48Z) - Enhanced Few-shot Learning for Intrusion Detection in Railway Video Surveillance [16.220077781635748]
An enhanced model-agnostic meta-learner is trained using both the original video frames and segmented masks of track area extracted from the video.
Numerical results show that the enhanced meta-learner successfully adapts to unseen scenes with only a few newly collected video frame samples.
arXiv Detail & Related papers (2020-11-09T08:59:15Z) - Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method by frame prediction with proper design.
Our proposed method obtains the frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
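The core idea behind frame-prediction anomaly detection, as summarized above, is that a model trained to predict the next frame of normal footage will produce large prediction errors on anomalous frames. The toy sketch below uses a naive linear extrapolation as the "predictor" and mean squared error as the anomaly score; the predictor, the injected outlier, and all shapes are illustrative assumptions, not the paper's multi-path method.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy grayscale video: smoothly brightening frames with one abrupt outlier.
frames = [np.full((4, 4), float(t)) for t in range(10)]
frames[6] = rng.standard_normal((4, 4)) * 50.0  # injected "anomaly"

def predict_next(history):
    # Naive stand-in predictor: linear extrapolation from the last two frames.
    return 2 * history[-1] - history[-2]

scores = []
for t in range(2, len(frames)):
    pred = predict_next(frames[:t])
    # Anomaly score: mean squared prediction error for frame t.
    scores.append(float(np.mean((frames[t] - pred) ** 2)))

# Scores spike at and just after the injected outlier, since both the
# anomalous frame itself and predictions made from it are wrong.
print(scores)
```

Real systems replace the extrapolation with a trained generative or predictive network and threshold the per-frame error (often converted to a frame-level score, as with the AUROC reported above).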
arXiv Detail & Related papers (2020-11-05T11:34:12Z) - TinyVIRAT: Low-resolution Video Action Recognition [70.37277191524755]
In real-world surveillance environments, the actions in videos are captured at a wide range of resolutions.
We introduce a benchmark dataset, TinyVIRAT, which contains natural low-resolution activities.
We propose a novel method for recognizing tiny actions in videos which utilizes a progressive generative approach.
arXiv Detail & Related papers (2020-07-14T21:09:18Z) - Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos [72.50607929306058]
We propose a real-time online system to perform activity detection on untrimmed security videos.
The proposed method consists of three stages: tubelet extraction, activity classification and online tubelet merging.
We demonstrate the effectiveness of the proposed approach in terms of both speed (100 fps) and performance, achieving state-of-the-art results.
arXiv Detail & Related papers (2020-04-23T22:20:10Z) - Vision-based Fight Detection from Surveillance Cameras [6.982738885923204]
This paper explores LSTM-based approaches to solve the fight scene classification problem.
A new dataset is collected, consisting of fight scenes from surveillance camera videos available on YouTube.
It is observed that the proposed approach, which integrates the Xception model, Bi-LSTM, and attention, improves the state-of-the-art accuracy for fight scene classification.
arXiv Detail & Related papers (2020-02-11T12:56:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.