Not only Look, but also Listen: Learning Multimodal Violence Detection
under Weak Supervision
- URL: http://arxiv.org/abs/2007.04687v2
- Date: Mon, 13 Jul 2020 04:16:22 GMT
- Title: Not only Look, but also Listen: Learning Multimodal Violence Detection
under Weak Supervision
- Authors: Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu,
Zhiwei Yang
- Abstract summary: We first release a large-scale and multi-scene dataset named XD-Violence with a total duration of 217 hours.
We propose a neural network containing three parallel branches to capture different relations among video snippets and integrate features.
Our method outperforms other state-of-the-art methods on our released dataset and other existing benchmarks.
- Score: 10.859792341257931
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Violence detection has been studied in computer vision for years. However,
previous work is either superficial, e.g., limited to short-clip classification in a
single scenario, or undersupplied, e.g., relying on a single modality or on
hand-crafted multimodal features. To address this problem, in this work we first
release a large-scale, multi-scene dataset named XD-Violence with a total duration
of 217 hours, containing 4754 untrimmed videos with audio signals and weak labels.
We then propose a neural network containing three parallel branches to capture
different relations among video snippets and integrate features: the holistic branch
captures long-range dependencies using a similarity prior, the localized branch
captures local positional relations using a proximity prior, and the score branch
dynamically captures the closeness of predicted scores. Besides, our method also
includes an approximator to meet the needs of online detection. Our method
outperforms other state-of-the-art methods on our released dataset and on other
existing benchmarks. Moreover, extensive experimental results also show the positive
effect of multimodal (audio-visual) input and of modeling relationships among
snippets. The code and dataset will be released at
https://roc-ng.github.io/XD-Violence/.
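To make the three-branch design concrete, here is a minimal PyTorch sketch of a snippet-relation module of the kind the abstract describes, operating on snippet-level audio-visual features of shape (batch, snippets, dim). The layer sizes, the proximity width, and the fusion by summation are illustrative assumptions rather than the authors' exact architecture, and the online-detection approximator is not sketched.

```python
# Minimal sketch of a three-branch snippet-relation module (illustrative only).
import torch
import torch.nn as nn


class RelationBranches(nn.Module):
    def __init__(self, dim: int, sigma: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(dim, dim)   # shared feature projection (assumed)
        self.scorer = nn.Linear(dim, 1)   # snippet-level violence score
        self.sigma = sigma                # width of the proximity prior (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) snippet features, e.g. concatenated audio-visual embeddings
        B, T, D = x.shape
        h = self.proj(x)

        # Holistic branch: similarity prior -> long-range dependencies.
        sim = torch.softmax(x @ x.transpose(1, 2) / D ** 0.5, dim=-1)   # (B, T, T)
        holistic = sim @ h

        # Localized branch: proximity prior -> nearby snippets weigh more.
        idx = torch.arange(T, device=x.device, dtype=x.dtype)
        dist = (idx[None, :] - idx[:, None]).abs()                      # (T, T)
        prox = torch.softmax(-dist / self.sigma, dim=-1).expand(B, T, T)
        localized = prox @ h

        # Score branch: relations driven by closeness of predicted scores.
        s = torch.sigmoid(self.scorer(x))                               # (B, T, 1)
        close = torch.softmax(-(s - s.transpose(1, 2)).abs(), dim=-1)   # (B, T, T)
        scored = close @ h

        fused = holistic + localized + scored                           # integrate branches
        return torch.sigmoid(self.scorer(fused)).squeeze(-1)            # (B, T) snippet scores


# Example usage with random features standing in for real audio-visual snippets:
# scores = RelationBranches(dim=1152)(torch.randn(2, 32, 1152))
```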
Related papers
- Centre Stage: Centricity-based Audio-Visual Temporal Action Detection [26.42447737005981]
We explore strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities.
We propose a novel network head to estimate the closeness of timesteps to the action centre, which we call the centricity score.
arXiv Detail & Related papers (2023-11-28T03:02:00Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New
Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos by cross-modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Glitch in the Matrix: A Large Scale Benchmark for Content Driven
Audio-Visual Forgery Detection and Localization [20.46053083071752]
We propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF).
LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations.
The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture.
arXiv Detail & Related papers (2023-05-03T08:48:45Z) - TempNet: Temporal Attention Towards the Detection of Animal Behaviour in
Videos [63.85815474157357]
We propose an efficient computer vision- and deep learning-based method for the detection of biological behaviours in videos.
TempNet uses an encoder bridge and residual blocks to maintain model performance with a two-staged, spatial, then temporal, encoder.
We demonstrate its application to the detection of sablefish (Anoplopoma fimbria) startle events.
arXiv Detail & Related papers (2022-11-17T23:55:12Z) - Correlation-Aware Deep Tracking [83.51092789908677]
We propose a novel target-dependent feature network inspired by the self-/cross-attention scheme.
Our network deeply embeds cross-image feature correlation in multiple layers of the feature network.
Our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than the existing methods.
arXiv Detail & Related papers (2022-03-03T11:53:54Z) - Learning Spatial-Temporal Graphs for Active Speaker Detection [26.45877018368872]
SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance.
arXiv Detail & Related papers (2021-12-02T18:29:07Z) - Exploring Data Augmentation for Multi-Modality 3D Object Detection [82.9988604088494]
It is counter-intuitive that multi-modality methods based on point clouds and images perform only marginally better, or sometimes worse, than approaches that use point clouds alone.
We propose a pipeline, named transformation flow, to bridge the gap between single and multi-modality data augmentation with transformation reversing and replaying.
Our method also wins the best PKL award in the 3rd nuScenes detection challenge.
arXiv Detail & Related papers (2020-12-23T15:23:16Z) - Unsupervised Learning on Monocular Videos for 3D Human Pose Estimation [121.5383855764944]
We use contrastive self-supervised learning (CSS) to extract rich latent vectors from single-view videos.
We show that applying CSS only to the time-variant features, while also reconstructing the input and encouraging a gradual transition between nearby and away features, yields a rich latent space.
Our approach outperforms other unsupervised single-view methods and matches the performance of multi-view techniques.
arXiv Detail & Related papers (2020-12-02T20:27:35Z) - Self-supervised Video Representation Learning by Uncovering
Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-08-31T08:31:56Z) - Look and Listen: A Multi-modality Late Fusion Approach to Scene
Classification for Autonomous Machines [5.452798072984612]
The novelty of this study consists in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion.
The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects.
We show that situations where a single modality may be confused by anomalous data points are corrected through an emerging higher-order integration.
arXiv Detail & Related papers (2020-07-11T16:47:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.