Learning Weakly Supervised Audio-Visual Violence Detection in Hyperbolic Space
- URL: http://arxiv.org/abs/2305.18797v3
- Date: Tue, 13 Feb 2024 16:00:01 GMT
- Title: Learning Weakly Supervised Audio-Visual Violence Detection in Hyperbolic Space
- Authors: Xiaogang Peng, Hao Wen, Yikai Luo, Xiao Zhou, Keyang Yu, Ping Yang, Zizhao Wu
- Abstract summary: HyperVD is a novel framework that learns snippet embeddings in hyperbolic space to improve model discrimination.
Our framework comprises a detour fusion module for multimodal fusion.
By learning snippet representations in this space, the framework effectively learns semantic discrepancies between violent and normal events.
- Score: 17.30264225835736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the task of weakly supervised audio-visual violence
detection has gained considerable attention. The goal of this task is to
identify violent segments within multimodal data based on video-level labels.
Despite advances in this field, traditional Euclidean neural networks, which
have been used in prior research, encounter difficulties in capturing highly
discriminative representations due to limitations of the feature space. To
overcome this, we propose HyperVD, a novel framework that learns snippet
embeddings in hyperbolic space to improve model discrimination. Our framework
comprises a detour fusion module for multimodal fusion, effectively alleviating
modality inconsistency between audio and visual signals. Additionally, we
contribute two branches of fully hyperbolic graph convolutional networks that
excavate feature similarities and temporal relationships among snippets in
hyperbolic space. By learning snippet representations in this space, the
framework effectively learns semantic discrepancies between violent and normal
events. Extensive experiments on the XD-Violence benchmark demonstrate that our
method outperforms state-of-the-art methods by a sizable margin.
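To make the core idea concrete, the sketch below projects Euclidean snippet features into hyperbolic space and computes geodesic distances between them, which could then define the similarity graph consumed by a hyperbolic GCN branch. Note that HyperVD uses fully hyperbolic graph convolutions (typically formulated in the Lorentz model); the Poincaré-ball operations, the curvature value, and all names below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch: project Euclidean snippet features onto the Poincare ball and
# measure geodesic (hyperbolic) distances between snippets. HyperVD itself uses
# fully hyperbolic GCNs; curvature, dimensions, and names here are assumptions.
import torch

C = 1.0    # curvature magnitude of the Poincare ball (assumed)
EPS = 1e-6

def expmap0(v: torch.Tensor, c: float = C) -> torch.Tensor:
    """Map Euclidean vectors (tangent space at the origin) onto the Poincare ball."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(EPS)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def mobius_add(x: torch.Tensor, y: torch.Tensor, c: float = C) -> torch.Tensor:
    """Mobius addition, the ball's analogue of vector addition."""
    x2 = (x * x).sum(dim=-1, keepdim=True)
    y2 = (y * y).sum(dim=-1, keepdim=True)
    xy = (x * y).sum(dim=-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(EPS)

def hyp_dist(x: torch.Tensor, y: torch.Tensor, c: float = C) -> torch.Tensor:
    """Geodesic distance between two points on the Poincare ball."""
    sqrt_c = c ** 0.5
    diff = mobius_add(-x, y, c).norm(dim=-1).clamp(max=(1 - 1e-5) / sqrt_c)
    return (2.0 / sqrt_c) * torch.atanh(sqrt_c * diff)

# Toy usage: fused audio-visual snippet features (T snippets, D dims) are mapped
# onto the ball; pairwise hyperbolic distances could then drive a snippet
# similarity graph for one of the hyperbolic GCN branches.
feats = torch.randn(32, 128) * 0.1       # stand-in for fused snippet features
ball_feats = expmap0(feats)               # snippet embeddings in hyperbolic space
pairwise = hyp_dist(ball_feats.unsqueeze(1), ball_feats.unsqueeze(0))
print(pairwise.shape)                     # torch.Size([32, 32])
```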
Related papers
- Beyond Euclidean: Dual-Space Representation Learning for Weakly Supervised Video Violence Detection [41.37736889402566]
We develop a novel Dual-Space Representation Learning (DSRL) method for weakly supervised Video Violence Detection (VVD).
Our method captures the visual features of events while also exploring the intrinsic relations between events, thereby enhancing the discriminative capacity of the features.
arXiv Detail & Related papers (2024-09-28T05:54:20Z)
- Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection [9.145305176998447]
Weakly supervised multimodal violence detection (MVD) aims to learn a violence detection model by leveraging multiple modalities (a minimal sketch of this video-level training recipe appears after this list).
We propose a new weakly supervised MVD method that explicitly addresses the challenges of information redundancy, modality imbalance, and modality asynchrony.
Experiments on the largest-scale XD-Violence dataset demonstrate that the proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-05-08T15:27:08Z)
- Double Mixture: Towards Continual Event Detection from Speech [60.33088725100812]
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- Spatial-Frequency Discriminability for Revealing Adversarial Perturbations [53.279716307171604]
The vulnerability of deep neural networks to adversarial perturbations is widely recognized in the computer vision community.
Current algorithms typically detect adversarial patterns through discriminative decomposition of natural and adversarial data.
We propose a discriminative detector relying on a spatial-frequency Krawtchouk decomposition.
arXiv Detail & Related papers (2023-05-18T10:18:59Z)
- Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection [14.779452690026144]
We propose a modality-aware contrastive instance learning with self-distillation (MACIL-SD) strategy for weakly-supervised audio-visual learning.
Our framework outperforms previous methods with lower complexity on the large-scale XD-Violence dataset.
arXiv Detail & Related papers (2022-07-12T12:42:21Z)
- MC-LCR: Multi-modal contrastive classification by locally correlated representations for effective face forgery detection [11.124150983521158]
We propose a novel framework named Multi-modal Contrastive Classification by Locally Correlated Representations (MC-LCR).
Our MC-LCR aims to amplify implicit local discrepancies between authentic and forged faces from both spatial and frequency domains.
We achieve state-of-the-art performance and demonstrate the robustness and generalization of our method.
arXiv Detail & Related papers (2021-10-07T09:24:12Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- DIRV: Dense Interaction Region Voting for End-to-End Human-Object Interaction Detection [53.40028068801092]
We propose a novel one-stage human-object interaction (HOI) detection approach based on a new concept, the interaction region.
Unlike previous methods, our approach concentrates on the densely sampled interaction regions across different scales for each human-object pair.
In order to compensate for the detection flaws of a single interaction region, we introduce a novel voting strategy.
arXiv Detail & Related papers (2020-10-02T13:57:58Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
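HyperVD and several of the violence-detection entries above share the weakly supervised setting described in the abstract: only video-level labels are available, yet snippet-level predictions are required. A common way to realize this, though the exact objectives differ across the listed papers, is multiple-instance learning with top-k pooling over snippet scores; the sketch below is only an illustration of that general recipe, and the function names and the choice of k are assumptions.

```python
# Minimal sketch of a multiple-instance learning (MIL) objective with top-k
# pooling for weak (video-level) supervision. The listed papers each use their
# own losses; this only illustrates the general recipe, and `k` is arbitrary.
import torch
import torch.nn.functional as F

def video_level_loss(snippet_logits: torch.Tensor,
                     video_label: torch.Tensor,
                     k: int = 4) -> torch.Tensor:
    """snippet_logits: (T,) raw violence scores for the T snippets of one video.
    video_label: scalar tensor, 1.0 if the video contains violence, else 0.0."""
    k = min(k, snippet_logits.numel())
    topk = snippet_logits.topk(k).values.mean()   # pool the most confident snippets
    return F.binary_cross_entropy_with_logits(topk.unsqueeze(0),
                                              video_label.unsqueeze(0))

# Toy usage with random scores for a "violent" video.
logits = torch.randn(64, requires_grad=True)
loss = video_level_loss(logits, torch.tensor(1.0))
loss.backward()
print(float(loss))
```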
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.