Modality-Aware Contrastive Instance Learning with Self-Distillation for
Weakly-Supervised Audio-Visual Violence Detection
- URL: http://arxiv.org/abs/2207.05500v1
- Date: Tue, 12 Jul 2022 12:42:21 GMT
- Title: Modality-Aware Contrastive Instance Learning with Self-Distillation for
Weakly-Supervised Audio-Visual Violence Detection
- Authors: Jiashuo Yu, Jinyu Liu, Ying Cheng, Rui Feng, Yuejie Zhang
- Abstract summary: We propose a modality-aware contrastive instance learning with self-distillation (MACIL-SD) strategy for weakly-supervised audio-visual learning.
Our framework outperforms previous methods with lower complexity on the large-scale XD-Violence dataset.
- Score: 14.779452690026144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised audio-visual violence detection aims to distinguish
snippets containing multimodal violence events with video-level labels. Many
prior works perform audio-visual integration and interaction in an early or
intermediate manner, yet overlook the modality heterogeneity under the
weakly-supervised setting. In this paper, we analyze the phenomena of modality
asynchrony and undifferentiated instances in the multiple instance learning
(MIL) procedure, and further investigate their negative impact on
weakly-supervised audio-visual learning. To address these issues, we propose a
modality-aware contrastive instance learning with self-distillation (MACIL-SD)
strategy. Specifically, we leverage a lightweight two-stream network to
generate audio and visual bags, in which unimodal background, violent, and
normal instances are clustered into semi-bags in an unsupervised way. Then
audio and visual violent semi-bag representations are assembled as positive
pairs, and violent semi-bags are combined with background and normal instances
in the opposite modality as contrastive negative pairs. Furthermore, a
self-distillation module is applied to transfer unimodal visual knowledge to
the audio-visual model, which alleviates noise and closes the semantic gap
between unimodal and multimodal features. Experiments show that our framework
outperforms previous methods with lower complexity on the large-scale
XD-Violence dataset. Results also demonstrate that our proposed approach can be
used as plug-in modules to enhance other networks. Code is available at
https://github.com/JustinYuu/MACIL_SD.
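For intuition only, below is a minimal PyTorch-style sketch of the two ideas described in the abstract: an InfoNCE-style contrastive term over audio and visual violent semi-bag representations (with background and normal instances from the opposite modality as negatives), plus a self-distillation term that transfers the unimodal visual teacher's predictions to the audio-visual student. Function names, tensor shapes, and the temperature values are illustrative assumptions rather than the released implementation; see the repository above for the authors' code.
```python
import torch
import torch.nn.functional as F

def macil_contrastive_loss(a_violent, v_violent, a_neg, v_neg, temperature=0.1):
    """Sketch of a modality-aware contrastive instance loss (shapes assumed).

    a_violent, v_violent: (B, D) pooled audio / visual violent semi-bag features.
    a_neg: (B, K, D) audio background + normal instances (negatives for v_violent).
    v_neg: (B, K, D) visual background + normal instances (negatives for a_violent).
    """
    a = F.normalize(a_violent, dim=-1)
    v = F.normalize(v_violent, dim=-1)
    a_neg = F.normalize(a_neg, dim=-1)
    v_neg = F.normalize(v_neg, dim=-1)

    # Positive pair: audio violent semi-bag <-> visual violent semi-bag.
    pos = (a * v).sum(-1, keepdim=True) / temperature           # (B, 1)

    # Negatives: violent semi-bag vs. background/normal instances of the *other* modality.
    neg_a = torch.einsum('bd,bkd->bk', a, v_neg) / temperature  # (B, K)
    neg_v = torch.einsum('bd,bkd->bk', v, a_neg) / temperature  # (B, K)

    # InfoNCE: the positive logit should dominate each row.
    loss_a = -F.log_softmax(torch.cat([pos, neg_a], dim=1), dim=1)[:, 0]
    loss_v = -F.log_softmax(torch.cat([pos, neg_v], dim=1), dim=1)[:, 0]
    return 0.5 * (loss_a + loss_v).mean()

def self_distillation_loss(av_logits, visual_teacher_logits, tau=2.0):
    """Sketch of self-distillation: the audio-visual student mimics the
    unimodal visual teacher's (detached) snippet-level predictions."""
    teacher = F.softmax(visual_teacher_logits.detach() / tau, dim=-1)
    student = F.log_softmax(av_logits / tau, dim=-1)
    return F.kl_div(student, teacher, reduction='batchmean') * tau * tau
```
In training, both terms would be added to the video-level MIL classification objective with weighting hyperparameters; the semi-bag clustering and exact weighting follow the paper and released code.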
Related papers
- Unsupervised Audio-Visual Segmentation with Modality Alignment [42.613786372067814]
Audio-Visual Segmentation (AVS) aims to identify, at the pixel level, the object in a visual scene that produces a given sound.
Current AVS methods rely on costly fine-grained annotations of mask-audio pairs, making them impractical for scalability.
We propose an unsupervised learning method, named Modality Correspondence Alignment (MoCA), which seamlessly integrates off-the-shelf foundation models.
arXiv Detail & Related papers (2024-03-21T07:56:09Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in supervised learning fashion.
We introduce unsupervised audio-visual segmentation, requiring no task-specific data annotation or model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Learning Weakly Supervised Audio-Visual Violence Detection in Hyperbolic Space [17.30264225835736]
HyperVD is a novel framework that learns snippet embeddings in hyperbolic space to improve model discrimination.
Our framework comprises a detour fusion module for multimodal fusion.
By learning snippet representations in this space, the framework effectively learns semantic discrepancies between violent and normal events.
arXiv Detail & Related papers (2023-05-30T07:18:56Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead [88.17413955380262]
We introduce a novel architecture for early exiting based on the vision transformer architecture.
We show that our method works for both classification and regression problems.
We also introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis.
arXiv Detail & Related papers (2021-05-19T13:30:34Z)
- Robust Audio-Visual Instance Discrimination [79.74625434659443]
We present a self-supervised learning method to learn audio and video representations.
We address the problems of audio-visual instance discrimination and improve transfer learning performance.
arXiv Detail & Related papers (2021-03-29T19:52:29Z)
- Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing [48.87278703876147]
A new problem, named audio-visual video parsing, aims to parse a video into temporal event segments and label them as audible, visible, or both.
We propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously.
Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels.
arXiv Detail & Related papers (2020-07-21T01:53:31Z)
- Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization [7.436429318051601]
We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS).
MDS is computed as an aggregate of dissimilarity scores between audio and visual segments in a video.
Our approach outperforms the state-of-the-art by up to 7%.
arXiv Detail & Related papers (2020-05-29T06:09:33Z)
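As a rough illustration of that aggregation (a sketch under assumptions, not the authors' implementation, which learns a margin-based distance), per-segment dissimilarity between aligned audio and visual embeddings can be averaged over the video; the encoders, the cosine-based distance stand-in, and the decision threshold below are all hypothetical.
```python
import torch
import torch.nn.functional as F

def modality_dissonance_score(audio_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
    """Sketch: aggregate per-segment audio-visual dissimilarity into one video-level score.

    audio_emb, visual_emb: (T, D) embeddings for T temporally aligned segments,
    produced by some pretrained audio / visual encoders (assumed, not specified here).
    """
    # Per-segment dissimilarity as (1 - cosine similarity), averaged over the video.
    per_segment = 1.0 - F.cosine_similarity(audio_emb, visual_emb, dim=-1)  # (T,)
    return per_segment.mean()

# A video whose aggregate dissonance exceeds a validation-tuned threshold
# (the 0.35 below is purely illustrative) would be flagged as a likely deepfake.
# is_fake = modality_dissonance_score(a, v) > 0.35
```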