Not made for each other- Audio-Visual Dissonance-based Deepfake
Detection and Localization
- URL: http://arxiv.org/abs/2005.14405v3
- Date: Sat, 20 Mar 2021 15:09:49 GMT
- Title: Not made for each other- Audio-Visual Dissonance-based Deepfake
Detection and Localization
- Authors: Komal Chugh, Parul Gupta, Abhinav Dhall and Ramanathan Subramanian
- Abstract summary: We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS).
MDS is computed as an aggregate of dissimilarity scores between audio and visual segments in a video.
Our approach outperforms the state-of-the-art by up to 7%.
- Score: 7.436429318051601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose detection of deepfake videos based on the dissimilarity between
the audio and visual modalities, termed the Modality Dissonance Score (MDS).
We hypothesize that manipulation of either modality will lead to disharmony
between the two modalities, e.g., loss of lip-sync or unnatural facial and lip
movements. MDS is computed as an aggregate of dissimilarity scores between
audio and visual segments in a video. Discriminative features are learnt for
the audio and visual channels in a chunk-wise manner, employing the
cross-entropy loss for individual modalities, and a contrastive loss that
models inter-modality similarity. Extensive experiments on the DFDC and
DeepFake-TIMIT Datasets show that our approach outperforms the state-of-the-art
by up to 7%. We also demonstrate temporal forgery localization, and show how
our technique identifies the manipulated video segments.
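As a rough illustration of the MDS idea described above (per-chunk audio and visual embeddings trained with a contrastive loss, whose test-time distances are aggregated into a single dissonance score), a minimal sketch follows. The margin-based contrastive form, the feature dimension, and the mean aggregation are assumptions for illustration; the per-modality cross-entropy term and the paper's actual encoder architectures are omitted.

```python
# Minimal sketch of the Modality Dissonance Score (MDS) idea, not the authors'
# exact implementation: all dimensions and the margin value are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, is_real, margin=1.0):
    """Pull audio/visual chunk embeddings together for real videos,
    push them at least `margin` apart for fakes (Hadsell-style contrastive loss)."""
    d = F.pairwise_distance(audio_emb, video_emb)                        # per-chunk L2 distance
    loss_real = is_real * d.pow(2)                                       # real: minimize distance
    loss_fake = (1.0 - is_real) * torch.clamp(margin - d, min=0).pow(2)  # fake: enforce margin
    return (loss_real + loss_fake).mean()

def modality_dissonance_score(audio_emb, video_emb):
    """Aggregate per-chunk audio-visual dissimilarity into a video-level score.
    Larger values indicate stronger audio-visual dissonance (more likely fake)."""
    per_chunk = F.pairwise_distance(audio_emb, video_emb)  # one score per chunk
    return per_chunk.mean(), per_chunk

# Toy usage: random "embeddings" for a 10-chunk video
audio_emb = torch.randn(10, 128)
video_emb = torch.randn(10, 128)
score, per_chunk = modality_dissonance_score(audio_emb, video_emb)
# Thresholding `per_chunk` over time gives the kind of temporal forgery
# localization mentioned in the abstract.
```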
Related papers
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video
Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing [23.85763377992709]
We propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which can learn fine-grained features by applying a segment-based attention module.
We show that our model offers improved parsing performance on the Look, Listen, and Parse dataset.
arXiv Detail & Related papers (2023-10-11T14:15:25Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z)
- Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection [14.779452690026144]
We propose a modality-aware contrastive instance learning with self-distillation (MACIL-SD) strategy for weakly-supervised audio-visual learning.
Our framework outperforms previous methods with lower complexity on the large-scale XD-Violence dataset.
arXiv Detail & Related papers (2022-07-12T12:42:21Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
At each iteration, pseudo-labels are used to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation approach in which the network learns both what individual objects look like and how they sound.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z)
- Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech from a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the audio given the video.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)
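The last entry above pairs a recurrent video encoder with a variational generative model over the audio. A minimal sketch of that general pattern (a conditional VAE predicting an audio frame from silent-video features) is given below; every module name, layer size, and feature dimension here is an assumption for illustration, not the paper's architecture.

```python
# Illustrative conditional VAE for audio-from-silent-video; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechCVAE(nn.Module):
    def __init__(self, video_dim=512, audio_dim=80, latent_dim=64, hidden=256):
        super().__init__()
        self.video_rnn = nn.GRU(video_dim, hidden, batch_first=True)  # summarize video frames
        self.enc = nn.Linear(hidden + audio_dim, 2 * latent_dim)      # q(z | video, audio)
        self.dec = nn.Sequential(nn.Linear(hidden + latent_dim, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, audio_dim))        # p(audio | video, z)

    def forward(self, video_feats, audio_frame):
        _, h = self.video_rnn(video_feats)                            # h: (1, batch, hidden)
        ctx = h[-1]                                                   # per-clip video context
        mu, logvar = self.enc(torch.cat([ctx, audio_frame], -1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()          # reparameterization trick
        recon = self.dec(torch.cat([ctx, z], -1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon, kl

# Toy usage: 16 frames of 512-d video features predict one 80-d mel frame
model = SpeechCVAE()
target = torch.randn(4, 80)
recon, kl = model(torch.randn(4, 16, 512), target)
loss = F.mse_loss(recon, target) + 0.1 * kl  # reconstruction + weighted KL
```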
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.