Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal
and Multimodal Detectors
- URL: http://arxiv.org/abs/2109.02993v1
- Date: Tue, 7 Sep 2021 11:00:20 GMT
- Title: Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal
and Multimodal Detectors
- Authors: Hasam Khalid and Minha Kim and Shahroz Tariq and Simon S. Woo
- Abstract summary: Deepfakes can cause security and privacy issues.
A new domain of cloning human voices using deep-learning technologies is also emerging.
To develop a good deepfake detector, we need a detector that can detect deepfakes of multiple modalities.
- Score: 18.862258543488355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant advancements made in the generation of deepfakes have caused
security and privacy issues. Attackers can easily impersonate a person's
identity in an image by replacing their face with the target person's face.
Moreover, a new domain of cloning human voices using deep-learning technologies
is also emerging. Now, an attacker can generate realistic cloned voices of
humans using only a few seconds of audio of the target person. With the
emerging threat of potential harm deepfakes can cause, researchers have
proposed deepfake detection methods. However, they only focus on detecting a
single modality, i.e., either video or audio. On the other hand, to develop a
good deepfake detector that can cope with the recent advancements in deepfake
generation, we need to have a detector that can detect deepfakes of multiple
modalities, i.e., both video and audio. To build such a detector, we need a
dataset that contains video and respective audio deepfakes. We found a
recent deepfake dataset, the Audio-Video Multimodal Deepfake Detection
Dataset (FakeAVCeleb), that contains not only deepfake videos but synthesized
fake audios as well. We used this multimodal deepfake dataset and performed
detailed baseline experiments using state-of-the-art unimodal, ensemble-based,
and multimodal detection methods to evaluate it. We conclude through detailed
experimentation that unimodal detectors, which address only a single modality
(video or audio), do not perform well compared to ensemble-based methods,
while purely multimodal baselines perform worst of all.
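The ensemble-based approach the abstract favors typically combines per-clip scores from independent unimodal detectors. A minimal late-fusion sketch follows; the function names, weights, and threshold here are illustrative assumptions, not the paper's actual configuration:

```python
# Late-fusion ensemble sketch: combine per-clip fake scores from a
# hypothetical video detector and audio detector. Scores are assumed
# to be probabilities in [0, 1] that the clip is fake.

def fuse_scores(video_score: float, audio_score: float,
                w_video: float = 0.6, w_audio: float = 0.4) -> float:
    """Weighted average of unimodal fake probabilities (weights assumed)."""
    return w_video * video_score + w_audio * audio_score

def classify(video_score: float, audio_score: float,
             threshold: float = 0.5) -> str:
    """Label a clip by thresholding the fused score."""
    return "fake" if fuse_scores(video_score, audio_score) >= threshold else "real"

# A clip with cloned audio but pristine video: a video-only detector
# would score it low, yet the fused score can still flag it as fake.
print(classify(0.30, 0.95))  # prints "fake" (fused score 0.56 >= 0.5)
```

This illustrates why a single-modality detector fails on a clip whose manipulation lives in the other modality, which is the gap the multimodal FakeAVCeleb dataset is designed to expose.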
Related papers
- Deepfake detection in videos with multiple faces using geometric-fakeness features [79.16635054977068]
Deepfakes of victims or public figures can be used by fraudsters for blackmail, extortion and financial fraud.
In our research we propose to use geometric-fakeness features (GFF) that characterize the dynamic degree of a face's presence in a video.
We apply our approach to videos in which multiple faces are present simultaneously.
arXiv Detail & Related papers (2024-10-10T13:10:34Z)
- Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes [13.042731289687918]
We present the first realistic audio-visual database of deepfakes SWAN-DF, where lips and speech are well synchronized.
We demonstrate the vulnerability of state-of-the-art speaker recognition systems, such as the ECAPA-TDNN-based model from SpeechBrain.
arXiv Detail & Related papers (2023-11-29T14:18:04Z)
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection [4.659427498118277]
A new kind of deepfake has emerged in which either the audio or the visual modality is manipulated.
Existing multimodal deepfake detectors are often based on the fusion of the audio and visual streams from the video.
In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection.
arXiv Detail & Related papers (2023-10-03T17:43:24Z)
- FakeOut: Leveraging Out-of-domain Self-supervision for Multi-modal Video Deepfake Detection [10.36919027402249]
Synthetic videos of speaking humans can be used to spread misinformation in a convincing manner.
FakeOut is a novel approach that relies on multi-modal data throughout both the pre-training phase and the adaptation phase.
Our method achieves state-of-the-art results in cross-dataset generalization on audio-visual datasets.
arXiv Detail & Related papers (2022-12-01T18:56:31Z)
- Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
- Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches contribute to exploring the specific artifacts in deepfake videos.
We propose to perform the deepfake detection from an unexplored voice-face matching view.
Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z)
- FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset [21.199288324085444]
Recently, a new problem has emerged: generating a cloned or synthesized version of a person's voice.
With the threat of impersonation attacks using deepfake video and audio, new deepfake detectors are needed that focus on both modalities.
We propose a novel Audio-Video Deepfake dataset (FakeAVCeleb) that not only contains deepfake videos but respective synthesized cloned audios as well.
arXiv Detail & Related papers (2021-08-11T07:49:36Z)
- WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection [82.42495493102805]
We introduce a new dataset WildDeepfake which consists of 7,314 face sequences extracted from 707 deepfake videos collected completely from the internet.
We conduct a systematic evaluation of a set of baseline detection networks on both existing and our WildDeepfake datasets, and show that WildDeepfake is indeed a more challenging dataset, where the detection performance can decrease drastically.
arXiv Detail & Related papers (2021-01-05T11:10:32Z)
- Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content.
We extract and analyze the similarity between the two audio and visual modalities from within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
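The per-video AUC reported in the entry above is usually computed by aggregating frame-level detector scores into one score per video and then ranking videos. A minimal sketch, assuming mean-pooling for aggregation and the standard rank-based (Mann-Whitney) AUC formula; these are common choices, not necessarily the paper's exact procedure:

```python
# Per-video AUC sketch: mean-pool frame-level fake scores, then compute
# ROC AUC via the rank-based (Mann-Whitney U) formulation.

def video_score(frame_scores):
    """Aggregate frame-level fake probabilities into one per-video score."""
    return sum(frame_scores) / len(frame_scores)

def roc_auc(scores, labels):
    """AUC = P(score of a random fake video > score of a random real one)."""
    fakes = [s for s, y in zip(scores, labels) if y == 1]
    reals = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if f > r else 0.5 if f == r else 0.0
               for f in fakes for r in reals)
    return wins / (len(fakes) * len(reals))

# Toy example: four videos, each a list of frame-level fake scores.
videos = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.6, 0.55, 0.5], [0.4, 0.3, 0.2]]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = real
scores = [video_score(v) for v in videos]
print(roc_auc(scores, labels))  # prints 1.0: every fake outranks every real
```

Mean-pooling before ranking is what makes the metric "per-video" rather than per-frame: a few noisy frames cannot flip a video's label on their own.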
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.