Vulnerability of Automatic Identity Recognition to Audio-Visual
Deepfakes
- URL: http://arxiv.org/abs/2311.17655v1
- Date: Wed, 29 Nov 2023 14:18:04 GMT
- Title: Vulnerability of Automatic Identity Recognition to Audio-Visual
Deepfakes
- Authors: Pavel Korshunov, Haolin Chen, Philip N. Garner, and Sebastien Marcel
- Abstract summary: We present SWAN-DF, the first realistic audio-visual database of deepfakes, where lips and speech are well synchronized.
We demonstrate the vulnerability of a state-of-the-art speaker recognition system, the ECAPA-TDNN-based model from SpeechBrain.
- Score: 13.042731289687918
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of deepfake detection is far from being solved by speech or
vision researchers. Several publicly available databases of fake synthetic video
and speech have been built to aid the development of detection methods. However,
existing databases typically focus on the visual or the voice modality and
provide no proof that their deepfakes can in fact impersonate any real person.
In this paper, we present SWAN-DF, the first realistic audio-visual database of
deepfakes, where lips and speech are well synchronized and the videos have high
visual and audio quality. We took the publicly available SWAN dataset of real
videos with different identities and created audio-visual deepfakes using
several models from DeepFaceLab and blending techniques for face swapping, and
the HiFiVC, DiffVC, YourTTS, and FreeVC models for voice conversion. From the
publicly available speech dataset LibriTTS, we also created LibriTTS-DF, a
separate database of audio-only deepfakes, using several recent text-to-speech
methods: YourTTS, AdaSpeech, and TorToiSe. We demonstrate the vulnerability of a
state-of-the-art speaker recognition system, the ECAPA-TDNN-based model from
SpeechBrain, to the synthetic voices. Similarly, we tested a face recognition
system based on the MobileFaceNet architecture against several variants of our
visual deepfakes. The vulnerability assessment shows that, by tuning existing
pretrained deepfake models to specific identities, one can successfully spoof
the face and speaker recognition systems more than 90% of the time and produce
a very realistic-looking and realistic-sounding fake video of a given person.
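To make the kind of vulnerability check described in the abstract concrete, the sketch below scores a bona fide enrollment recording against a synthetic (voice-converted or text-to-speech) probe with SpeechBrain's pretrained ECAPA-TDNN speaker verifier. The file names are placeholders, and this is only a minimal illustration of such a check, not the authors' exact evaluation protocol.

```python
# Minimal sketch: speaker-verification score between a bona fide enrollment
# recording and a synthetic probe, using SpeechBrain's pretrained ECAPA-TDNN
# verifier. File names are illustrative placeholders.
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Cosine similarity between the two utterances' ECAPA-TDNN embeddings,
# plus a boolean accept/reject decision at the model's default threshold.
score, accepted = verifier.verify_files(
    "enrollment_bonafide.wav",  # genuine recording of the target identity
    "probe_deepfake.wav",       # voice-converted / TTS sample imitating them
)

# If a large fraction of deepfake probes are accepted, the speaker
# recognition system is considered vulnerable to the attack.
print(f"similarity = {float(score):.3f}, accepted = {bool(accepted)}")
```

An analogous check on the visual side would compare embeddings from a MobileFaceNet-style face recognition model for a reference image and frames from the face-swapped video, again using a cosine-similarity threshold.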
Related papers
- Deepfake detection in videos with multiple faces using geometric-fakeness features [79.16635054977068]
Deepfakes of victims or public figures can be used by fraudsters for blackmail, extortion, and financial fraud.
In our research we propose to use geometric-fakeness features (GFF) that characterize the dynamic degree of face presence in a video.
We employ our approach to analyze videos in which multiple faces are present simultaneously.
arXiv Detail & Related papers (2024-10-10T13:10:34Z)
- SafeEar: Content Privacy-Preserving Audio Deepfake Detection [17.859275594843965]
We propose SafeEar, a novel framework that aims to detect deepfake audio without requiring access to the speech content within.
Our key idea is to devise a neural audio codec as a decoupling model that separates the semantic and acoustic information in audio samples.
In this way, no semantic content will be exposed to the detector.
arXiv Detail & Related papers (2024-09-14T02:45:09Z)
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- FTFDNet: Learning to Detect Talking Face Video Manipulation with Tri-Modality Interaction [9.780101247514366]
The optical flow of fake talking-face videos is disordered, especially in the lip region.
A novel audio-visual attention mechanism (AVAM) is proposed to discover more informative features.
The proposed FTFDNet achieves better detection performance than other state-of-the-art deepfake video detection methods.
arXiv Detail & Related papers (2023-07-08T14:45:16Z)
- Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
- An Audio-Visual Attention Based Multimodal Network for Fake Talking Face Videos Detection [45.210105822471256]
FTFDNet is proposed by incorporating audio and visual representations to achieve more accurate detection of fake talking-face videos.
Evaluation of the proposed work shows excellent performance on the detection of fake talking-face videos, with a detection rate above 97%.
arXiv Detail & Related papers (2022-03-10T06:16:11Z)
- Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors [18.862258543488355]
Deepfakes can cause security and privacy issues.
A new domain of cloning human voices using deep-learning technologies is also emerging.
To develop a good deepfake detector, we need a detector that can detect deepfakes of multiple modalities.
arXiv Detail & Related papers (2021-09-07T11:00:20Z)
- APES: Audiovisual Person Search in Untrimmed Video [87.4124877066541]
We present the Audiovisual Person Search dataset (APES).
APES contains over 1.9K identities labeled along 36 hours of video.
A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity.
arXiv Detail & Related papers (2021-06-03T08:16:42Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
- Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for distinguishing real from deepfake multimedia content.
We extract and analyze the similarity between the audio and visual modalities within the same video; a rough sketch of this cross-modal check follows this entry.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
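As a rough illustration of the cross-modal similarity idea in the entry above, and only under the assumption that per-video audio and visual embeddings live in a shared space, a mismatch check could look like the sketch below. The embedding extractors are hypothetical placeholders, not the authors' actual model.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def flag_mismatch(audio_emb: np.ndarray, visual_emb: np.ndarray,
                  threshold: float = 0.5) -> bool:
    """Flag a video as potentially fake when its audio and visual embeddings
    (assumed to lie in a shared space) disagree; threshold is illustrative."""
    return cosine(audio_emb, visual_emb) < threshold

# Toy usage with random vectors standing in for real per-video embeddings.
rng = np.random.default_rng(0)
audio_emb, visual_emb = rng.normal(size=256), rng.normal(size=256)
print("flagged as fake:", flag_mismatch(audio_emb, visual_emb))
```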