Audio-Visual Person-of-Interest DeepFake Detection
- URL: http://arxiv.org/abs/2204.03083v3
- Date: Thu, 18 May 2023 06:56:42 GMT
- Title: Audio-Visual Person-of-Interest DeepFake Detection
- Authors: Davide Cozzolino, Alessandro Pianese, Matthias Nießner, Luisa Verdoliva
- Abstract summary: The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
- Score: 77.04789677645682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Face manipulation technology is advancing very rapidly, and new methods are
being proposed day by day. The aim of this work is to propose a deepfake
detector that can cope with the wide variety of manipulation methods and
scenarios encountered in the real world. Our key insight is that each person
has specific characteristics that a synthetic generator likely cannot
reproduce. Accordingly, we extract audio-visual features which characterize the
identity of a person, and use them to create a person-of-interest (POI)
deepfake detector. We leverage a contrastive learning paradigm to learn the
moving-face and audio segment embeddings that are most discriminative for each
identity. As a result, when the video and/or audio of a person is manipulated,
its representation in the embedding space becomes inconsistent with the real
identity, allowing reliable detection. Training is carried out exclusively on
real talking-face video; thus, the detector does not depend on any specific
manipulation method and yields the highest generalization ability. In addition,
our method can detect both single-modality (audio-only, video-only) and
multi-modality (audio-video) attacks, and is robust to low-quality or corrupted
videos. Experiments on a wide variety of datasets confirm that our method
achieves SOTA performance, especially on low-quality videos. Code is publicly
available online at https://github.com/grip-unina/poi-forensics.
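The detection rule sketched in the abstract can be illustrated with a minimal numerical example. This is a hypothetical sketch of the general idea, not the authors' poi-forensics code: given contrastive embeddings of real reference segments for a person of interest, a test segment is flagged as fake when its embedding is too far from the reference set.

```python
# Hypothetical sketch of POI-style detection via embedding-space consistency.
# Names (poi_score, is_fake) and the nearest-neighbor distance rule are
# illustrative assumptions; the actual method is in the poi-forensics repo.
import numpy as np

def poi_score(reference_embeddings: np.ndarray, test_embedding: np.ndarray) -> float:
    """Distance from the test embedding to its nearest real reference embedding."""
    diffs = reference_embeddings - test_embedding   # shape (N, D)
    dists = np.linalg.norm(diffs, axis=1)           # Euclidean distance per reference
    return float(dists.min())

def is_fake(reference_embeddings: np.ndarray, test_embedding: np.ndarray,
            threshold: float) -> bool:
    # Large distance => inconsistent with the real identity => likely manipulated.
    return poi_score(reference_embeddings, test_embedding) > threshold

# Toy usage: real references cluster tightly; a far-away test point is flagged.
rng = np.random.default_rng(0)
refs = rng.normal(0.0, 0.1, size=(16, 8))       # embeddings from real videos
real_like = rng.normal(0.0, 0.1, size=8)        # consistent with the identity
fake_like = np.ones(8)                          # far from the reference cluster
```

Because training uses only real videos, the threshold can be calibrated on held-out real segments of the same identity, without seeing any manipulation method.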
Related papers
- Deepfake detection in videos with multiple faces using geometric-fakeness features [79.16635054977068]
Deepfakes of victims or public figures can be used by fraudsters for blackmail, extortion, and financial fraud.
In our research we propose geometric-fakeness features (GFF) that characterize the dynamic degree of a face's presence in a video.
We employ our approach to analyze videos with multiple faces that are simultaneously present in a video.
arXiv Detail & Related papers (2024-10-10T13:10:34Z)
- SafeEar: Content Privacy-Preserving Audio Deepfake Detection [17.859275594843965]
We propose SafeEar, a novel framework that aims to detect deepfake audios without relying on accessing the speech content within.
Our key idea is a novel decoupling model that separates the semantic and acoustic information in audio samples.
In this way, no semantic content will be exposed to the detector.
arXiv Detail & Related papers (2024-09-14T02:45:09Z)
- Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes [13.042731289687918]
We present SWAN-DF, the first realistic audio-visual deepfake database in which lips and speech are well synchronized.
We demonstrate the vulnerability of state-of-the-art speaker recognition systems, such as the ECAPA-TDNN-based model from SpeechBrain.
arXiv Detail & Related papers (2023-11-29T14:18:04Z)
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual modality or only the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- FakeOut: Leveraging Out-of-domain Self-supervision for Multi-modal Video Deepfake Detection [10.36919027402249]
Synthetic videos of speaking humans can be used to spread misinformation in a convincing manner.
FakeOut is a novel approach that relies on multi-modal data throughout both the pre-training and adaptation phases.
Our method achieves state-of-the-art results in cross-dataset generalization on audio-visual datasets.
arXiv Detail & Related papers (2022-12-01T18:56:31Z)
- Video Manipulations Beyond Faces: A Dataset with Human-Machine Analysis [60.13902294276283]
We present VideoSham, a dataset consisting of 826 videos (413 real and 413 manipulated).
Many of the existing deepfake datasets focus exclusively on two types of facial manipulations -- swapping with a different subject's face or altering the existing face.
Our analysis shows that state-of-the-art manipulation detection algorithms only work for a few specific attacks and do not scale well on VideoSham.
arXiv Detail & Related papers (2022-07-26T17:39:04Z)
- Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion [82.06128362686445]
We propose a multi-modal semantic forensic approach to handle both cheapfakes and visually persuasive deepfakes.
We leverage the idea of attribution to learn person-specific biometric patterns that distinguish a given speaker from others.
Unlike existing person-specific approaches, our method is also effective against attacks that focus on lip manipulation.
arXiv Detail & Related papers (2021-12-21T01:57:04Z)
- Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors [18.862258543488355]
Deepfakes can cause security and privacy issues.
A new domain of cloning human voices using deep-learning technologies is also emerging.
To be effective, a deepfake detector must therefore handle deepfakes across multiple modalities.
arXiv Detail & Related papers (2021-09-07T11:00:20Z)
- ID-Reveal: Identity-aware DeepFake Video Detection [24.79483180234883]
ID-Reveal is a new approach that learns temporal facial features specific to how a person moves while talking.
We do not need any training data of fakes, but only train on real videos.
We obtain an average improvement of more than 15% in accuracy for facial reenactment on highly compressed videos.
arXiv Detail & Related papers (2020-12-04T10:43:16Z)
- Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content.
We extract and analyze the similarity between the audio and visual modalities within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
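The modality-agreement idea behind "Emotions Don't Lie" can be sketched as a simple similarity check. This is a hedged illustration of the general concept, not the paper's actual method: the feature extractors, the cosine-similarity comparison, and the threshold are all assumptions made for the example.

```python
# Illustrative sketch: real videos are expected to show consistent cues across
# audio and visual feature vectors; low cross-modal similarity suggests
# manipulation. The threshold value here is an arbitrary placeholder.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def modalities_agree(audio_feat: np.ndarray, visual_feat: np.ndarray,
                     threshold: float = 0.5) -> bool:
    # Agreement above the threshold is treated as consistent (likely real).
    return cosine_similarity(audio_feat, visual_feat) >= threshold
```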
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.