FTFDNet: Learning to Detect Talking Face Video Manipulation with
Tri-Modality Interaction
- URL: http://arxiv.org/abs/2307.03990v1
- Date: Sat, 8 Jul 2023 14:45:16 GMT
- Title: FTFDNet: Learning to Detect Talking Face Video Manipulation with
Tri-Modality Interaction
- Authors: Ganglai Wang, Peng Zhang, Junwen Xiong, Feihan Yang, Wei Huang, and
Yufei Zha
- Abstract summary: The optical flow of a fake talking face video is disordered, especially in the lip region.
A novel audio-visual attention mechanism (AVAM) is proposed to discover more informative features.
The proposed FTFDNet is able to achieve a better detection performance than other state-of-the-art DeepFake video detection methods.
- Score: 9.780101247514366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: DeepFake-based digital facial forgery threatens public media security,
especially when lip manipulation is used in talking face generation, which makes
fake video detection even harder. Because only the lip shape is changed to match
the given speech, identity-related facial features are difficult to discriminate
in such fake talking face videos. Combined with the common neglect of the audio
stream as prior knowledge, detection failures on fake talking face videos become
almost inevitable. It is found that the optical flow of a fake talking face video
is disordered, especially in the lip region, while the optical flow of a real
video changes regularly, which means motion features derived from optical flow
are useful for capturing manipulation cues. In this study, a fake talking face
detection network (FTFDNet) is proposed that incorporates visual, audio, and
motion features using an efficient cross-modal fusion (CMF) module. Furthermore,
a novel audio-visual attention mechanism (AVAM) is proposed to discover more
informative features; it can be seamlessly integrated into any audio-visual CNN
architecture as a modular component. With the additional AVAM, the proposed
FTFDNet achieves better detection performance than other state-of-the-art
DeepFake video detection methods, not only on the established fake talking face
detection dataset (FTFDD) but also on the DeepFake video detection datasets
DFDC and DF-TIMIT.
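To make the motion cue concrete, below is a minimal sketch, assuming OpenCV and NumPy, of how optical-flow disorder in the lip region might be quantified: dense flow is computed between consecutive frames, and the spatial variance of flow directions inside a lip bounding box serves as an irregularity score. The `lip_flow_disorder` helper, the Farneback parameters, and the variance statistic are illustrative assumptions, not the paper's actual feature extraction.

```python
# A minimal sketch of the motion cue described in the abstract: optical flow
# in the lip region of a fake talking face is disordered, while real motion
# varies smoothly. Parameters and statistic are illustrative assumptions.
import cv2
import numpy as np

def lip_flow_disorder(frames, lip_box):
    """Mean spatial variance of flow directions inside a lip bounding box.

    frames  -- sequence of grayscale frames (H x W uint8 arrays)
    lip_box -- (x, y, w, h) lip region, e.g. from a face landmark detector
    """
    x, y, w, h = lip_box
    scores = []
    for prev, curr in zip(frames, frames[1:]):
        # Dense Farneback optical flow between consecutive frames.
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        roi = flow[y:y + h, x:x + w]
        # Per-pixel flow direction; disordered (fake) motion spreads widely,
        # so higher variance suggests manipulation in the lip region.
        angles = np.arctan2(roi[..., 1], roi[..., 0])
        scores.append(np.var(angles))
    return float(np.mean(scores))
```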
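The abstract names an efficient cross-modal fusion (CMF) module but does not specify its architecture. The following hedged PyTorch sketch shows one plausible tri-modal fusion step: per-clip visual, audio, and motion features are stacked as tokens and mixed with multi-head attention. The `TriModalFusion` class, the shared feature width, and the residual layout are assumptions, not the authors' design.

```python
# A hedged sketch of tri-modal fusion in the spirit of the CMF module; the
# token layout and attention design are assumptions for illustration.
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, aud, mot):
        # vis / aud / mot: (B, dim) per-clip features from the three streams.
        tokens = torch.stack([vis, aud, mot], dim=1)  # (B, 3, dim)
        mixed, _ = self.attn(tokens, tokens, tokens)  # cross-modal mixing
        fused = self.norm(tokens + mixed)             # residual + layer norm
        return fused.mean(dim=1)                      # (B, dim) fused feature
```

For example, `TriModalFusion(256)(vis, aud, mot)` would return a single fused clip descriptor that a downstream classification head could consume.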
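Likewise, AVAM is described only as an attention module that can be dropped into any audio-visual CNN. The sketch below is a guess at such a module rather than the published design: it gates the channels of a visual feature map using its pooled statistics concatenated with an audio embedding. The `AudioVisualAttention` class and its layer sizes are hypothetical.

```python
# A speculative sketch of an audio-visual channel-attention module in the
# spirit of AVAM; layer sizes and the gating design are assumptions.
import torch
import torch.nn as nn

class AudioVisualAttention(nn.Module):
    def __init__(self, vis_channels: int, audio_dim: int, reduction: int = 8):
        super().__init__()
        hidden = max(vis_channels // reduction, 4)
        # Channel-gate MLP over pooled visual statistics + audio embedding.
        self.mlp = nn.Sequential(
            nn.Linear(vis_channels + audio_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, vis_channels),
            nn.Sigmoid(),
        )

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W) visual feature map; aud: (B, D) audio embedding.
        pooled = vis.mean(dim=(2, 3))                 # (B, C) global avg pool
        gate = self.mlp(torch.cat([pooled, aud], 1))  # (B, C) channel weights
        return vis * gate[:, :, None, None]           # reweighted feature map
```

Because the module returns a tensor of the same shape it receives, it could in principle be inserted after any convolutional stage, which matches the abstract's claim of seamless modular integration.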
Related papers
- Deepfake detection in videos with multiple faces using geometric-fakeness features [79.16635054977068]
Deepfakes of victims or public figures can be used by fraudsters for blackmail, extortion, and financial fraud.
In our research, we propose to use geometric-fakeness features (GFF) that characterize the dynamic degree of a face's presence in a video.
We apply our approach to analyze videos in which multiple faces are present simultaneously.
arXiv Detail & Related papers (2024-10-10T13:10:34Z)
- GRACE: Graph-Regularized Attentive Convolutional Entanglement with Laplacian Smoothing for Robust DeepFake Video Detection [7.591187423217017]
This paper introduces a novel method for robust DeepFake video detection based on a graph convolutional network with graph Laplacian smoothing.
The proposed method delivers state-of-the-art performance in DeepFake video detection under noisy face sequences.
arXiv Detail & Related papers (2024-06-28T14:17:16Z)
- Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes [13.042731289687918]
We present SWAN-DF, the first realistic audio-visual deepfake database in which lips and speech are well synchronized.
We demonstrate the vulnerability of a state-of-the-art speaker recognition system, the ECAPA-TDNN-based model from SpeechBrain.
arXiv Detail & Related papers (2023-11-29T14:18:04Z)
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- Mover: Mask and Recovery based Facial Part Consistency Aware Method for Deepfake Video Detection [33.29744034340998]
Mover is a new Deepfake detection model that exploits unspecific facial part inconsistencies.
We propose a novel model with dual networks that utilize the pretrained encoder and masked autoencoder.
Our experiments on standard benchmarks demonstrate that Mover is highly effective.
arXiv Detail & Related papers (2023-03-03T06:57:22Z)
- Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
- An Audio-Visual Attention Based Multimodal Network for Fake Talking Face Videos Detection [45.210105822471256]
FTFDNet is proposed, incorporating audio and visual representations to achieve more accurate detection of fake talking face videos.
Evaluation of the proposed work shows excellent performance on fake talking face video detection, with a detection rate above 97%.
arXiv Detail & Related papers (2022-03-10T06:16:11Z)
- Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion [82.06128362686445]
We propose a multi-modal semantic forensic approach to handle both cheapfakes and visually persuasive deepfakes.
We leverage the idea of attribution to learn person-specific biometric patterns that distinguish a given speaker from others.
Unlike existing person-specific approaches, our method is also effective against attacks that focus on lip manipulation.
arXiv Detail & Related papers (2021-12-21T01:57:04Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)