An Audio-Visual Attention Based Multimodal Network for Fake Talking Face
Videos Detection
- URL: http://arxiv.org/abs/2203.05178v1
- Date: Thu, 10 Mar 2022 06:16:11 GMT
- Title: An Audio-Visual Attention Based Multimodal Network for Fake Talking Face
Videos Detection
- Authors: Ganglai Wang, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha and Yanning
Zhang
- Abstract summary: FTFDNet is proposed, incorporating audio and visual representations to detect fake talking face videos more accurately.
The evaluation of the proposed work shows excellent performance on the detection of fake talking face videos, reaching a detection rate above 97%.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DeepFake-based digital facial forgery threatens public media
security, and detection becomes even harder when lip manipulation is used in
talking face generation. Because only the lip shape is changed to match the
given speech, identity-related facial features are difficult to discriminate
in such fake talking face videos. Combined with the lack of attention to the
audio stream as prior knowledge, detection failure on fake talking face
generation becomes almost inevitable. Inspired by the decision-making
mechanism of the human multisensory perception system, in which auditory
information enhances post-sensory visual evidence for informed decisions,
this study proposes a fake talking face detection framework, FTFDNet, which
incorporates both audio and visual representations to achieve more accurate
detection of fake talking face videos. Furthermore, an audio-visual attention
mechanism (AVAM) is proposed to discover more informative features; it can be
seamlessly integrated into any audio-visual CNN architecture as a modular
component. With the additional AVAM, the proposed FTFDNet achieves better
detection performance on the established dataset (FTFDD). Evaluation shows
excellent performance on the detection of fake talking face videos, reaching
a detection rate above 97%.
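To make the AVAM idea more concrete, below is a minimal PyTorch sketch of what such a modular audio-visual attention block might look like. The abstract does not specify the mechanism's internals, so the design here (a channel-attention gate driven jointly by pooled visual features and an audio embedding) and all names such as `AVAMBlock` are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn

class AVAMBlock(nn.Module):
    """Illustrative audio-visual attention block (not the paper's exact AVAM).

    Re-weights visual feature channels with a gate computed from both the
    visual features and an audio embedding, so auditory information can
    modulate visual evidence, in the spirit the abstract describes.
    """

    def __init__(self, visual_channels: int, audio_dim: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze spatial dims of the visual map
        self.gate = nn.Sequential(
            nn.Linear(visual_channels + audio_dim, visual_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(visual_channels // reduction, visual_channels),
            nn.Sigmoid(),  # per-channel attention weights in (0, 1)
        )

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) feature map; audio: (B, D) embedding
        squeezed = self.pool(visual).flatten(1)                   # (B, C)
        weights = self.gate(torch.cat([squeezed, audio], dim=1))  # (B, C)
        return visual * weights.unsqueeze(-1).unsqueeze(-1)       # re-weighted map

# Drop-in usage inside any audio-visual CNN (shapes are assumptions):
# avam = AVAMBlock(visual_channels=256, audio_dim=128)
# refined = avam(visual_feat, audio_feat)
```
Because the block takes a feature map and an embedding and returns a map of the same shape, it can be inserted between existing CNN stages, which matches the abstract's claim that AVAM integrates into audio-visual CNN architectures by modularization.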
Related papers
- Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes
We present SWAN-DF, the first realistic audio-visual deepfake database in which lips and speech are well synchronized.
We demonstrate the vulnerability of a state-of-the-art speaker recognition system, the ECAPA-TDNN-based model from SpeechBrain.
arXiv Detail & Related papers (2023-11-29T14:18:04Z)
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- FTFDNet: Learning to Detect Talking Face Video Manipulation with Tri-Modality Interaction
The optical flow of a fake talking face video is disordered, especially in the lip region.
A novel audio-visual attention mechanism (AVAM) is proposed to discover more informative features.
The proposed FTFDNet achieves better detection performance than other state-of-the-art DeepFake video detection methods.
arXiv Detail & Related papers (2023-07-08T14:45:16Z)
- Audio-Visual Person-of-Interest DeepFake Detection
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity (see the contrastive-loss sketch after this list).
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
- Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion
We propose a multi-modal semantic forensic approach to handle both cheapfakes and visually persuasive deepfakes.
We leverage the idea of attribution to learn person-specific biometric patterns that distinguish a given speaker from others.
Unlike existing person-specific approaches, our method is also effective against attacks that focus on lip manipulation.
arXiv Detail & Related papers (2021-12-21T01:57:04Z)
- Speech2Video: Cross-Modal Distillation for Speech to Video Generation
Speech-to-video generation can spark interesting applications in the entertainment, customer service, and human-computer-interaction industries.
The challenge mainly lies in disentangling the distinct visual attributes from audio signals.
We propose a lightweight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs.
arXiv Detail & Related papers (2021-07-10T10:27:26Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio, and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- VideoForensicsHQ: Detecting High-quality Manipulated Face Videos
We show how the performance of forgery detectors depends on the presence of artefacts that the human eye can see.
We introduce a new benchmark dataset for face video forgery detection of unprecedented quality.
arXiv Detail & Related papers (2020-05-20T21:17:43Z)
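As a companion to the Person-of-Interest entry above, the following is a minimal sketch of an InfoNCE-style contrastive loss over paired face-motion and audio embeddings, one common way to learn identity-discriminative audio-visual embeddings. The function name, temperature, and symmetric formulation are illustrative assumptions; the cited paper's exact loss and training protocol may differ.
```python
import torch
import torch.nn.functional as F

def info_nce(face_emb: torch.Tensor, audio_emb: torch.Tensor,
             tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss over a batch of paired embeddings.

    Pulls together face-motion and audio embeddings from the same (real)
    segment and pushes apart mismatched pairs, so the learned space is
    discriminative per identity.
    """
    face = F.normalize(face_emb, dim=1)    # (B, D) unit vectors
    audio = F.normalize(audio_emb, dim=1)  # (B, D)
    logits = face @ audio.t() / tau        # (B, B) cosine similarities
    targets = torch.arange(face.size(0), device=face.device)  # diagonal = positives
    # symmetric loss: face -> audio and audio -> face directions
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```
At test time, a suspect video whose embeddings fall far from the claimed identity's reference embeddings would be flagged as fake; this usage note describes the general paradigm, not the cited paper's specific protocol.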
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.