NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake
Detection
- URL: http://arxiv.org/abs/2306.06885v1
- Date: Mon, 12 Jun 2023 06:06:05 GMT
- Title: NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake
Detection
- Authors: Yu Chen, Yang Yu, Rongrong Ni, Yao Zhao, Haoliang Li
- Abstract summary: Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to downstream Deepfake datasets via fine-tuning.
- Score: 50.33525966541906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deepfake technologies empowered by deep learning are rapidly evolving,
creating new security concerns for society. Existing multimodal detection
methods usually capture audio-visual inconsistencies to expose Deepfake videos.
More seriously, advanced Deepfake techniques can now calibrate audio and video
in the critical phoneme-viseme regions, achieving a more realistic tampering
effect and posing new challenges. To address this problem, we
propose a novel Deepfake detection method to mine the correlation between
Non-critical Phonemes and Visemes, termed NPVForensics. Firstly, we propose the
Local Feature Aggregation block with Swin Transformer (LFA-ST) to construct
non-critical phoneme-viseme and corresponding facial feature streams
effectively. Secondly, we design a loss function for the fine-grained motion of
the talking face that measures the evolutionary consistency of non-critical
phoneme-viseme pairs. Next, we design a phoneme-viseme awareness module for
cross-modal feature fusion and representation alignment, so that the modality
gap can be reduced and the intrinsic complementarity of the two modalities can
be better explored. Finally, a self-supervised pre-training strategy is
leveraged to thoroughly learn the audio-visual correspondences in natural
videos. In this manner, our model can be easily adapted to downstream Deepfake
datasets via fine-tuning. Extensive experiments on existing
benchmarks demonstrate that the proposed approach outperforms state-of-the-art
methods.
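The abstract names four components but gives no architectural or mathematical details; the sketches below are speculative PyTorch readings of those components, not the authors' implementation. First, the stream construction. The paper's LFA-ST pairs local feature aggregation with a Swin Transformer; the sketch substitutes a plain TransformerEncoder for the Swin stage, so it illustrates only the local-aggregation-plus-attention pattern (class and parameter names are hypothetical):

```python
import torch
import torch.nn as nn

class LocalAggregationStream(nn.Module):
    """Illustrative stand-in for the paper's LFA-ST block (hypothetical).

    A 1D convolution aggregates neighboring-frame context, and a vanilla
    TransformerEncoder models longer-range dependencies. The paper uses a
    Swin Transformer for the second stage; this sketch only shows the
    local-aggregation-plus-attention pattern.
    """

    def __init__(self, dim: int = 256, kernel: int = 3, heads: int = 4):
        super().__init__()
        self.local = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level phoneme or viseme features.
        x = self.local(x.transpose(1, 2)).transpose(1, 2)  # local context
        return self.encoder(x)                             # global context
```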
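Second, the evolutionary-consistency loss. The paper does not give it in closed form; a minimal sketch assuming frame-aligned phoneme and viseme embeddings, approximating "evolutionary consistency" as cosine agreement between the temporal first differences (frame-to-frame motion) of the two streams:

```python
import torch
import torch.nn.functional as F

def evolutionary_consistency_loss(phoneme: torch.Tensor,
                                  viseme: torch.Tensor) -> torch.Tensor:
    """Hypothetical reading of the paper's consistency loss.

    Both inputs are (batch, time, dim) embeddings aligned frame by frame.
    Frame-to-frame differences act as a proxy for fine-grained motion; the
    loss rewards audio and visual streams that evolve together.
    """
    d_p = phoneme[:, 1:] - phoneme[:, :-1]  # (batch, time-1, dim)
    d_v = viseme[:, 1:] - viseme[:, :-1]
    sim = F.cosine_similarity(d_p, d_v, dim=-1)  # per-step agreement
    return (1.0 - sim).mean()  # zero when the streams co-evolve perfectly
```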
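Third, the phoneme-viseme awareness module. The abstract says only that it performs cross-modal feature fusion and representation alignment; one common way to realize that is bidirectional cross-attention followed by a shared-space projection, sketched here under that assumption:

```python
import torch
import torch.nn as nn

class PhonemeVisemeAwareness(nn.Module):
    """Hypothetical fusion module: bidirectional cross-attention plus a
    shared-space projection to reduce the modality gap."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)  # joint representation space

    def forward(self, audio: torch.Tensor,
                visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, time, dim), frame-aligned.
        a_ctx, _ = self.a2v(audio, visual, visual)  # audio attends to video
        v_ctx, _ = self.v2a(visual, audio, audio)   # video attends to audio
        return self.proj(torch.cat([a_ctx, v_ctx], dim=-1))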
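Finally, the self-supervised pre-training. The abstract states only that the model learns audio-visual correspondences in natural videos; a standard objective with that property is symmetric InfoNCE over clip-level embeddings, where audio and visual embeddings from the same video are positives. A sketch under that assumption (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_emb: torch.Tensor,
                         visual_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over clip-level embeddings (a standard choice;
    the paper only says pre-training learns audio-visual correspondences).

    audio_emb, visual_emb: (batch, dim); row i of each comes from the
    same video, so the diagonal pairs are positives.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature          # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Fine-tuning on a downstream Deepfake dataset, as the abstract describes, would then reuse the pre-trained encoders with a binary real/fake head.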
Related papers
- Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization [3.9440964696313485]
In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity.
Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat.
We propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection.
arXiv Detail & Related papers (2024-08-02T18:45:01Z) - Conditioned Prompt-Optimization for Continual Deepfake Detection [11.634681724245933]
This paper introduces Prompt2Guard, a novel solution for exemplar-free continual deepfake detection of images.
We leverage a prediction ensembling technique with read-only prompts, mitigating the need for multiple forward passes.
Our method exploits a text-prompt conditioning tailored to deepfake detection, which we demonstrate is beneficial in our setting.
arXiv Detail & Related papers (2024-07-31T12:22:57Z) - The Tug-of-War Between Deepfake Generation and Detection [4.62070292702111]
Multimodal generative models are rapidly evolving, leading to a surge in the generation of realistic video and audio.
Deepfake videos, which can convincingly impersonate individuals, have particularly garnered attention due to their potential misuse.
This survey paper examines the dual landscape of deepfake video generation and detection, emphasizing the need for effective countermeasures.
arXiv Detail & Related papers (2024-07-08T17:49:41Z) - Adversarially Robust Deepfake Detection via Adversarial Feature Similarity Learning [0.0]
Deepfake technology has raised concerns about the authenticity of digital content, necessitating the development of effective detection methods.
Adversaries can manipulate deepfake videos with small, imperceptible perturbations that can deceive the detection models into producing incorrect outputs.
We introduce Adversarial Feature Similarity Learning (AFSL), which integrates three fundamental deep feature learning paradigms.
arXiv Detail & Related papers (2024-02-06T11:35:05Z) - AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting
Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z) - CrossDF: Improving Cross-Domain Deepfake Detection with Deep Information Decomposition [53.860796916196634]
We propose a Deep Information Decomposition (DID) framework to enhance the performance of Cross-dataset Deepfake Detection (CrossDF).
Unlike most existing deepfake detection methods, our framework prioritizes high-level semantic features over specific visual artifacts.
It adaptively decomposes facial features into deepfake-related and irrelevant information, only using the intrinsic deepfake-related information for real/fake discrimination.
arXiv Detail & Related papers (2023-09-30T12:30:25Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - Deep Convolutional Pooling Transformer for Deepfake Detection [54.10864860009834]
We propose a deep convolutional Transformer to incorporate decisive image features both locally and globally.
Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy.
The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.
arXiv Detail & Related papers (2022-09-12T15:05:41Z) - Self-supervised Transformer for Deepfake Detection [112.81127845409002]
Deepfake techniques in real-world scenarios require face forgery detectors with stronger generalization abilities.
Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks may provide useful features for deepfake detection.
In this paper, we propose a self-supervised transformer based audio-visual contrastive learning method.
arXiv Detail & Related papers (2022-03-02T17:44:40Z)