Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies
- URL: http://arxiv.org/abs/2504.01470v1
- Date: Wed, 02 Apr 2025 08:24:06 GMT
- Title: Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies
- Authors: Soumyya Kanti Datta, Shan Jia, Siwei Lyu
- Abstract summary: Lip-syncing deepfakes are one of the most challenging deepfakes to detect. We propose LIPINC-V2, a novel framework to detect lip-syncing deepfakes. Our model can successfully capture both short-term and long-term variations in mouth movement.
- Score: 29.81606633121959
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are among the most challenging to detect. In these videos, a person's lip movements are synthesized by AI models to match altered or entirely new audio. Unlike in other types of deepfakes, the artifacts in lip-syncing deepfakes are therefore confined to the mouth region, making them more subtle and thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that combines a vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model can successfully capture both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, which was generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2 .
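The abstract names multihead cross-attention as the mechanism that relates short-term and long-term mouth variations. The following is a minimal NumPy sketch of generic multi-head cross-attention, not the LIPINC-V2 implementation: the function name, the feature shapes, the choice of which stream supplies the queries, and the random projection matrices (standing in for learned weights) are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multihead_cross_attention(queries, context, num_heads, rng):
    """Generic multi-head cross-attention (hypothetical helper).

    queries: (Tq, d) features that attend, e.g. adjacent mouth frames.
    context: (Tk, d) features attended to, e.g. frames sampled across the video.
    Returns the fused features (Tq, d) and the attention weights (heads, Tq, Tk).
    """
    t_q, d = queries.shape
    t_k, _ = context.shape
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    d_head = d // num_heads

    # Assumption: random matrices stand in for learned projection weights.
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)
    )

    # Project, then split the feature axis into heads: (heads, T, d_head).
    q = (queries @ w_q).reshape(t_q, num_heads, d_head).transpose(1, 0, 2)
    k = (context @ w_k).reshape(t_k, num_heads, d_head).transpose(1, 0, 2)
    v = (context @ w_v).reshape(t_k, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, Tq, Tk)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (heads, Tq, d_head)

    # Concatenate the heads and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(t_q, d)
    return merged @ w_o, weights


rng = np.random.default_rng(0)
short_term = rng.standard_normal((5, 16))  # hypothetical: 5 adjacent mouth frames
long_term = rng.standard_normal((8, 16))   # hypothetical: 8 frames across the video
fused, attn = multihead_cross_attention(short_term, long_term, num_heads=4, rng=rng)
print(fused.shape, attn.shape)  # (5, 16) (4, 5, 8)
```

Each row of `attn` sums to 1, so every short-term frame distributes its attention over the long-term frames. In a trained detector the projections would be learned and the fused features passed to a classifier; the paper's actual architecture is in the linked repository.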
Related papers
- Deepfake detection in videos with multiple faces using geometric-fakeness features [79.16635054977068]
Deepfakes of victims or public figures can be used by fraudsters for blackmail, extortion and financial fraud.
In our research we propose to use geometric-fakeness features (GFF) that characterize the dynamic degree of a face's presence in a video.
We employ our approach to analyze videos in which multiple faces are present simultaneously.
arXiv Detail & Related papers (2024-10-10T13:10:34Z)
- Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes [9.993053682230935]
Lip-forgery videos present a formidable challenge to existing DeepFake detection methods.
We propose a novel approach dedicated to lip-forgery identification that exploits the inconsistency between lip movements and audio signals.
Our approach gives an average accuracy of more than 95.3% in spotting lip-syncing videos.
arXiv Detail & Related papers (2024-01-28T14:22:11Z)
- Exposing Lip-syncing Deepfakes from Mouth Inconsistencies [29.81606633121959]
A lip-syncing deepfake is a digitally manipulated video in which a person's lip movements are created convincingly using AI models to match altered or entirely new audio.
In this paper, we describe a novel approach, LIP-syncing detection based on mouth INConsistency (LIPINC) for lip-syncing deepfake detection.
arXiv Detail & Related papers (2024-01-18T16:35:37Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z)
- Undercover Deepfakes: Detecting Fake Segments in Videos [1.2609216345578933]
Deepfake generation has given rise to a new paradigm of deepfakes: mostly real videos that are altered slightly to distort the truth.
In this paper, we present a deepfake detection method that can address this issue by performing deepfake prediction at the frame and video levels.
In particular, the paradigm we address will form a powerful tool for the moderation of deepfakes, where human oversight can be better targeted to the parts of videos suspected of being deepfakes.
arXiv Detail & Related papers (2023-05-11T04:43:10Z)
- Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)
- Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches focus on exploring specific artifacts in deepfake videos.
We propose to perform deepfake detection from an unexplored voice-face matching view.
Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z)
- Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis [69.09526348527203]
Deep generative models have led to highly realistic media, known as deepfakes, that are commonly indistinguishable from real to human eyes.
We propose a novel fake detection method that is designed to re-synthesize testing images and extract visual cues for detection.
We demonstrate the improved effectiveness, cross-GAN generalization, and robustness against perturbations of our approach in a variety of detection scenarios.
arXiv Detail & Related papers (2021-05-29T21:22:24Z)
- Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery Detection [118.37239586697139]
LipForensics is a detection approach capable of both generalising to novel manipulations and withstanding various distortions.
It consists in first pretraining a spatio-temporal network to perform visual speech recognition (lipreading).
A temporal network is subsequently finetuned on fixed mouth embeddings of real and forged data in order to detect fake videos based on mouth movements without over-fitting to low-level, manipulation-specific artefacts.
arXiv Detail & Related papers (2020-12-14T15:53:56Z)
- Two-branch Recurrent Network for Isolating Deepfakes in Videos [17.59209853264258]
We present a method for deepfake detection based on a two-branch network structure.
One branch propagates the original information, while the other branch suppresses the face content.
Our two novel components show promising results on the FaceForensics++, Celeb-DF, and Facebook's DFDC preview benchmarks.
arXiv Detail & Related papers (2020-08-08T01:38:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.