Related papers: Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition

URL: http://arxiv.org/abs/2511.22443v1
Date: Thu, 27 Nov 2025 13:30:59 GMT
Title: Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition
Authors: Maheswar Bora, Tashvik Dhamija, Shukesh Reddy, Baptiste Chopin, Pranav Balaji, Abhijit Das, Antitza Dantcheva
Abstract summary: Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. We propose a novel network FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features.
Score: 8.510683305368278
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context has to do with zero-shot detection, i.e., generalizable detection, which we focus on in this work. FauxNet consistently outperforms the state-of-the-art in this setting. In addition, FauxNet is able to perform attribution, i.e., to distinguish between the generation techniques from which the videos stem. Finally, we propose new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the latter created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.

Related papers

ExDDV: A New Dataset for Explainable Deepfake Detection in Video [23.169975307069066]
We introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos.
arXiv Detail & Related papers (2025-03-18T16:55:07Z)
Deepfake detection in videos with multiple faces using geometric-fakeness features [79.16635054977068]
Deepfakes of victims or public figures can be used by fraudsters for blackmail, extortion and financial fraud. In our research we propose to use geometric-fakeness features (GFF) that characterize a dynamic degree of a face presence in a video. We employ our approach to analyze videos with multiple faces that are simultaneously present in a video.
arXiv Detail & Related papers (2024-10-10T13:10:34Z)
Shaking the Fake: Detecting Deepfake Videos in Real Time via Active Probes [3.6308756891251392]
Real-time deepfake, a type of generative AI, is capable of "creating" non-existing contents (e.g., swapping one's face with another) in a video. It has been misused to produce deepfake videos for malicious purposes, including financial scams and political misinformation. We propose SFake, a new real-time deepfake detection method that exploits deepfake models' inability to adapt to physical interference.
arXiv Detail & Related papers (2024-09-17T04:58:30Z)
GenConViT: Deepfake Video Detection Using Generative Convolutional Vision Transformer [10.135975246717113]
We propose a Generative Convolutional Vision Transformer (GenConViT) for deepfake video detection. Our model combines ConvNeXt and Swin Transformer models for feature extraction. By learning from the visual artifacts and latent data distribution, GenConViT achieves improved performance in detecting a wide range of deepfake videos.
arXiv Detail & Related papers (2023-07-13T19:27:40Z)
SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection [54.74467470358476]
This paper proposes a dataset for scene fake audio detection named SceneFake. A manipulated audio is generated by only tampering with the acoustic scene of an original audio. Some scene fake audio detection benchmark results on the SceneFake dataset are reported in this paper.
arXiv Detail & Related papers (2022-11-11T09:05:50Z)
Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world. We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity. Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches contribute to exploring the specific artifacts in deepfake videos. We propose to perform the deepfake detection from an unexplored voice-face matching view. Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z)
Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis [69.09526348527203]
Deep generative models have led to highly realistic media, known as deepfakes, that are commonly indistinguishable from real to human eyes. We propose a novel fake detection that is designed to re-synthesize testing images and extract visual cues for detection. We demonstrate the improved effectiveness, cross-GAN generalization, and robustness against perturbations of our approach in a variety of detection scenarios.
arXiv Detail & Related papers (2021-05-29T21:22:24Z)
What's wrong with this video? Comparing Explainers for Deepfake Detection [13.089182408360221]
Deepfakes are computer manipulated videos where the face of an individual has been replaced with that of another. In this work we develop, extend and compare white-box, black-box and model-specific techniques for explaining the labelling of real and fake videos. In particular, we adapt SHAP, GradCAM and self-attention models to the task of explaining the predictions of state-of-the-art detectors based on EfficientNet.
arXiv Detail & Related papers (2021-05-12T18:44:39Z)
WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection [82.42495493102805]
We introduce a new dataset WildDeepfake which consists of 7,314 face sequences extracted from 707 deepfake videos collected completely from the internet. We conduct a systematic evaluation of a set of baseline detection networks on both existing and our WildDeepfake datasets, and show that WildDeepfake is indeed a more challenging dataset, where the detection performance can decrease drastically.
arXiv Detail & Related papers (2021-01-05T11:10:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.