Related papers: MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection

MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection

URL: http://arxiv.org/abs/2310.02234v2
Date: Fri, 13 Oct 2023 23:43:15 GMT
Title: MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection
Authors: Vinaya Sree Katamneni and Ajita Rattani
Abstract summary: A novel kind of deepfakes has emerged with either audio or visual modalities manipulated. Existing multimodal deepfake detectors are often based on the fusion of the audio and visual streams from the video. In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection.
Score: 4.659427498118277
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Deepfakes are synthetic media generated using deep generative algorithms and have posed a severe societal and political threat. Apart from facial manipulation and synthetic voice, recently, a novel kind of deepfakes has emerged with either audio or visual modalities manipulated. In this regard, a new generation of multimodal audio-visual deepfake detectors is being investigated to collectively focus on audio and visual data for multimodal manipulation detection. Existing multimodal (audio-visual) deepfake detectors are often based on the fusion of the audio and visual streams from the video. Existing studies suggest that these multimodal detectors often obtain equivalent performances with unimodal audio and visual deepfake detectors. We conjecture that the heterogeneous nature of the audio and visual signals creates distributional modality gaps and poses a significant challenge to effective fusion and efficient performance. In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection. Specifically, we propose the joint use of modality (audio and visual) invariant and specific representations. This ensures that the common patterns and patterns specific to each modality representing pristine or fake content are preserved and fused for multimodal deepfake manipulation detection. Our experimental results on FakeAVCeleb and KoDF audio-visual deepfake datasets suggest the enhanced accuracy of our proposed method over SOTA unimodal and multimodal audio-visual deepfake detectors by $17.8$% and $18.4$%, respectively. Thus, obtaining state-of-the-art performance.

Related papers

Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework [19.53717894228692]
Deepfakes are AI-synthesized multimedia data that may be abused for spreading misinformation.<n>We propose a lightweight network for audio-visual deepfake detection via a single-stream multi-modal learning framework.<n>Our method is significantly lightweight with only 0.48M parameters, yet it achieves superiority in both uni-modal and multi-modal deepfakes.
arXiv Detail & Related papers (2025-06-09T02:13:04Z)
Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook [101.30779332427217]
We survey deepfake generation and detection techniques, including the most recent developments in the field. We identify various kinds of deepfakes, according to the procedure used to alter or generate the fake content. We develop a novel multimodal benchmark to evaluate deepfake detectors on out-of-distribution content.
arXiv Detail & Related papers (2024-11-29T08:29:25Z)
DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization [13.840950434728533]
DiMoDif is an audio-visual deepfake detection framework. It exploits the inter-modality differences in machine perception of speech. It temporally localizes the deepfake forgery.
arXiv Detail & Related papers (2024-11-15T13:47:33Z)
Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization [3.9440964696313485]
In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity. Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat. We propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection.
arXiv Detail & Related papers (2024-08-02T18:45:01Z)
AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection [32.502184301996216]
Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. Previous methods mainly adopt uni-modal video forensics and use supervised pre-training for forgery detection. This study proposes a new method based on a multi-modal self-supervised-learning (SSL) feature extractor.
arXiv Detail & Related papers (2023-11-05T18:35:03Z)
AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous work on detecting AI-generated fake videos only utilize visual modality or audio modality. We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
Integrating Audio-Visual Features for Multimodal Deepfake Detection [33.51027054306748]
Deepfakes are AI-generated media in which an image or video has been digitally modified. This paper proposes an audio-visual-based method for deepfake detection, which integrates fine-grained deepfake identification with binary classification.
arXiv Detail & Related papers (2023-10-05T18:19:56Z)
DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention [13.671150394943684]
We present a novel multi-modal audio-video framework designed to concurrently process audio and video inputs for deepfake detection tasks. Our model capitalizes on lip synchronization with input audio through a cross-attention mechanism while extracting visual cues via a fine-tuned VGG-16 network.
arXiv Detail & Related papers (2023-09-12T18:37:05Z)
NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos. We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics. Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z)
Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world. We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity. Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches contribute to exploring the specific artifacts in deepfake videos. We propose to perform the deepfake detection from an unexplored voice-face matching view. Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z)
M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [74.19291916812921]
forged images generated by Deepfake techniques pose a serious threat to the trustworthiness of digital information. In this paper, we aim to capture the subtle manipulation artifacts at different scales for Deepfake detection. We introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 DeepFake videos generated by state-of-the-art face swapping and facial reenactment methods.
arXiv Detail & Related papers (2021-04-20T05:43:44Z)
Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content. We extract and analyze the similarity between the two audio and visual modalities from within the same video. We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.