AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency
for Video Deepfake Detection
- URL: http://arxiv.org/abs/2311.02733v1
- Date: Sun, 5 Nov 2023 18:35:03 GMT
- Title: AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency
for Video Deepfake Detection
- Authors: Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao,
Hsin-Min Wang
- Abstract summary: Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content.
Previous methods mainly adopt uni-modal video forensics and use supervised pre-training for forgery detection.
This study proposes a new method based on a multi-modal self-supervised-learning (SSL) feature extractor.
- Score: 32.502184301996216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal manipulations (also known as audio-visual deepfakes) make it
difficult for unimodal deepfake detectors to detect forgeries in multimedia
content. To avoid the spread of false propaganda and fake news, timely
detection is crucial. Manipulation of either modality (i.e., visual or audio) can
only be discovered by multi-modal models that exploit both sources of
information simultaneously. Previous methods mainly adopt uni-modal video
forensics and use supervised pre-training for forgery detection. This study
proposes a new method based on a multi-modal self-supervised-learning (SSL)
feature extractor to exploit inconsistency between audio and visual modalities
for multi-modal video forgery detection. We use the transformer-based SSL
pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic
feature extractor and a multi-scale temporal convolutional neural network to
capture the temporal correlation between the audio and visual modalities. Since
AV-HuBERT only extracts visual features from the lip region, we also adopt
another transformer-based video model to exploit facial features and capture
spatial and temporal artifacts caused during the deepfake generation process.
Experimental results show that our model outperforms all existing models and
achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT
datasets.
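
As a rough illustration of the fusion stage described above, the sketch below (a minimal, hypothetical PyTorch example, not the authors' released code) feeds time-aligned per-frame audio and visual embeddings, such as those a frozen AV-HuBERT encoder could produce, through a multi-scale temporal convolutional block and pools them for binary real/fake classification. All module names, feature dimensions, and the pooling choice are assumptions; parallel kernel sizes are one way to read "multi-scale temporal" in the abstract, letting the block compare audio-visual agreement over short and longer windows.

```python
# Hypothetical sketch of the fusion described in the abstract; dimensions are assumptions.
import torch
import torch.nn as nn

class MultiScaleTemporalBlock(nn.Module):
    """Parallel 1-D convolutions with different kernel sizes over the time axis."""
    def __init__(self, in_dim, out_dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_dim, out_dim, k, padding=k // 2) for k in kernel_sizes
        ])
        self.proj = nn.Conv1d(out_dim * len(kernel_sizes), out_dim, 1)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time)
        x = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(self.proj(x))

class AVFusionClassifier(nn.Module):
    """Fuses per-frame audio and visual embeddings and predicts real vs. fake."""
    def __init__(self, audio_dim=768, visual_dim=768, hidden=256):
        super().__init__()
        self.temporal = MultiScaleTemporalBlock(audio_dim + visual_dim, hidden)
        self.head = nn.Linear(hidden, 2)

    def forward(self, audio_feats, visual_feats):
        # audio_feats, visual_feats: (batch, time, dim), assumed time-aligned
        x = torch.cat([audio_feats, visual_feats], dim=-1).transpose(1, 2)
        x = self.temporal(x).mean(dim=-1)  # average-pool over time
        return self.head(x)

# Example with random tensors standing in for AV-HuBERT outputs.
model = AVFusionClassifier()
logits = model(torch.randn(2, 50, 768), torch.randn(2, 50, 768))
print(logits.shape)  # torch.Size([2, 2])
```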
Related papers
- A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection [17.285669984798975]
This paper addresses the challenge of developing a robust audio-visual deepfake detection model.
New-generation algorithms continually emerge and are not encountered during the development of detection methods.
We propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique.
arXiv Detail & Related papers (2024-06-20T10:33:15Z) - AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting
Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z) - MIS-AVoiDD: Modality Invariant and Specific Representation for
Audio-Visual Deepfake Detection [4.659427498118277]
A novel kind of deepfake has emerged in which either the audio or the visual modality is manipulated.
Existing multimodal deepfake detectors are often based on the fusion of the audio and visual streams from the video.
In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection.
arXiv Detail & Related papers (2023-10-03T17:43:24Z) - DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio
Cross-Attention and Facial Self-Attention [13.671150394943684]
We present a novel multi-modal audio-video framework designed to concurrently process audio and video inputs for deepfake detection tasks.
Our model capitalizes on lip synchronization with the input audio through a cross-attention mechanism (see the sketch after this list) while extracting visual cues via a fine-tuned VGG-16 network.
arXiv Detail & Related papers (2023-09-12T18:37:05Z) - Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - Glitch in the Matrix: A Large Scale Benchmark for Content Driven
Audio-Visual Forgery Detection and Localization [20.46053083071752]
We propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF).
LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations.
The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture.
arXiv Detail & Related papers (2023-05-03T08:48:45Z) - Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in
VIS and NIR Scenario [87.72258480670627]
Existing face forgery detection methods based on the frequency domain find that GAN-forged images exhibit obvious grid-like artifacts in the frequency spectrum compared to real images.
This paper proposes a Discrete Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation.
arXiv Detail & Related papers (2022-07-05T09:27:53Z) - Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z) - Audio-visual Representation Learning for Anomaly Events Detection in
Crowds [119.72951028190586]
This paper attempts to exploit multi-modal learning for modeling the audio and visual signals simultaneously.
We conduct the experiments on SHADE dataset, a synthetic audio-visual dataset in surveillance scenes.
We find that introducing audio signals effectively improves the performance of anomaly event detection and outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-10-28T02:42:48Z) - Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using
Affective Cues [75.1731999380562]
We present a learning-based method for detecting fake (deepfake) multimedia content.
We extract and analyze the similarity between the audio and visual modalities within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)