AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection
- URL: http://arxiv.org/abs/2310.13103v2
- Date: Sun, 06 Jul 2025 11:50:10 GMT
- Title: AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection
- Authors: Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, Hsin-Min Wang,
- Abstract summary: This study introduces the audio-visual transformer-based ensemble network (AVTENet) to detect deepfake videos.<n>For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset.<n>For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset.
- Score: 49.81915942821647
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous studies on detecting artificial intelligence-generated fake videos only utilize visual modality or audio modality. While some methods exploit audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multimodal datasets of deepfake videos involving acoustic and visual manipulations, and are mostly based on convolutional neural networks with low detection accuracy. Considering that human cognition instinctively integrates multisensory information including audio and visual cues to perceive and interpret content and the success of transformer in various fields, this study introduces the audio-visual transformer-based ensemble network (AVTENet). This innovative framework tackles the complexities of deepfake technology by integrating both acoustic and visual manipulations to enhance the accuracy of video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture video, audio, and audio-visual salient cues to reach a consensus in prediction. For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that the proposed model outperforms all existing methods and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset. We also compare AVTENet against humans in detecting video forgery. The results show that AVTENet significantly outperforms humans.
Related papers
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection [54.88903878778194]
The field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content.<n>We propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos.<n>Our method achieves high detection accuracy, above 90% on average, underscoring its effectiveness.
arXiv Detail & Related papers (2025-07-17T15:36:39Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.
Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - A Multimodal Framework for Deepfake Detection [0.0]
Deepfakes, synthetic media created using AI, can convincingly alter videos and audio to misrepresent reality.
Our research addresses the critical issue of deepfakes through an innovative multimodal approach.
Our framework combines visual and auditory analyses, yielding an accuracy of 94%.
arXiv Detail & Related papers (2024-10-04T14:59:10Z) - Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization [3.9440964696313485]
In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity.
Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat.
We propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection.
arXiv Detail & Related papers (2024-08-02T18:45:01Z) - A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection [17.285669984798975]
This paper addresses the challenge of developing a robust audio-visual deepfake detection model.
New generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods.
We propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique.
arXiv Detail & Related papers (2024-06-20T10:33:15Z) - AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection [2.985620880452743]
We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method for improved deepfake detection.
To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual masking and feature fusion strategy.
We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state-of-the-art by 14.9% and 9.9%, respectively.
arXiv Detail & Related papers (2024-06-05T05:20:12Z) - AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency
for Video Deepfake Detection [32.502184301996216]
Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content.
Previous methods mainly adopt uni-modal video forensics and use supervised pre-training for forgery detection.
This study proposes a new method based on a multi-modal self-supervised-learning (SSL) feature extractor.
arXiv Detail & Related papers (2023-11-05T18:35:03Z) - MIS-AVoiDD: Modality Invariant and Specific Representation for
Audio-Visual Deepfake Detection [4.659427498118277]
A novel kind of deepfakes has emerged with either audio or visual modalities manipulated.
Existing multimodal deepfake detectors are often based on the fusion of the audio and visual streams from the video.
In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection.
arXiv Detail & Related papers (2023-10-03T17:43:24Z) - DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio
Cross-Attention and Facial Self-Attention [13.671150394943684]
We present a novel multi-modal audio-video framework designed to concurrently process audio and video inputs for deepfake detection tasks.
Our model capitalizes on lip synchronization with input audio through a cross-attention mechanism while extracting visual cues via a fine-tuned VGG-16 network.
arXiv Detail & Related papers (2023-09-12T18:37:05Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive
Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in parameter harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake
Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z) - Glitch in the Matrix: A Large Scale Benchmark for Content Driven
Audio-Visual Forgery Detection and Localization [20.46053083071752]
We propose and benchmark a new dataset, Localized Visual DeepFake (LAV-DF)
LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations.
The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture.
arXiv Detail & Related papers (2023-05-03T08:48:45Z) - Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z) - Robust Unsupervised Video Anomaly Detection by Multi-Path Frame
Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method by frame prediction with proper design.
Our proposed method obtains the frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
arXiv Detail & Related papers (2020-11-05T11:34:12Z) - Multi-Modal Video Forensic Platform for Investigating Post-Terrorist
Attack Scenarios [55.82693757287532]
Large scale Video Analytic Platforms (VAP) assist law enforcement agencies (LEA) in identifying suspects and securing evidence.
We present a video analytic platform that integrates visual and audio analytic modules and fuses information from surveillance cameras and video uploads from eyewitnesses.
arXiv Detail & Related papers (2020-04-02T14:29:27Z) - Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using
Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content.
We extract and analyze the similarity between the two audio and visual modalities from within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.