Related papers: ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection

ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection

URL: http://arxiv.org/abs/2508.17282v1
Date: Sun, 24 Aug 2025 10:03:46 GMT
Title: ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection
Authors: Xin Zhang, Jiaming Chu, Jian Zhao, Yuchu Jiang, Xu Yang, Lei Jin, Chi Zhang, Xuelong Li,
Abstract summary: We present ERF-BA-TFD+, a novel deepfake detection model that combines enhanced receptive field (ERF) and audio-visual fusion.<n>Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness.<n>We evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips.
Score: 49.14187862877009
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deepfake detection is a critical task in identifying manipulated multimedia content. In real-world scenarios, deepfake content can manifest across multiple modalities, including audio and video. To address this challenge, we present ERF-BA-TFD+, a novel multimodal deepfake detection model that combines enhanced receptive field (ERF) and audio-visual fusion. Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. The key innovation of ERF-BA-TFD+ lies in its ability to model long-range dependencies within the audio-visual input, allowing it to better capture subtle discrepancies between real and fake content. In our experiments, we evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips. Unlike previous benchmarks, which focused primarily on isolated segments, the DDL-AV dataset allows us to assess the model's performance in a more comprehensive and realistic setting. Our method achieves state-of-the-art results on this dataset, outperforming existing techniques in terms of both accuracy and processing speed. The ERF-BA-TFD+ model demonstrated its effectiveness in the "Workshop on Deepfake Detection, Localization, and Interpretability," Track 2: Audio-Visual Detection and Localization (DDL-AV), and won first place in this competition.

Related papers

Leveraging large multimodal models for audio-video deepfake detection: a pilot study [20.17103408581687]
We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification - "Is this video real or fake?"<n>Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by audio-visual encoder full fine-tuning.<n>On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on Mavos-DD datasets.
arXiv Detail & Related papers (2026-02-25T04:39:08Z)
Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict.<n>DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods.<n>On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z)
Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework [19.53717894228692]
Deepfakes are AI-synthesized multimedia data that may be abused for spreading misinformation.<n>We propose a lightweight network for audio-visual deepfake detection via a single-stream multi-modal learning framework.<n>Our method is significantly lightweight with only 0.48M parameters, yet it achieves superiority in both uni-modal and multi-modal deepfakes.
arXiv Detail & Related papers (2025-06-09T02:13:04Z)
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.<n>Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders. Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection [32.502184301996216]
Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. Previous methods mainly adopt uni-modal video forensics and use supervised pre-training for forgery detection. This study proposes a new method based on a multi-modal self-supervised-learning (SSL) feature extractor.
arXiv Detail & Related papers (2023-11-05T18:35:03Z)
AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection [49.81915942821647]
This study introduces the audio-visual transformer-based ensemble network (AVTENet) to detect deepfake videos.<n>For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset.<n>For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization [20.46053083071752]
We propose and benchmark a new dataset, Localized Visual DeepFake (LAV-DF) LAV-DF consists of strategic content-driven audio, visual and audio-visual manipulations. The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture.
arXiv Detail & Related papers (2023-05-03T08:48:45Z)
Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content. We extract and analyze the similarity between the two audio and visual modalities from within the same video. We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.