Leveraging large multimodal models for audio-video deepfake detection: a pilot study
- URL: http://arxiv.org/abs/2602.23393v1
- Date: Wed, 25 Feb 2026 04:39:08 GMT
- Title: Leveraging large multimodal models for audio-video deepfake detection: a pilot study
- Authors: Songjun Cao, Yuqi Li, Yunpeng Luo, Jianjun Yin, Long Ma
- Abstract summary: We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification: "Is this video real or fake?" Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by full fine-tuning of the audio-visual encoders. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on the Mavos-DD dataset.
- Score: 20.17103408581687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification: "Is this video real or fake?" Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by full fine-tuning of the audio-visual encoders. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on the Mavos-DD dataset.
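The stage-1 "lightweight LoRA alignment" mentioned in the abstract rests on the standard low-rank update W' = W + (alpha/r) * B @ A, where B is zero-initialized so the adapted model starts out identical to the base model. The sketch below is a minimal, dependency-free illustration of that update; the rank, alpha, toy weights, and the `lora_weight` helper are all hypothetical and not taken from the paper.

```python
# Minimal sketch of the LoRA weight update used in a stage-1
# "LoRA alignment" phase. All values here are illustrative; the
# paper's actual ranks, target layers, and hyperparameters are
# not specified in the abstract.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the LoRA-adapted weight.

    w: frozen base weight, shape (d_out, d_in)
    a: trainable projection, shape (r, d_in)
    b: trainable projection, shape (d_out, r), zero-initialized
    """
    r = len(a)                      # adapter rank
    delta = matmul(b, a)            # low-rank update, (d_out, d_in)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Toy example: 2x2 identity base weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[2.0, 3.0]]             # (r, d_in)
B_init = [[0.0], [0.0]]      # zero-init: adapter is a no-op at start
B_late = [[1.0], [0.5]]      # made-up values after some training

print(lora_weight(W, A, B_init, alpha=1.0))  # [[1.0, 0.0], [0.0, 1.0]]
print(lora_weight(W, A, B_late, alpha=1.0))  # [[3.0, 3.0], [1.0, 2.5]]
```

Because B starts at zero, stage 1 can cheaply align modalities without disturbing the pretrained weights; stage 2 (full fine-tuning of the encoders) then updates W itself.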
Related papers
- Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection [6.491407316650203]
Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes. We aim to explore the potential of MLLMs for audio deepfake detection.
arXiv Detail & Related papers (2026-01-02T18:17:22Z) - ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection [49.14187862877009]
We present ERF-BA-TFD+, a novel deepfake detection model that combines enhanced receptive field (ERF) and audio-visual fusion. Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. We evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips.
arXiv Detail & Related papers (2025-08-24T10:03:46Z) - Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework [19.53717894228692]
Deepfakes are AI-synthesized multimedia data that may be abused for spreading misinformation. We propose a lightweight network for audio-visual deepfake detection via a single-stream multi-modal learning framework. Our method is extremely lightweight, with only 0.48M parameters, yet achieves superior performance on both uni-modal and multi-modal deepfakes.
arXiv Detail & Related papers (2025-06-09T02:13:04Z) - MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in the QA task on our proposed AVHaystacks.
arXiv Detail & Related papers (2025-06-08T06:34:29Z) - AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
Multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities. Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial. We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z) - AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection [32.502184301996216]
Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content.
Previous methods mainly adopt uni-modal video forensics and use supervised pre-training for forgery detection.
This study proposes a new method based on a multi-modal self-supervised-learning (SSL) feature extractor.
arXiv Detail & Related papers (2023-11-05T18:35:03Z) - AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection [49.81915942821647]
This study introduces the audio-visual transformer-based ensemble network (AVTENet) to detect deepfake videos. For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset.
arXiv Detail & Related papers (2023-10-19T19:01:26Z) - MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection [4.659427498118277]
A novel kind of deepfakes has emerged with either audio or visual modalities manipulated.
Existing multimodal deepfake detectors are often based on the fusion of the audio and visual streams from the video.
In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection.
arXiv Detail & Related papers (2023-10-03T17:43:24Z) - Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion [70.99781219121803]
Audio Deepfake Detection (ADD) aims to detect the fake audio generated by text-to-speech (TTS), voice conversion (VC) and replay, etc.
We propose a novel ADD model, termed as M2S-ADD, that attempts to discover audio authenticity cues during the mono-to-stereo conversion process.
arXiv Detail & Related papers (2023-05-25T02:54:29Z) - Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches contribute to exploring the specific artifacts in deepfake videos.
We propose to perform the deepfake detection from an unexplored voice-face matching view.
Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.