Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection
- URL: http://arxiv.org/abs/2601.00777v1
- Date: Fri, 02 Jan 2026 18:17:22 GMT
- Title: Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection
- Authors: Akanksha Chuchra, Shukesh Reddy, Sudeepta Mishra, Abhijit Das, Abhinav Dhall
- Abstract summary: Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes. We aim to explore the potential of MLLMs for audio deepfake detection.
- Score: 6.491407316650203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we aim to explore the potential of MLLMs for audio deepfake detection. We combine audio inputs with a range of text prompts as queries to assess whether MLLMs can learn robust cross-modal representations for audio deepfake detection. To this end, we explore text-aware, context-rich, question-answer-based prompts with binary decisions. We hypothesise that such feature-guided reasoning will facilitate deeper multimodal understanding and enable robust feature learning for audio deepfake detection. We evaluate the performance of two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection. They also show that the models perform poorly without task-specific training and struggle to generalise to out-of-domain data. However, the models achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
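The multi-prompt, binary-decision querying idea described in the abstract can be sketched as follows. This is a minimal illustration only: the prompt wordings, the fake-indicating keywords, and the majority-vote aggregation are assumptions for illustration, not the authors' actual prompts or decision rule. In a full pipeline, each prompt would be sent alongside the audio clip to an MLLM such as Qwen2-Audio-7B-Instruct, and the free-form answers would be parsed into binary votes as below.

```python
# Each entry pairs a feature-guided question with the answer keyword
# that we treat as a vote for "fake". (Hypothetical prompts.)
PROMPTS = [
    ("Does the prosody or intonation sound unnatural? Answer yes or no.", "yes"),
    ("Do you hear robotic or metallic synthesis artifacts? Answer yes or no.", "yes"),
    ("Is this speech real or fake? Answer real or fake.", "fake"),
]


def decide(answers):
    """Aggregate one free-form MLLM answer per prompt into a single label.

    Each answer is scanned for that prompt's fake-indicating keyword;
    a simple majority of fake votes yields the final decision.
    """
    votes = [1 if keyword in ans.lower() else 0
             for (_, keyword), ans in zip(PROMPTS, answers)]
    return "fake" if sum(votes) > len(votes) / 2 else "real"
```

For example, `decide(["Yes, it sounds unnatural.", "No.", "This is fake."])` returns `"fake"`, while `decide(["No.", "No.", "It sounds real."])` returns `"real"`. Keyword matching on free-form answers is a deliberate simplification; constraining the model's output format would make parsing more reliable.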
Related papers
- A Multimodal Depth-Aware Method For Embodied Reference Understanding [56.30142869506262]
Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. We propose a novel ERU framework that jointly leverages data augmentation, depth-map modality, and a depth-aware decision module.
arXiv Detail & Related papers (2025-10-09T14:32:21Z) - ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection [49.14187862877009]
We present ERF-BA-TFD+, a novel deepfake detection model that combines enhanced receptive field (ERF) and audio-visual fusion. Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. We evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips.
arXiv Detail & Related papers (2025-08-24T10:03:46Z) - KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features [1.488627850405606]
We propose multimodal approaches for the AV-Deepfake1M 2025 challenge. For the visual modality, we leverage handcrafted features to improve interpretability and adaptability. For the audio modality, we adapt a self-supervised learning backbone coupled with graph attention networks to capture rich audio representations. Our approach strikes a balance between performance and real-world deployment, focusing on resilience and potential interpretability.
arXiv Detail & Related papers (2025-08-10T13:29:08Z) - Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework [19.53717894228692]
Deepfakes are AI-synthesized multimedia data that may be abused for spreading misinformation. We propose a lightweight network for audio-visual deepfake detection via a single-stream multi-modal learning framework. Our method is significantly lightweight with only 0.48M parameters, yet it achieves superior performance on both uni-modal and multi-modal deepfakes.
arXiv Detail & Related papers (2025-06-09T02:13:04Z) - Can Multi-modal (reasoning) LLMs work as deepfake detectors? [6.36797761822772]
We benchmark 12 of the latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets. Our findings indicate that the best multi-modal LLMs achieve competitive performance with promising generalization ability in the zero-shot setting. This study highlights the potential of integrating multi-modal reasoning in future deepfake detection frameworks.
arXiv Detail & Related papers (2025-03-25T21:47:29Z) - AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
Multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities. Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial. We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z) - How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human Perception [30.295294657519165]
Multimodal deepfakes involving audiovisual manipulations are a growing threat because they are difficult to detect with the naked eye or using unimodal deep learning-based forgery detection methods.
In this study, we examine the detection capabilities of a large language model (LLM) to identify and account for any possible visual and auditory artifacts and manipulations in audiovisual deepfake content.
arXiv Detail & Related papers (2024-11-14T08:07:02Z) - Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics [46.99625341531352]
DeepFakes, which refer to AI-generated media content, have become an increasing concern due to their use as a means for disinformation.
We investigate the capabilities of multimodal large language models (LLMs) in DeepFake detection.
arXiv Detail & Related papers (2024-03-21T01:57:30Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content.
We extract and analyze the similarity between the audio and visual modalities from within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.