Can Multi-modal (reasoning) LLMs work as deepfake detectors?
- URL: http://arxiv.org/abs/2503.20084v2
- Date: Sat, 29 Mar 2025 19:19:14 GMT
- Title: Can Multi-modal (reasoning) LLMs work as deepfake detectors?
- Authors: Simiao Ren, Yao Yao, Kidus Zewde, Zisheng Liang, Tsang, Ng, Ning-Yau Cheng, Xiaoou Zhan, Qinzhe Liu, Yifei Chen, Hengwei Xu,
- Abstract summary: We benchmark 12 latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets.<n>Our findings indicate that best multi-modal LLMs achieve competitive performance with promising generalization ability with zero shot.<n>This study highlights the potential of integrating multi-modal reasoning in future deepfake detection frameworks.
- Score: 6.36797761822772
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Deepfake detection remains a critical challenge in the era of advanced generative models, particularly as synthetic media becomes more sophisticated. In this study, we explore the potential of state of the art multi-modal (reasoning) large language models (LLMs) for deepfake image detection such as (OpenAI O1/4o, Gemini thinking Flash 2, Deepseek Janus, Grok 3, llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, Claude 3.5/3.7 sonnet) . We benchmark 12 latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets, including recently published real-world deepfake imagery. To enhance performance, we employ prompt tuning and conduct an in-depth analysis of the models' reasoning pathways to identify key contributing factors in their decision-making process. Our findings indicate that best multi-modal LLMs achieve competitive performance with promising generalization ability with zero shot, even surpass traditional deepfake detection pipelines in out-of-distribution datasets while the rest of the LLM families performs extremely disappointing with some worse than random guess. Furthermore, we found newer model version and reasoning capabilities does not contribute to performance in such niche tasks of deepfake detection while model size do help in some cases. This study highlights the potential of integrating multi-modal reasoning in future deepfake detection frameworks and provides insights into model interpretability for robustness in real-world scenarios.
Related papers
- DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis [62.31018417955254]
DeepMLF is a novel multimodal language model with learnable tokens tailored toward deep fusion.
Our results confirm that deeper fusion leads to better performance, with optimal fusion depths (5-7) exceeding those of existing approaches.
arXiv Detail & Related papers (2025-04-15T11:28:02Z) - Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization [3.9440964696313485]
In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity.
Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat.
We propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection.
arXiv Detail & Related papers (2024-08-02T18:45:01Z) - Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics [46.99625341531352]
DeepFakes, which refer to AI-generated media content, have become an increasing concern due to their use as a means for disinformation.
We investigate the capabilities of multimodal large language models (LLMs) in DeepFake detection.
arXiv Detail & Related papers (2024-03-21T01:57:30Z) - Linguistic Profiling of Deepfakes: An Open Database for Next-Generation
Deepfake Detection [40.20982463380279]
This paper introduces a deepfake database (DFLIP-3K) for the development of convincing and explainable deepfake detection.
It encompasses about 300K diverse deepfake samples from approximately 3K generative models, which boasts the largest number of deepfake models in the literature.
The two distinguished features enable DFLIP-3K to develop a benchmark that promotes progress in linguistic profiling of deepfakes.
arXiv Detail & Related papers (2024-01-04T16:19:52Z) - Detecting and Grounding Multi-Modal Media Manipulation and Beyond [93.08116982163804]
We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4)
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
arXiv Detail & Related papers (2023-09-25T15:05:46Z) - Quality-Agnostic Deepfake Detection with Intra-model Collaborative
Learning [26.517887637150594]
Deepfake has recently raised a plethora of societal concerns over its possible security threats and dissemination of fake information.
Most SOTA approaches are limited by using a single specific model for detecting certain deepfake video quality type.
We propose a universal intra-model collaborative learning framework to enable the effective and simultaneous detection of different quality of deepfakes.
arXiv Detail & Related papers (2023-09-12T02:01:31Z) - SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video
Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z) - M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [74.19291916812921]
forged images generated by Deepfake techniques pose a serious threat to the trustworthiness of digital information.
In this paper, we aim to capture the subtle manipulation artifacts at different scales for Deepfake detection.
We introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 DeepFake videos generated by state-of-the-art face swapping and facial reenactment methods.
arXiv Detail & Related papers (2021-04-20T05:43:44Z) - Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition which is termed as AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.