Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection
- URL: http://arxiv.org/abs/2503.14853v2
- Date: Sat, 07 Jun 2025 08:33:19 GMT
- Title: Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection
- Authors: Peipeng Yu, Jianwei Fei, Hui Gao, Xuan Feng, Zhihua Xia, Chip Hong Chang
- Abstract summary: Current Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data. We present a novel framework that unlocks LVLMs' potential capabilities for deepfake detection.
- Score: 18.125287697902813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment of their knowledge and forensics patterns. To this end, we present a novel framework that unlocks LVLMs' potential capabilities for deepfake detection. Our framework includes a Knowledge-guided Forgery Detector (KFD), a Forgery Prompt Learner (FPL), and a Large Language Model (LLM). The KFD is used to calculate correlations between image features and pristine/deepfake image description embeddings, enabling forgery classification and localization. The outputs of the KFD are subsequently processed by the Forgery Prompt Learner to construct fine-grained forgery prompt embeddings. These embeddings, along with visual and question prompt embeddings, are fed into the LLM to generate textual detection responses. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, DFDC, and DF40, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.
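For orientation, the sketch below illustrates the core idea behind the Knowledge-guided Forgery Detector as described in the abstract: correlating image patch features with pristine/deepfake description embeddings to obtain both a global real/fake decision and a coarse localization map. All shapes, projections, and pooling choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeGuidedForgeryDetector(nn.Module):
    """Illustrative sketch: score image patch features against pristine/deepfake
    text-description embeddings (hypothetical shapes, not the paper's code)."""

    def __init__(self, img_dim=768, txt_dim=768, proj_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, proj_dim)   # project visual patch tokens
        self.txt_proj = nn.Linear(txt_dim, proj_dim)   # project description embeddings

    def forward(self, img_feats, pristine_emb, deepfake_emb):
        # img_feats: (B, N, img_dim) patch tokens; *_emb: (txt_dim,)
        v = F.normalize(self.img_proj(img_feats), dim=-1)                         # (B, N, D)
        t = F.normalize(self.txt_proj(torch.stack([pristine_emb,
                                                   deepfake_emb])), dim=-1)       # (2, D)
        corr = v @ t.t()                               # (B, N, 2) per-patch correlations
        patch_map = corr.softmax(dim=-1)[..., 1]       # (B, N) coarse forgery localization map
        logits = corr.mean(dim=1)                      # (B, 2) global pristine/deepfake logits
        return logits, patch_map

# toy usage with random tensors
kfd = KnowledgeGuidedForgeryDetector()
logits, fmap = kfd(torch.randn(2, 196, 768), torch.randn(768), torch.randn(768))
print(logits.shape, fmap.shape)  # torch.Size([2, 2]) torch.Size([2, 196])
```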
Related papers
- LLMs Are Not Yet Ready for Deepfake Image Detection [8.364956401923108]
Vision-language models (VLMs) have emerged as promising tools across various domains. This study focuses on three primary deepfake types: faceswap, reenactment, and synthetic generation. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems.
arXiv Detail & Related papers (2025-06-12T08:27:24Z)
- Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection [84.21257150497254]
We propose a novel knowledge-based prompt learning framework to explore the strong generalization capability of vision-language models for 3D mask presentation attack detection. Experimental results demonstrate that the proposed method achieves state-of-the-art intra- and cross-scenario detection performance.
arXiv Detail & Related papers (2025-05-06T15:09:23Z)
- MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution [36.79921476565535]
VLF-FFD is a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection. EFF++ is a frame-level, explainability-driven extension of the widely used FaceForensics++ dataset. VLF-FFD achieves state-of-the-art (SOTA) performance in both cross-dataset and intra-dataset evaluations.
arXiv Detail & Related papers (2025-05-04T06:58:21Z)
- QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder.
We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
arXiv Detail & Related papers (2025-04-03T18:47:16Z)
- REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding [36.376220619032225]
REF-VLM is an end-to-end framework for unified training of various visual decoding tasks. We construct a large-scale multi-task dataset containing over 100 million multimodal dialogue samples. REF-VLM outperforms other MLLMs across a variety of standard benchmarks.
arXiv Detail & Related papers (2025-03-10T14:59:14Z)
- VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models [14.053424085561296]
Face models with high-quality and controllable attributes pose a significant challenge for Deepfake detection. In this work, we integrate Multimodal Large Language Models (MLLMs) within DM-based face forensics. We propose a fine-grained analysis triad framework called VLForgery that can 1) predict falsified facial images; 2) locate the falsified face regions subjected to partial synthesis; and 3) attribute the synthesis to specific generators.
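For illustration only, the triad of tasks above (detection, localization, attribution) can be pictured as three task heads over shared visual features; the layout and dimensions below are assumptions, not VLForgery's actual architecture.

```python
import torch
import torch.nn as nn

class ForgeryTriadHeads(nn.Module):
    """Illustrative multi-task heads for detection, localization, and attribution."""

    def __init__(self, dim=768, num_generators=5):
        super().__init__()
        self.detect = nn.Linear(dim, 2)                  # real vs. falsified
        self.localize = nn.Linear(dim, 1)                # per-patch forgery score
        self.attribute = nn.Linear(dim, num_generators)  # which generator produced it

    def forward(self, patch_feats):
        # patch_feats: (B, N, dim) shared visual features
        pooled = patch_feats.mean(dim=1)                 # (B, dim) global summary
        return (self.detect(pooled),                     # (B, 2)
                self.localize(patch_feats).squeeze(-1),  # (B, N)
                self.attribute(pooled))                  # (B, num_generators)

heads = ForgeryTriadHeads()
det, loc, attr = heads(torch.randn(2, 196, 768))
print(det.shape, loc.shape, attr.shape)
```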
arXiv Detail & Related papers (2025-03-08T09:55:19Z)
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories.
Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.
We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
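As a rough sketch of the dynamic fusion idea described above, the module below weights per-layer visual features by their relevance to a pooled instruction embedding; the layer count, dimensions, and gating form are illustrative assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class InstructionGuidedAggregator(nn.Module):
    """Illustrative sketch: weight visual features from several encoder layers
    by their relevance to a text-instruction embedding (assumed shapes)."""

    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim * 2, 1)  # scores one (layer summary, instruction) pair

    def forward(self, layer_feats, instr_emb):
        # layer_feats: (L, B, N, dim) features from L encoder layers
        # instr_emb:   (B, dim) pooled instruction embedding
        L, B, N, D = layer_feats.shape
        pooled = layer_feats.mean(dim=2)                           # (L, B, D) per-layer summary
        instr = instr_emb.unsqueeze(0).expand(L, B, D)             # broadcast instruction to each layer
        weights = self.score(torch.cat([pooled, instr], dim=-1))   # (L, B, 1) raw relevance scores
        weights = weights.softmax(dim=0).unsqueeze(-1)             # normalize across layers -> (L, B, 1, 1)
        return (weights * layer_feats).sum(dim=0)                  # (B, N, D) fused visual features

agg = InstructionGuidedAggregator()
fused = agg(torch.randn(4, 2, 196, 768), torch.randn(2, 768))
print(fused.shape)  # torch.Size([2, 196, 768])
```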
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
- Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight [2.290956583394892]
Video anomaly detection (VAD) has witnessed significant advancements through the integration of large language models (LLMs) and vision-language models (VLMs).
This paper presents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024.
arXiv Detail & Related papers (2024-12-24T09:05:37Z)
- ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection [107.86009509291581]
We propose ForgerySleuth to perform comprehensive clue fusion and generate segmentation outputs indicating regions that are tampered with. Our experiments demonstrate the effectiveness of ForgeryAnalysis and show that ForgerySleuth significantly outperforms existing methods in robustness, generalization, and explainability.
arXiv Detail & Related papers (2024-11-29T04:35:18Z)
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both one-to-one comparison and one-to-many search scenarios.
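For context, the one-to-many search setting mentioned above can be reduced to ranking precomputed function embeddings by cosine similarity to a query embedding; the snippet below sketches only that retrieval step, not IRBinDiff's LLVM-IR graph encoder.

```python
import torch
import torch.nn.functional as F

def one_to_many_search(query_emb, candidate_embs, top_k=5):
    """Rank candidate function embeddings by cosine similarity to the query embedding."""
    q = F.normalize(query_emb, dim=-1)       # (D,) query function embedding
    c = F.normalize(candidate_embs, dim=-1)  # (M, D) candidate function embeddings
    scores = c @ q                           # (M,) cosine similarities
    return torch.topk(scores, k=min(top_k, c.size(0)))

scores, indices = one_to_many_search(torch.randn(256), torch.randn(1000, 256))
print(indices.tolist())  # indices of the most similar candidate functions
```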
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.12958154544838]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task. It captures high-order correlations of forged images from diverse linguistic feature spaces. It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z)
- Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection [101.15777242546649]
Open vocabulary object detection (OVD) aims at seeking an optimal object detector capable of recognizing objects from both base and novel categories.
Recent advances leverage knowledge distillation to transfer insightful knowledge from pre-trained large-scale vision-language models to the task of object detection.
We present a novel OVD framework, termed LBP, which learns background prompts to harness explored implicit background knowledge.
arXiv Detail & Related papers (2024-06-01T17:32:26Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
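The region-focusing step described above can be pictured as cropping a high-interest window at native pixel resolution and querying the model again with it; the attention-peak heuristic below is an illustrative assumption, not the Chain-of-Spot procedure.

```python
import torch

def crop_region_of_interest(image, attn_map, crop_size=112):
    """Crop the window around the attention peak, keeping the original pixel resolution."""
    # image: (C, H, W); attn_map: (H, W) relevance scores for the current question
    H, W = attn_map.shape
    flat_idx = attn_map.argmax()
    cy, cx = int(flat_idx // W), int(flat_idx % W)       # peak location
    half = crop_size // 2
    top = max(0, min(cy - half, H - crop_size))          # clamp window to image bounds
    left = max(0, min(cx - half, W - crop_size))
    return image[:, top:top + crop_size, left:left + crop_size]

roi = crop_region_of_interest(torch.rand(3, 224, 224), torch.rand(224, 224))
print(roi.shape)  # torch.Size([3, 112, 112])
```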
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
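As a rough illustration of the alignment objective described above, the snippet below sketches an InfoNCE-style contrastive loss that pulls each document-object feature toward its paired vision-encoder feature; the batch-level formulation and temperature are assumptions, not DoCo's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(obj_feats, vis_feats, temperature=0.07):
    """Illustrative InfoNCE-style loss: pull each document-object feature toward its
    paired vision-encoder feature and push it away from other pairs in the batch."""
    obj = F.normalize(obj_feats, dim=-1)      # (B, D) auxiliary-encoder object features
    vis = F.normalize(vis_feats, dim=-1)      # (B, D) LVLM vision-encoder features
    logits = obj @ vis.t() / temperature      # (B, B) pairwise similarities
    targets = torch.arange(obj.size(0))       # diagonal entries are the positive pairs
    return F.cross_entropy(logits, targets)

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```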
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
- Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions [126.3136109870403]
We introduce a generic and lightweight Visual Prompt Generator Complete module (VPG-C)
VPG-C infers and completes the missing details essential for comprehending demonstrative instructions.
We build DEMON, a comprehensive benchmark for demonstrative instruction understanding.
arXiv Detail & Related papers (2023-08-08T09:32:43Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)