EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning
- URL: http://arxiv.org/abs/2510.16442v1
- Date: Sat, 18 Oct 2025 10:34:05 GMT
- Title: EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning
- Authors: Haoran Sun, Chen Cai, Huiping Zhuang, Kong Aik Lee, Lap-Pui Chau, Yi Wang
- Abstract summary: Deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This paper proposes the explainable deepfake video detection (EDVD) task and designs the EDVD-LLaMA multimodal reasoning framework.
- Score: 58.42596067220998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs the EDVD-LLaMA multimodal large language model (MLLM) reasoning framework, which provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates a Spatio-Temporal Subtle Information Tokenization (ST-SIT) module to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information as input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ benchmark dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision of reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery-method and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The source code and dataset will be publicly available.
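To make the ST-SIT idea more concrete, the minimal sketch below shows one plausible way to fuse global and local cross-frame features into a token sequence for an MLLM. The paper does not publish implementation details, so the module name, dimensions, attention-based temporal mixing, and concatenation fusion here are all assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: a plausible reading of "extract and fuse global and
# local cross-frame deepfake features"; all names and dimensions are assumed.
import torch
import torch.nn as nn


class STSITSketch(nn.Module):
    """Fuses per-frame global features with local (face-region/patch) features
    across time into spatio-temporal tokens an MLLM could consume."""

    def __init__(self, feat_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Cross-frame mixing for the global stream (assumed design choice).
        self.temporal_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        # Projection of the fused visual features into the LLM embedding space.
        self.proj = nn.Linear(feat_dim * 2, llm_dim)

    def forward(self, global_feats: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_feats: (B, T, D) one vector per frame (e.g. a frozen ViT [CLS] token)
        # local_feats:  (B, T, D) pooled face-region/patch features per frame
        g, _ = self.temporal_attn(global_feats, global_feats, global_feats)  # cross-frame context
        fused = torch.cat([g, local_feats], dim=-1)   # (B, T, 2D) simple concat fusion (assumed)
        return self.proj(fused)                       # (B, T, llm_dim) spatio-temporal tokens


# Usage with random features standing in for real frame embeddings.
tokens = STSITSketch()(torch.randn(2, 16, 768), torch.randn(2, 16, 768))
print(tokens.shape)  # torch.Size([2, 16, 4096])
```

In this reading, the resulting tokens would be prepended to the text prompt so the LLM reasons over both the video evidence and the facial-feature constraints described for Fg-MCoT.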
Related papers
- VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning [42.22791607763693]
VideoVeritas is a framework for fine-grained perception and fact-based reasoning. It uses Joint Perception Preference and Perception Pretext Reinforcement Learning.
arXiv Detail & Related papers (2026-02-09T16:00:01Z) - Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method [96.63801368613177]
We present a new task that elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. We also present a new dataset with 8,641 videos, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly understanding. Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making.
arXiv Detail & Related papers (2026-01-15T08:09:04Z) - LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation [32.57236582010967]
Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation.
arXiv Detail & Related papers (2025-12-18T18:52:18Z) - Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection [32.26866389632305]
We introduce MVFNDB (Multi-modal Video Fake News Detection Benchmark) based on empirical analysis. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs' perception, understanding, and reasoning capacities during detection. To validate the impact of combining multiple features on the final results, we design a novel framework named MVFND-CoT.
arXiv Detail & Related papers (2025-10-28T10:04:13Z) - DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning [58.70446237944036]
DAVID-X is the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. We present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning. Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.
arXiv Detail & Related papers (2025-06-13T13:39:53Z) - Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding [63.82450803014141]
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity. We propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips. Our DVD agent achieves SOTA performance, surpassing prior works by a large margin on the challenging LVBench dataset.
arXiv Detail & Related papers (2025-05-23T16:37:36Z) - LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection [14.687867348598035]
Large Vision Language Models (LVLMs) have become an emerging tool for AI-generated content detection. We propose LAVID, a novel LVLM-based AI-generated video detection framework with explicit knowledge enhancement. Our pipeline automatically selects a set of explicit knowledge tools for detection, and then adaptively adjusts the structured prompt by self-rewriting.
arXiv Detail & Related papers (2025-02-20T19:34:58Z) - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z) - CapST: Leveraging Capsule Networks and Temporal Attention for Accurate Model Attribution in Deep-fake Videos [9.209808258321559]
Attributing a deep-fake to its specific generation model or encoder is vital for forensic analysis, enabling source tracing and tailored countermeasures. We investigate the model attribution problem for deep-fake videos using two datasets: Deepfakes from Different Models (DFDM) and GANGen-Detection. We introduce a novel Capsule-Spatial-Temporal (CapST) model that integrates a truncated VGG19 network for feature extraction with capsule networks and temporal attention.
arXiv Detail & Related papers (2023-11-07T08:05:09Z) - Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization [51.206398602941405]
We propose to disentangle an original high-dimensional feature into multiple sub-features.
On top of the disentangled sub-features, we learn an auxiliary feature to enhance the sub-features via mutual information maximization (a minimal sketch of this objective follows the list below).
Our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets a new state-of-the-art on the VCSL benchmark dataset.
arXiv Detail & Related papers (2023-09-13T10:53:12Z)
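As a rough illustration of the feature-disentanglement idea in the last entry above, the sketch below splits one high-dimensional feature into sub-features and maximizes an InfoNCE lower bound on mutual information with a learned auxiliary feature. This is a generic pattern under assumed names, dimensions, and loss choice, not that paper's actual architecture or objective.

```python
# Illustrative sketch only: generic disentanglement + InfoNCE-style mutual
# information maximization; not the cited paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangleSketch(nn.Module):
    def __init__(self, in_dim: int = 1024, sub_dim: int = 256, num_sub: int = 4):
        super().__init__()
        # Split one high-dimensional feature into several lower-dimensional sub-features.
        self.splitters = nn.ModuleList([nn.Linear(in_dim, sub_dim) for _ in range(num_sub)])
        # Auxiliary feature that should stay informative about every sub-feature.
        self.aux_head = nn.Linear(in_dim, sub_dim)

    def forward(self, feat: torch.Tensor):
        subs = [s(feat) for s in self.splitters]  # list of (B, sub_dim)
        aux = self.aux_head(feat)                 # (B, sub_dim)
        return subs, aux


def infonce_mi_lower_bound(sub: torch.Tensor, aux: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    """InfoNCE loss: matching (sub_i, aux_i) pairs in the batch are positives,
    all other pairings are negatives; minimizing it maximizes an MI lower bound."""
    sub = F.normalize(sub, dim=-1)
    aux = F.normalize(aux, dim=-1)
    logits = sub @ aux.t() / temp                           # (B, B) similarity matrix
    targets = torch.arange(sub.size(0), device=sub.device)  # diagonal entries are positives
    return F.cross_entropy(logits, targets)


# Usage: average the MI terms over all sub-features into one training loss.
model = DisentangleSketch()
subs, aux = model(torch.randn(8, 1024))
loss = sum(infonce_mi_lower_bound(s, aux) for s in subs) / len(subs)
print(float(loss))
```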