Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis
- URL: http://arxiv.org/abs/2510.20531v1
- Date: Thu, 23 Oct 2025 13:16:12 GMT
- Title: Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis
- Authors: Lixiong Qin, Yang Zhang, Mei Wang, Jiani Hu, Weihong Deng, Weiran Xu
- Abstract summary: We propose the Fake-in-Facext (FiFa) framework, with contributions focusing on data annotation and model construction. We first define a Facial Image Concept Tree (FICT) to divide facial images into fine-grained regional concepts. Based on this dedicated data annotation, we introduce a novel Artifact-Grounding Explanation (AGE) task. We propose a unified multi-task learning architecture, FiFa-MLLM, to simultaneously support abundant multimodal inputs and outputs for fine-grained Explainable DeepFake Analysis.
- Score: 53.733003768566455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advancement of Multimodal Large Language Models (MLLMs) has bridged the gap between vision and language tasks, enabling the implementation of Explainable DeepFake Analysis (XDFA). However, current methods suffer from a lack of fine-grained awareness: the description of artifacts in data annotation is unreliable and coarse-grained, and the models fail to support the output of connections between textual forgery explanations and the visual evidence of artifacts, as well as the input of queries for arbitrary facial regions. As a result, their responses are not sufficiently grounded in Face Visual Context (Facext). To address this limitation, we propose the Fake-in-Facext (FiFa) framework, with contributions focusing on data annotation and model construction. We first define a Facial Image Concept Tree (FICT) to divide facial images into fine-grained regional concepts, thereby obtaining a more reliable data annotation pipeline, FiFa-Annotator, for forgery explanation. Based on this dedicated data annotation, we introduce a novel Artifact-Grounding Explanation (AGE) task, which generates textual forgery explanations interleaved with segmentation masks of manipulated artifacts. We propose a unified multi-task learning architecture, FiFa-MLLM, to simultaneously support abundant multimodal inputs and outputs for fine-grained Explainable DeepFake Analysis. With multiple auxiliary supervision tasks, FiFa-MLLM can outperform strong baselines on the AGE task and achieve SOTA performance on existing XDFA datasets. The code and data will be made open-source at https://github.com/lxq1000/Fake-in-Facext.
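To make the Facial Image Concept Tree (FICT) idea concrete, here is a minimal sketch of such a hierarchy as a nested structure with a traversal that enumerates its fine-grained leaf regions. The abstract does not specify the actual node set or depth of FICT, so the region names and layout below are illustrative assumptions only, not the paper's taxonomy.

```python
# Hypothetical sketch of a Facial Image Concept Tree (FICT).
# The region names and hierarchy are assumptions for illustration;
# the paper's actual concept tree is not reproduced here.
FICT = {
    "face": {
        "eyes": {"left_eye": {}, "right_eye": {}},
        "nose": {},
        "mouth": {"upper_lip": {}, "lower_lip": {}},
        "skin": {"forehead": {}, "cheeks": {}},
    }
}

def leaf_concepts(tree, prefix=""):
    """Enumerate the fine-grained leaf regions of the concept tree
    as slash-separated paths from the root."""
    leaves = []
    for name, children in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if children:
            leaves.extend(leaf_concepts(children, path))
        else:
            leaves.append(path)
    return leaves

print(leaf_concepts(FICT))
```

Dividing a face into such regional concepts is what would let an annotation pipeline attach forgery explanations (and, in the AGE task, segmentation masks) to specific regions rather than to the whole image.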
Related papers
- MoCha: End-to-End Video Character Replacement without Structural Guidance [14.573557179926079]
MoCha is a framework for replacing video characters with a user-provided identity. We introduce a condition-aware RoPE and employ an RL-based post-training stage. We design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs.
arXiv Detail & Related papers (2026-01-13T14:10:34Z)
- EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning [58.42596067220998]
Deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection methods face issues such as a lack of transparency in their principles and insufficient capabilities to cope with forgery techniques. This paper proposes the explainable deepfake video detection (EDVD) task and designs the EDVD-LLaMA multimodal reasoning framework.
arXiv Detail & Related papers (2025-10-18T10:34:05Z)
- From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts [69.44297222099175]
We introduce a Mixture of Facial Experts (MoFE) that captures distinct but mutually reinforcing aspects of facial attributes. To mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. We have curated and refined a Large Face Angles (LFA) dataset from existing open-source human video datasets.
arXiv Detail & Related papers (2025-08-13T04:10:16Z)
- Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor [32.34399128209528]
We study whether pre-trained text-to-image diffusion models can serve as instruction-aware visual encoders. We find diffusion features are both rich in semantics and can encode strong image-text alignment. We then investigate how to align these features with large language models and uncover a leakage phenomenon.
arXiv Detail & Related papers (2025-07-09T17:59:47Z)
- MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution [36.79921476565535]
VLF-FFD is a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection. EFF++ is a frame-level, explainability-driven extension of the widely used FaceForensics++ dataset. VLF-FFD achieves state-of-the-art (SOTA) performance in both cross-dataset and intra-dataset evaluations.
arXiv Detail & Related papers (2025-05-04T06:58:21Z)
- Towards General Visual-Linguistic Face Forgery Detection (V2) [90.6600794602029]
Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust. Recent works demonstrate that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection. We propose Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by leveraging forgery masks for initial region and type identification.
arXiv Detail & Related papers (2025-02-28T04:15:36Z)
- ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.12958154544838]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task. It captures high-order correlations of forged images from diverse linguistic feature spaces. It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)