Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning
- URL: http://arxiv.org/abs/2511.18104v1
- Date: Sat, 22 Nov 2025 16:05:12 GMT
- Title: Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning
- Authors: Xiaohong Liu, Xiufeng Song, Huayu Zheng, Lei Bai, Xiaoming Liu, Guangtao Zhai
- Abstract summary: Existing methods primarily focus on image-level forgery detection, leaving generic video-level forgery detection largely underexplored. We propose a consolidated multimodal detection algorithm, named MM-Det++, specifically designed for detecting diffusion-generated videos.
- Score: 61.3737746844896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proliferation of videos generated by diffusion models has raised increasing concerns about information security, highlighting the urgent need for reliable detection of synthetic media. Existing methods primarily focus on image-level forgery detection, leaving generic video-level forgery detection largely underexplored. To advance video forensics, we propose a consolidated multimodal detection algorithm, named MM-Det++, specifically designed for detecting diffusion-generated videos. Our approach consists of two innovative branches and a Unified Multimodal Learning (UML) module. Specifically, the Spatio-Temporal (ST) branch employs a novel Frame-Centric Vision Transformer (FC-ViT) to aggregate spatio-temporal information for detecting diffusion-generated videos, where the FC-tokens enable the capture of holistic forgery traces from each video frame. In parallel, the Multimodal (MM) branch adopts a learnable reasoning paradigm to acquire a Multimodal Forgery Representation (MFR) by harnessing the powerful comprehension and reasoning capabilities of Multimodal Large Language Models (MLLMs), discerning forgery traces from a flexible semantic perspective. To integrate the multimodal representations into a coherent space, the UML module is introduced to consolidate the generalization ability of MM-Det++. In addition, we establish a large-scale and comprehensive Diffusion Video Forensics (DVF) dataset to advance research in video forgery detection. Extensive experiments demonstrate the superiority of MM-Det++ and highlight the effectiveness of unified multimodal forgery learning in detecting diffusion-generated videos.
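The two-branch design described in the abstract can be illustrated at a high level. The sketch below is a minimal, hypothetical illustration only: the stand-in functions `st_branch` and `mm_branch` play the roles of the FC-ViT and MLLM branches, and the fusion is a simple random projection into a shared space, not the authors' actual UML module or trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def st_branch(frames):
    # frames: (T, D) array of per-frame "FC-token" features;
    # mean-pool over time as a stand-in for spatio-temporal aggregation
    return frames.mean(axis=0)

def mm_branch(frames, out_dim=64):
    # fixed random projection standing in for an MLLM-derived
    # multimodal forgery representation
    W = rng.standard_normal((frames.shape[1], out_dim))
    return np.tanh(frames.mean(axis=0) @ W)

def unified_fusion(st_feat, mm_feat, out_dim=32):
    # toy "unified multimodal learning": concatenate both branch
    # outputs and project into one shared space (learned in practice)
    z = np.concatenate([st_feat, mm_feat])
    W = rng.standard_normal((z.shape[0], out_dim)) / np.sqrt(z.shape[0])
    return z @ W

def detect(frames):
    z = unified_fusion(st_branch(frames), mm_branch(frames))
    # sigmoid head -> probability that the clip is diffusion-generated
    return 1.0 / (1.0 + np.exp(-z.mean()))

video = rng.standard_normal((16, 128))  # 16 frames, 128-dim token features
print(round(detect(video), 3))
```

In the real system each branch is a trained network and the fusion module is learned end-to-end; the sketch only shows the data flow (per-frame features, two parallel representations, one fused decision).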
Related papers
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models [78.32948112203228]
Video understanding represents the most challenging frontier in computer vision. The recent emergence of Video Large Multimodal Models (Video-LMMs) has demonstrated remarkable capabilities in video understanding tasks. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.
arXiv Detail & Related papers (2025-10-06T17:10:44Z) - Query-Kontext: A Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z) - LMM-Det: Make Large Multimodal Models Excel in Object Detection [0.62914438169038]
We propose LMM-Det, a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis of what happens when a large multimodal model meets object detection, revealing that the recall rate degrades significantly compared with specialist detection models. We claim that a large multimodal model possesses detection capability without any extra detection modules.
arXiv Detail & Related papers (2025-07-24T11:05:24Z) - Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion [7.728348842555291]
The rapid rise of video content on platforms such as TikTok and YouTube has transformed information dissemination. Despite significant efforts to combat hate speech, detecting these videos remains challenging due to their often implicit nature. We present CMFusion, an enhanced multimodal hate video detection model utilizing a novel Channel-wise and Modality-wise Fusion Mechanism.
arXiv Detail & Related papers (2025-05-17T15:24:48Z) - On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection [44.55891118519547]
We propose an innovative algorithm named Multi-Modal Detection (MM-Det) for detecting diffusion-generated content. MM-Det utilizes the profound and comprehensive abilities of Large Multi-modal Models (LMMs) by generating a Multi-Modal Forgery Representation (MMFR). MM-Det achieves state-of-the-art performance in Diffusion Video Forensics (DVF).
arXiv Detail & Related papers (2024-10-31T04:20:47Z) - Investigating Memorization in Video Diffusion Models [58.70363256771246]
Diffusion models, widely used for image and video generation, face a significant limitation: the risk of memorizing and reproducing training data during inference. We first formally define the two types of memorization in VDMs (content memorization and motion memorization) in a practical way. We then introduce new metrics specifically designed to separately assess content and motion memorization in VDMs.
arXiv Detail & Related papers (2024-10-29T02:34:06Z) - Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
The diverse multi-modal masked language modeling is realized by an object divergence constraint upon traditional multi-modal masked language modeling (MLM).
arXiv Detail & Related papers (2023-08-30T08:33:13Z) - Multimodal Short Video Rumor Detection System Based on Contrastive Learning [3.4192832062683842]
Short video platforms in China have gradually evolved into fertile grounds for the proliferation of fake news.
Distinguishing short video rumors poses a significant challenge due to the substantial amount of information and shared features involved.
Our research group proposes a methodology encompassing multimodal feature fusion and the integration of external knowledge.
arXiv Detail & Related papers (2023-04-17T16:07:00Z) - Multimodal Channel-Mixing: Channel and Spatial Masked AutoEncoder on Facial Action Unit Detection [12.509298933267225]
This paper presents a novel multi-modal reconstruction network, named Multimodal Channel-Mixing (MCM) as a pre-trained model to learn robust representation for facilitating multi-modal fusion.
The approach follows an early fusion setup, integrating a Channel-Mixing module, where two out of five channels are randomly dropped.
This module not only reduces channel redundancy, but also facilitates multi-modal learning and reconstruction capabilities, resulting in robust feature learning.
arXiv Detail & Related papers (2022-09-25T15:18:56Z)
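The Channel-Mixing entry above describes randomly dropping two of five stacked channels before reconstruction. A minimal sketch of just that masking step, under the assumption that the input is a `(C, H, W)` array of stacked channels (the function name and shapes are illustrative, not the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

def channel_mix_mask(x, n_drop=2):
    # x: (C, H, W) array of stacked channels (e.g. C = 5 as in the paper's setup);
    # zero out n_drop randomly chosen channels, which a reconstruction
    # network would then be trained to recover
    c = x.shape[0]
    dropped = rng.choice(c, size=n_drop, replace=False)
    masked = x.copy()
    masked[dropped] = 0.0
    return masked, dropped

x = rng.standard_normal((5, 8, 8))
masked, dropped = channel_mix_mask(x)
print(sorted(int(i) for i in dropped))
```

Dropping whole channels (rather than spatial patches alone) is what forces the model to learn cross-channel redundancy, which is the stated source of the robust fused representation.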
This list is automatically generated from the titles and abstracts of the papers in this site.