On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
- URL: http://arxiv.org/abs/2410.23623v1
- Date: Thu, 31 Oct 2024 04:20:47 GMT
- Title: On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
- Authors: Xiufeng Song, Xiao Guo, Jiache Zhang, Qirui Li, Lei Bai, Xiaoming Liu, Guangtao Zhai, Xiaohong Liu
- Abstract summary: We propose an innovative algorithm named Multi-Modal Detection (MM-Det) for detecting diffusion-generated content.
MM-Det utilizes the profound perceptual and comprehensive abilities of Large Multi-modal Models (LMMs) by generating a Multi-Modal Forgery Representation (MMFR).
MM-Det achieves state-of-the-art performance on the Diffusion Video Forensics (DVF) dataset.
- Score: 44.55891118519547
- Abstract: Large numbers of synthesized videos from diffusion models pose threats to information security and authenticity, leading to an increasing demand for generated content detection. However, existing video-level detection algorithms primarily focus on detecting facial forgeries and often fail to identify diffusion-generated content with a diverse range of semantics. To advance the field of video forensics, we propose an innovative algorithm named Multi-Modal Detection (MM-Det) for detecting diffusion-generated videos. MM-Det utilizes the profound perceptual and comprehensive abilities of Large Multi-modal Models (LMMs) by generating a Multi-Modal Forgery Representation (MMFR) from the LMM's multi-modal space, enhancing its ability to detect unseen forgery content. Besides, MM-Det leverages an In-and-Across Frame Attention (IAFA) mechanism for feature augmentation in the spatio-temporal domain, and a dynamic fusion strategy refines the forgery representations before fusion. Moreover, we construct a comprehensive diffusion video dataset, called Diffusion Video Forensics (DVF), covering a wide range of forgery videos. MM-Det achieves state-of-the-art performance on DVF, demonstrating the effectiveness of our algorithm. Both the source code and DVF are available at https://github.com/SparkleXFantasy/MM-Det.
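The IAFA mechanism is only named in the abstract, so below is a minimal PyTorch sketch of the general idea behind in-and-across frame attention: patch tokens first attend within each frame, then each spatial position attends across frames. The tensor layout, module structure, and dimensions are illustrative assumptions rather than the paper's implementation.

```python
# A minimal sketch of in-and-across frame attention, assuming patch tokens of
# shape (batch, frames, patches, dim). The paper's actual IAFA module may
# differ; this only illustrates attending within each frame, then across
# frames at each spatial position.
import torch
import torch.nn as nn

class InAndAcrossFrameAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, p, d = x.shape
        # In-frame attention: patches within a single frame attend to each other.
        h = x.reshape(b * t, p, d)
        q = self.norm1(h)
        h = h + self.intra(q, q, q)[0]
        # Across-frame attention: each spatial position attends over time.
        h = h.reshape(b, t, p, d).permute(0, 2, 1, 3).reshape(b * p, t, d)
        q = self.norm2(h)
        h = h + self.inter(q, q, q)[0]
        return h.reshape(b, p, t, d).permute(0, 2, 1, 3)

tokens = torch.randn(2, 8, 49, 256)  # 2 clips, 8 frames, 7x7 patches, 256-dim
print(InAndAcrossFrameAttention(256)(tokens).shape)  # torch.Size([2, 8, 49, 256])
```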
Related papers
- Investigating Memorization in Video Diffusion Models [58.70363256771246]
Diffusion models, widely used for image and video generation, face a significant limitation: the risk of memorizing and reproducing training data during inference.
We first formally define the two types of memorization in VDMs (content memorization and motion memorization) in a practical way.
We then introduce new metrics specifically designed to separately assess content and motion memorization in VDMs.
arXiv Detail & Related papers (2024-10-29T02:34:06Z)
- Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos [16.34393937800271]
The success of generative models in creating high-quality videos has raised concerns about digital integrity and privacy vulnerabilities.
Recent works to combat Deepfake videos have developed detectors that are highly accurate at identifying GAN-generated samples.
We propose a novel framework for detecting videos synthesized from multiple state-of-the-art (SOTA) generative models.
arXiv Detail & Related papers (2024-06-13T21:52:49Z)
- Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such a multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z)
- Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval [58.17315970207874]
Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships.
Existing methods resort to joint training on both source and target domain videos for cross-domain applications.
We explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences.
arXiv Detail & Related papers (2024-01-24T09:45:40Z)
- DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention [13.671150394943684]
We present a novel multi-modal audio-video framework designed to concurrently process audio and video inputs for deepfake detection tasks.
Our model capitalizes on lip synchronization with input audio through a cross-attention mechanism while extracting visual cues via a fine-tuned VGG-16 network.
arXiv Detail & Related papers (2023-09-12T18:37:05Z)
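As a rough illustration of the lip-audio cross-attention used by DF-TransFusion above, the sketch below lets lip features query audio features so the fused representation reflects (mis)alignment between the two streams; all dimensions and module names are assumptions for illustration, not the paper's architecture.

```python
# A rough sketch of lip-audio cross-attention: visual lip features attend to
# audio features. Feature dimensions and module layout are illustrative
# assumptions, not the DF-TransFusion implementation.
import torch
import torch.nn as nn

class LipAudioCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lip: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Queries come from the lip stream; keys/values from the audio stream.
        fused, _ = self.attn(query=lip, key=audio, value=audio)
        return self.norm(lip + fused)  # residual connection

lip = torch.randn(2, 16, 256)    # 2 clips, 16 lip-crop frames, 256-dim features
audio = torch.randn(2, 40, 256)  # 2 clips, 40 audio frames, 256-dim features
print(LipAudioCrossAttention()(lip, audio).shape)  # torch.Size([2, 16, 256])
```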
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion [30.631733395175765]
Video multimodal fusion aims to integrate multimodal signals in videos.
Videos have longer multimodal sequences with more redundancy and noise in the visual and audio modalities.
We propose a denoising bottleneck fusion model for fine-grained video fusion.
arXiv Detail & Related papers (2023-05-24T02:39:43Z)
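The mutual-information term in the entry above is described only abstractly; a common way to realize such an objective is an InfoNCE-style lower bound between the fused representation and a single modality's features, sketched below. This is a plausible stand-in under that assumption, not the paper's actual loss.

```python
# A generic InfoNCE-style mutual information lower bound, one common way to
# implement an MI-maximization term in multimodal fusion. This is a plausible
# stand-in, not the paper's actual objective.
import torch
import torch.nn.functional as F

def info_nce(fused: torch.Tensor, modality: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """fused, modality: (batch, dim) paired representations."""
    fused = F.normalize(fused, dim=-1)
    modality = F.normalize(modality, dim=-1)
    logits = fused @ modality.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(fused.size(0))        # positives on the diagonal
    # Maximizing MI ~ minimizing this contrastive loss over in-batch negatives.
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```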
- Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE).
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
arXiv Detail & Related papers (2023-04-06T17:59:56Z)
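For the DiffMAE formulation above, the core idea of conditioning a diffusion model on masked input can be sketched as follows: visible patches stay clean as conditioning, masked patches are noised, and the loss is taken only on masked positions. The linear noising and toy denoiser are illustrative simplifications, not the paper's training recipe.

```python
# A minimal sketch of masked denoising in the DiffMAE spirit: noise only the
# masked patches, keep visible patches as clean conditioning, and reconstruct
# the masked positions. Schedule and denoiser are illustrative simplifications.
import torch

def diffmae_training_step(model, patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, dim) tokenized image patches."""
    b, n, d = patches.shape
    mask = torch.rand(b, n) < mask_ratio   # True = masked patch
    t = torch.rand(b, 1, 1)                # random noise level per sample
    noise = torch.randn_like(patches)
    noisy = (1 - t) * patches + t * noise  # simple linear noising
    # Visible patches stay clean and act as conditioning for the denoiser.
    inputs = torch.where(mask.unsqueeze(-1), noisy, patches)
    pred = model(inputs)                   # denoiser predicts clean patches
    # Loss only on masked positions, as in masked autoencoding.
    return ((pred - patches)[mask] ** 2).mean()

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(),
                            torch.nn.Linear(64, 64))
print(diffmae_training_step(model, torch.randn(2, 49, 64)).item())
```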
- VIDM: Video Implicit Diffusion Models [75.90225524502759]
Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse images.
We propose a video generation method based on diffusion models, where the effects of motion are modeled in an implicit condition.
We improve the quality of the generated videos by proposing multiple strategies such as sampling space truncation, robustness penalty, and positional group normalization.
arXiv Detail & Related papers (2022-12-01T02:58:46Z)
- Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition [13.875674649636874]
We propose a Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition.
FAMF aggregates face features and combines them with multi-modal information to identify persons in videos.
We show that introducing an attention mechanism to NetVLAD can effectively decrease the impact of low-quality frames.
arXiv Detail & Related papers (2020-10-19T08:06:40Z)
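The attention mechanism added to NetVLAD in the FAMF entry above can be approximated by a per-frame quality attention applied before pooling, as in this simplified sketch; a real NetVLAD layer would replace the weighted pooling here, and all names and shapes are assumptions.

```python
# A simplified sketch of attention-weighted frame aggregation: per-frame
# quality attention down-weights low-quality frames before pooling. A real
# NetVLAD aggregation would replace the weighted sum; names and dimensions
# are illustrative assumptions.
import torch
import torch.nn as nn

class AttentiveFrameAggregation(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # predicts a quality score per frame

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        """frame_feats: (batch, frames, dim) per-frame face features."""
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (batch, frames, 1)
        # Low-quality frames get small weights and contribute little to the pool.
        return (weights * frame_feats).sum(dim=1)                # (batch, dim)

feats = torch.randn(4, 30, 128)  # 4 face tracks, 30 frames each
print(AttentiveFrameAggregation()(feats).shape)  # torch.Size([4, 128])
```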
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.