Related papers: Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

URL: http://arxiv.org/abs/2412.12278v1
Date: Mon, 16 Dec 2024 19:00:19 GMT
Title: Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content
Authors: Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury,
Abstract summary: Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing.<n>We introduce the underlineUniversal underlineNetwork for underlineIdentifying underlineTampered and synthunderlineEtic videos (textttUNITE) model, which captures full-frame manipulations.<n>textttUNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications.
Score: 20.52229907426726
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the \underline{U}niversal \underline{N}etwork for \underline{I}dentifying \underline{T}ampered and synth\underline{E}tic videos (\texttt{UNITE}) model, which, unlike traditional detectors, captures full-frame manipulations. \texttt{UNITE} extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model's tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that \texttt{UNITE} outperforms state-of-the-art detectors on datasets (in cross-data settings) featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.

Related papers

CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion [62.04833878126661]
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework.<n>We propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic)<n>Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
arXiv Detail & Related papers (2025-11-26T07:27:11Z)
From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts [69.44297222099175]
We introduce a Mixture of Facial Experts (MoFE) that captures distinct but mutually reinforcing aspects of facial attributes.<n>To mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency.<n>We have curated and refined a Large Face Angles (LFA) dataset from existing open-source human video datasets.
arXiv Detail & Related papers (2025-08-13T04:10:16Z)
Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors [2.89209645531276]
Face swapping manipulations in video streams represents an increasing threat in remote video communications.<n>Recent literature proposes to characterize and exploit visual artifacts introduced in video frames by swapping algorithms.<n>This paper investigates the effectiveness of this approach by benchmarking CNN-based data-driven models on two data corpora.
arXiv Detail & Related papers (2025-06-19T17:51:11Z)
FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation [12.894864326299544]
We present a novel tuning-free IPT2V framework by enhancing face knowledge of the pre-trained video model built on diffusion transformers (DiT) In this work, we present a novel tuning-free IPT2V framework by enhancing face knowledge of the pre-trained video model built on diffusion transformers (DiT)
arXiv Detail & Related papers (2025-02-19T06:50:27Z)
Identity-Preserving Text-to-Video Generation by Frequency Decomposition [52.19475797580653]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity.<n>This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in literature.<n>We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video.
arXiv Detail & Related papers (2024-11-26T13:58:24Z)
Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector [15.647035299476894]
This publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. It specifically focuses on improving the global texture module, which extracts feature map correlation.
arXiv Detail & Related papers (2024-08-29T20:26:27Z)
UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization. We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z)
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model. Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos [88.08209394979178]
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations. We introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features.
arXiv Detail & Related papers (2023-12-09T03:16:09Z)
Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
Voice-Face Homogeneity Tells Deepfake [56.334968246631725]
Existing detection approaches contribute to exploring the specific artifacts in deepfake videos. We propose to perform the deepfake detection from an unexplored voice-face matching view. Our model obtains significantly improved performance as compared to other state-of-the-art competitors.
arXiv Detail & Related papers (2022-03-04T09:08:50Z)
Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems. We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN) We show effective features of ViTs are due to flexible receptive and dynamic fields possible via the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
Two-branch Recurrent Network for Isolating Deepfakes in Videos [17.59209853264258]
We present a method for deepfake detection based on a two-branch network structure. One branch propagates the original information, while the other branch suppresses the face content. Our two novel components show promising results on the FaceForensics++, Celeb-DF, and Facebook's DFDC preview benchmarks.
arXiv Detail & Related papers (2020-08-08T01:38:56Z)
Dynamic texture analysis for detecting fake faces in video sequences [6.1356022122903235]
This work explores the analysis of texture-temporal dynamics of the video signal. The goal is to characterizing and distinguishing real fake sequences. We propose to build multiple binary decision on the joint analysis of temporal segments.
arXiv Detail & Related papers (2020-07-30T07:21:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.