CapST: An Enhanced and Lightweight Model Attribution Approach for
Synthetic Videos
- URL: http://arxiv.org/abs/2311.03782v3
- Date: Mon, 22 Jan 2024 14:52:14 GMT
- Title: CapST: An Enhanced and Lightweight Model Attribution Approach for
Synthetic Videos
- Authors: Wasim Ahmad, Yan-Tsung Peng, Yuan-Hao Chang, Gaddisa Olani Ganfure,
Sarwar Khan, Sahibzada Adil Shahzad
- Abstract summary: This paper investigates the model attribution problem of Deepfake videos from a recently proposed dataset, Deepfakes from Different Models (DFDM).
The dataset comprises 6,450 Deepfake videos generated by five distinct models with variations in encoder, decoder, intermediate layer, input resolution, and compression ratio.
Experimental results on the deepfake benchmark dataset (DFDM) demonstrate the efficacy of our proposed method, achieving up to a 4% improvement in accurately categorizing deepfake videos.
- Score: 9.209808258321559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deepfake videos, generated through AI faceswapping techniques, have garnered
considerable attention due to their potential for powerful impersonation
attacks. While existing research primarily focuses on binary classification to
distinguish real from fake videos, determining the specific generation model
behind a fake video is crucial for forensic investigation.
Addressing this gap, this paper investigates the model attribution problem of
Deepfake videos from a recently proposed dataset, Deepfakes from Different
Models (DFDM), derived from various Autoencoder models. The dataset comprises
6,450 Deepfake videos generated by five distinct models with variations in
encoder, decoder, intermediate layer, input resolution, and compression ratio.
This study formulates Deepfake model attribution as a multiclass
classification task, proposing a segment of VGG19, known for its effectiveness
in image-related tasks, as the feature-extraction backbone, integrated with a
Capsule Network and a Spatio-Temporal attention mechanism. The Capsule module
captures intricate hierarchies among features for robust identification of
deepfake attributes. Additionally, the video-level fusion technique leverages
temporal attention mechanisms to handle concatenated feature vectors,
capitalizing on inherent temporal dependencies in deepfake videos. By
aggregating insights across frames, our model gains a comprehensive
understanding of video content, resulting in more precise predictions.
Experimental results on the deepfake benchmark dataset (DFDM) demonstrate the
efficacy of our proposed method, achieving up to a 4% improvement in accurately
categorizing deepfake videos compared to baseline models while demanding fewer
computational resources.
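To make the described pipeline concrete, here is a minimal PyTorch sketch of a CapST-style attribution model. The VGG19 cut point, capsule sizes, and attention design are illustrative assumptions (the abstract does not specify them), and the iterative routing-by-agreement step of Capsule Networks is omitted for brevity; read it as a sketch of the idea, not the authors' implementation.

```python
# Minimal sketch of a CapST-style attribution model. The VGG19 cut point,
# capsule configuration, and attention design are ASSUMPTIONS for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class CapSTSketch(nn.Module):
    def __init__(self, num_models=5, num_capsules=8, capsule_dim=16):
        super().__init__()
        # Early VGG19 segment as the frame-level feature backbone
        # (first two conv blocks; the paper's exact cut is an assumption).
        self.backbone = nn.Sequential(*vgg19(weights="DEFAULT").features[:10])
        # Primary capsules: project conv features into capsule vectors.
        self.primary_caps = nn.Conv2d(128, num_capsules * capsule_dim,
                                      kernel_size=3, stride=2)
        self.num_capsules, self.capsule_dim = num_capsules, capsule_dim
        feat_dim = num_capsules * capsule_dim
        # Temporal attention scores one weight per frame for video-level fusion.
        self.temporal_attn = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_models)

    @staticmethod
    def squash(s, dim=-1):
        # Capsule nonlinearity: short vectors shrink toward zero,
        # long vectors saturate just below unit length.
        n2 = (s ** 2).sum(dim=dim, keepdim=True)
        return (n2 / (1.0 + n2)) * s / (n2.sqrt() + 1e-8)

    def forward(self, clip):                       # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        x = self.backbone(clip.flatten(0, 1))      # (B*T, 128, h, w)
        caps = self.primary_caps(x).flatten(2).mean(-1)           # (B*T, C*D)
        caps = self.squash(caps.view(B * T, self.num_capsules,
                                     self.capsule_dim))
        frame_feats = caps.flatten(1).view(B, T, -1)              # (B, T, C*D)
        # Video-level fusion: softmax attention over the T frame descriptors.
        attn = F.softmax(self.temporal_attn(frame_feats), dim=1)  # (B, T, 1)
        video_feat = (attn * frame_feats).sum(dim=1)              # (B, C*D)
        return self.classifier(video_feat)   # logits over the 5 source models

# Usage: logits = CapSTSketch()(torch.randn(2, 8, 3, 112, 112))
```

The point worth noting is the two-stage fusion: capsules organize frame-level features into part-whole vectors, and the temporal attention then weights frames unevenly, so a few frames carrying strong model-specific artifacts can dominate the video-level prediction.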
Related papers
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection [54.88903878778194]
The field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content.
We propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos.
Our method achieves high detection accuracy, above 90% on average, underscoring its effectiveness.
arXiv Detail & Related papers (2025-07-17T15:36:39Z) - FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes [9.462613446025001]
Face-swap Deepfake videos pose growing risks to digital security, privacy, and media integrity.
FAME is a framework designed to capture subtle artifacts specific to different face-generative models.
Results show that FAME consistently outperforms existing methods in both accuracy and runtime.
arXiv Detail & Related papers (2025-06-13T05:47:09Z) - AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo, which reduces inference steps to accelerate video diffusion models using a synthetic dataset.
Our model achieves 8.5x improvements in generation speed compared to the teacher model.
Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z) - Deepfake Detection with Spatio-Temporal Consistency and Attention [46.1135899490656]
Deepfake videos are causing growing concerns among communities due to their ever-increasing realism.
Current methods for detecting forged videos rely mainly on global frame features.
We propose a neural Deepfake detector that focuses on the localized manipulative signatures of the forged videos.
arXiv Detail & Related papers (2025-02-12T08:51:33Z) - Pre-training for Action Recognition with Automatically Generated Fractal Datasets [23.686476742398973]
We present methods to automatically produce large-scale datasets of short synthetic video clips.
The generated video clips are characterized by notable variety, stemming from the innate ability of fractals to generate complex multi-scale structures.
Compared to standard Kinetics pre-training, our reported results come close and are even superior on a portion of downstream datasets.
arXiv Detail & Related papers (2024-11-26T16:51:11Z) - UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z) - SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters; a minimal Sinkhorn sketch appears after this list.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmarking task and find that most of the models have difficulty effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z) - Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos [16.34393937800271]
The success of generative models in creating high-quality videos has raised concerns about digital integrity and privacy vulnerabilities.
Recent work to combat Deepfake videos has developed detectors that are highly accurate at identifying GAN-generated samples.
We propose a novel framework for detecting videos synthesized from multiple state-of-the-art (SOTA) generative models.
arXiv Detail & Related papers (2024-06-13T21:52:49Z) - AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting
Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos only utilize visual modality or audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z) - Video Infringement Detection via Feature Disentanglement and Mutual
Information Maximization [51.206398602941405]
We propose to disentangle an original high-dimensional feature into multiple sub-features.
On top of the disentangled sub-features, we learn an auxiliary feature to enhance the sub-features.
Our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets the new state-of-the-art on the VCSL benchmark dataset.
arXiv Detail & Related papers (2023-09-13T10:53:12Z) - Quality-Agnostic Deepfake Detection with Intra-model Collaborative
Learning [26.517887637150594]
Deepfake has recently raised a plethora of societal concerns over its possible security threats and dissemination of fake information.
Most SOTA approaches are limited by using a single specific model for detecting a certain deepfake video quality type.
We propose a universal intra-model collaborative learning framework to enable effective and simultaneous detection of deepfakes of different qualities.
arXiv Detail & Related papers (2023-09-12T02:01:31Z) - Deepfake Video Detection Using Generative Convolutional Vision
Transformer [3.8297637120486496]
We propose a Generative Convolutional Vision Transformer (GenConViT) for deepfake video detection.
Our model combines ConvNeXt and Swin Transformer models for feature extraction.
By learning from the visual artifacts and latent data distribution, GenConViT achieves improved performance in detecting a wide range of deepfake videos.
arXiv Detail & Related papers (2023-07-13T19:27:40Z) - Deep Convolutional Pooling Transformer for Deepfake Detection [54.10864860009834]
We propose a deep convolutional Transformer to incorporate decisive image features both locally and globally.
Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy.
The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.
arXiv Detail & Related papers (2022-09-12T15:05:41Z) - Revisiting Classifier: Transferring Vision-Language Models for Video
Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize the well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z) - The Effectiveness of Temporal Dependency in Deepfake Video Detection [0.0]
This paper investigates whether temporal information can improve the deepfake detection performance of deep learning models.
We find that temporal dependency produces a statistically significant increase in the model's performance when classifying real images.
arXiv Detail & Related papers (2022-05-13T14:39:25Z) - Model Attribution of Face-swap Deepfake Videos [39.771800841412414]
We first introduce a new dataset with DeepFakes from Different Models (DFDM) based on several Autoencoder models.
Specifically, five generation models with variations in encoder, decoder, intermediate layer, input resolution, and compression ratio have been used to generate a total of 6,450 Deepfake videos.
We take Deepfakes model attribution as a multiclass classification task and propose a spatial and temporal attention based method to explore the differences among Deepfakes.
arXiv Detail & Related papers (2022-02-25T20:05:18Z) - Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis [69.09526348527203]
Deep generative models have led to highly realistic media, known as deepfakes, that are often indistinguishable from real media to the human eye.
We propose a novel fake detection method designed to re-synthesize testing images and extract visual cues for detection.
We demonstrate the improved effectiveness, cross-GAN generalization, and robustness against perturbations of our approach in a variety of detection scenarios.
arXiv Detail & Related papers (2021-05-29T21:22:24Z) - Improving the Efficiency and Robustness of Deepfakes Detection through
Precise Geometric Features [13.033517345182728]
Deepfakes are a class of malicious techniques that transplant a target face onto the original one in videos.
Previous efforts for Deepfakes videos detection mainly focused on appearance features, which have a risk of being bypassed by sophisticated manipulation.
We propose an efficient and robust framework named LRNet for detecting Deepfakes videos through temporal modeling on precise geometric features.
arXiv Detail & Related papers (2021-04-09T16:57:55Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z) - Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth-guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z) - Sharp Multiple Instance Learning for DeepFake Video Detection [54.12548421282696]
We introduce a new problem of partial face attack in DeepFake video, where only video-level labels are provided but not all the faces in the fake videos are manipulated.
A sharp MIL (S-MIL) is proposed that builds a direct mapping from instance embeddings to bag prediction.
Experiments on FFPMS and widely used DFDC dataset verify that S-MIL is superior to other counterparts for partially attacked DeepFake video detection.
arXiv Detail & Related papers (2020-08-11T08:52:17Z)
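As a side note on the SIGMA entry above, distributing features evenly across a limited number of clusters is the kind of balanced-assignment constraint typically enforced with Sinkhorn-Knopp normalization. Below is a minimal, generic sketch of that mechanism; SIGMA's actual objective, temperature, and iteration count may differ.

```python
# Generic Sinkhorn-Knopp balanced assignment; SIGMA's exact formulation
# (temperature, iterations, objective) is an ASSUMPTION here.
import torch

def sinkhorn_assign(scores: torch.Tensor, eps: float = 0.05, iters: int = 3):
    """Balanced soft assignment of N features to K clusters.

    scores: (N, K) feature-to-cluster similarities (e.g. dot products).
    Returns an (N, K) matrix whose rows are soft assignments and whose
    columns each receive ~N/K total mass, i.e. clusters are used evenly.
    """
    Q = torch.exp(scores / eps)        # sharpen scores with temperature eps
    Q = Q / Q.sum()                    # normalize to a joint distribution
    N, K = Q.shape
    for _ in range(iters):             # alternate column/row rescaling
        Q = Q / (K * Q.sum(dim=0, keepdim=True))   # columns sum to 1/K
        Q = Q / (N * Q.sum(dim=1, keepdim=True))   # rows sum to 1/N
    return Q * N                       # each row is now a distribution

# Usage: A = sinkhorn_assign(torch.randn(64, 8)); A.sum(0) is ~8 per cluster
```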