CapST: Leveraging Capsule Networks and Temporal Attention for Accurate Model Attribution in Deep-fake Videos
- URL: http://arxiv.org/abs/2311.03782v4
- Date: Thu, 12 Jun 2025 08:51:28 GMT
- Title: CapST: Leveraging Capsule Networks and Temporal Attention for Accurate Model Attribution in Deep-fake Videos
- Authors: Wasim Ahmad, Yan-Tsung Peng, Yuan-Hao Chang, Gaddisa Olani Ganfure, Sarwar Khan
- Abstract summary: Attributing a deep-fake to its specific generation model or encoder is vital for forensic analysis, enabling source tracing and tailored countermeasures. We investigate the model attribution problem for deep-fake videos using two datasets: Deepfakes from Different Models (DFDM) and GANGen-Detection. We introduce a novel Capsule-Spatial-Temporal (CapST) model that integrates a truncated VGG19 network for feature extraction, capsule networks for hierarchical encoding, and a spatio-temporal attention mechanism.
- Score: 9.209808258321559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep-fake videos, generated through AI face-swapping techniques, have gained significant attention due to their potential for impactful impersonation attacks. While most research focuses on real vs. fake detection, attributing a deep-fake to its specific generation model or encoder is vital for forensic analysis, enabling source tracing and tailored countermeasures. This enhances detection by leveraging model-specific artifacts and supports proactive defenses. We investigate the model attribution problem for deep-fake videos using two datasets: Deepfakes from Different Models (DFDM) and GANGen-Detection, both comprising deep-fake videos and GAN-generated images. We use only fake images from GANGen-Detection to align with DFDM's focus on attribution rather than binary classification. We formulate the task as a multiclass classification problem and introduce a novel Capsule-Spatial-Temporal (CapST) model that integrates a truncated VGG19 network for feature extraction, capsule networks for hierarchical encoding, and a spatio-temporal attention mechanism. Video-level fusion captures temporal dependencies across frames. Experiments on DFDM and GANGen-Detection show CapST outperforms baseline models in attribution accuracy while reducing computational cost.
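The abstract gives enough architectural detail for a rough sketch. Below is a minimal, hypothetical PyTorch rendering of a CapST-style pipeline (truncated VGG19 features per frame, a capsule-style squash, attention across frames, and video-level fusion for classification). The layer cut-point, capsule sizes, attention design, and the five-class output (matching DFDM's five generation models) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a CapST-style attribution model (PyTorch).
# Layer sizes, capsule dimensions, and the attention design are assumptions;
# the paper's actual implementation may differ.
import torch
import torch.nn as nn
from torchvision.models import vgg19

class CapSTSketch(nn.Module):
    def __init__(self, num_classes=5, num_capsules=8, capsule_dim=16):
        super().__init__()
        # Truncated VGG19: keep layers through conv3_4 + ReLU (256 channels)
        # as a frame-level feature extractor.
        self.backbone = vgg19(weights=None).features[:18]
        # Capsule-style projection: group channels into capsules, then squash.
        self.capsule_proj = nn.Conv2d(256, num_capsules * capsule_dim, 3, padding=1)
        self.num_capsules, self.capsule_dim = num_capsules, capsule_dim
        # Attention over per-frame capsule summaries (temporal axis).
        self.attn = nn.MultiheadAttention(embed_dim=num_capsules * capsule_dim,
                                          num_heads=4, batch_first=True)
        self.classifier = nn.Linear(num_capsules * capsule_dim, num_classes)

    @staticmethod
    def squash(x, dim=-1, eps=1e-8):
        # Capsule "squash": preserves direction, bounds vector length in [0, 1).
        norm2 = (x ** 2).sum(dim=dim, keepdim=True)
        return (norm2 / (1.0 + norm2)) * x / torch.sqrt(norm2 + eps)

    def forward(self, video):                # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)         # (B*T, 3, H, W)
        feats = self.backbone(frames)        # (B*T, 256, h, w)
        caps = self.capsule_proj(feats)      # (B*T, num_caps*dim, h, w)
        caps = caps.flatten(2).mean(-1)      # pool spatial positions
        caps = self.squash(caps.view(b * t, self.num_capsules, self.capsule_dim))
        seq = caps.flatten(1).view(b, t, -1) # per-frame capsule summaries
        attended, _ = self.attn(seq, seq, seq)         # attention across frames
        return self.classifier(attended.mean(dim=1))   # video-level fusion

logits = CapSTSketch()(torch.randn(2, 8, 3, 112, 112))  # 2 clips, 8 frames each
print(logits.shape)  # torch.Size([2, 5])
```

Pooling spatial positions before the attention keeps the sketch compact; the paper's spatio-temporal attention presumably operates on richer spatial maps.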
Related papers
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection [54.88903878778194]
The field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content.
We propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos.
Our method achieves high detection accuracy, above 90% on average, underscoring its effectiveness.
arXiv Detail & Related papers (2025-07-17T15:36:39Z)
- FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes [9.462613446025001]
Face-swap Deepfake videos pose growing risks to digital security, privacy, and media integrity.
FAME is a framework designed to capture subtle artifacts specific to different face-generative models.
Results show that FAME consistently outperforms existing methods in both accuracy and runtime.
arXiv Detail & Related papers (2025-06-13T05:47:09Z)
- AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo, which reduces the number of inference steps to accelerate video diffusion models using a synthetic dataset.
Our model achieves 8.5x improvements in generation speed compared to the teacher model.
Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z)
- Deepfake Detection with Spatio-Temporal Consistency and Attention [46.1135899490656]
Deepfake videos are causing growing concerns among communities due to their ever-increasing realism.
Current methods for detecting forged videos rely mainly on global frame features.
We propose a neural Deepfake detector that focuses on the localized manipulative signatures of the forged videos.
arXiv Detail & Related papers (2025-02-12T08:51:33Z)
- Pre-training for Action Recognition with Automatically Generated Fractal Datasets [23.686476742398973]
We present methods to automatically produce large-scale datasets of short synthetic video clips.
The generated video clips are characterized by notable variety, stemming from the innate ability of fractals to generate complex multi-scale structures.
Compared to standard Kinetics pre-training, our reported results come close and are even superior on a portion of downstream datasets.
arXiv Detail & Related papers (2024-11-26T16:51:11Z)
- UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters (a minimal Sinkhorn sketch follows this entry).
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
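A short aside on the mechanism SIGMA's summary names: Sinkhorn-Knopp normalization turns feature-to-prototype similarities into soft assignments whose clusters are used evenly. A minimal sketch under assumed temperature and iteration settings, not the paper's:

```python
# Minimal Sinkhorn-Knopp sketch (PyTorch): turn feature-to-cluster similarities
# into a soft assignment whose clusters receive equal total mass.
# Temperature and iteration count are illustrative assumptions.
import torch
import torch.nn.functional as F

def sinkhorn(scores, eps=0.05, iters=3):
    # scores: (N, K) similarities between N space-time tube features and K prototypes
    q = torch.exp(scores / eps)          # positive kernel
    q /= q.sum()                         # normalize to a joint distribution
    n, k = q.shape
    for _ in range(iters):
        q /= q.sum(dim=0, keepdim=True)  # each cluster gets equal total mass...
        q /= k
        q /= q.sum(dim=1, keepdim=True)  # ...and each tube is fully assigned
        q /= n
    return q * n                         # rows sum to ~1: soft assignment per tube

feats = F.normalize(torch.randn(128, 64), dim=1)    # 128 tube features
protos = F.normalize(torch.randn(16, 64), dim=1)    # 16 cluster prototypes
assignments = sinkhorn(feats @ protos.T)
print(assignments.sum(dim=1)[:3])   # ~1.0 each: valid per-tube assignments
print(assignments.sum(dim=0))       # ~8.0 each: clusters used evenly (128/16)
```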
- VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
- Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos [16.34393937800271]
The success of generative models in creating high-quality videos has raised concerns about digital integrity and privacy vulnerabilities.
Recent works to combat Deepfake videos have developed detectors that are highly accurate at identifying GAN-generated samples.
We propose a novel framework for detecting videos synthesized from multiple state-of-the-art (SOTA) generative models.
arXiv Detail & Related papers (2024-06-13T21:52:49Z)
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization [51.206398602941405]
We propose to disentangle an original high-dimensional feature into multiple sub-features.
On top of the disentangled sub-features, we learn an auxiliary feature to enhance the sub-features.
Our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets the new state-of-the-art on the VCSL benchmark dataset.
arXiv Detail & Related papers (2023-09-13T10:53:12Z)
- Quality-Agnostic Deepfake Detection with Intra-model Collaborative Learning [26.517887637150594]
Deepfake has recently raised a plethora of societal concerns over its possible security threats and dissemination of fake information.
Most SOTA approaches are limited by using a single specific model for detecting a certain deepfake video quality type.
We propose a universal intra-model collaborative learning framework to enable effective and simultaneous detection of deepfakes of different qualities.
arXiv Detail & Related papers (2023-09-12T02:01:31Z)
- Deepfake Video Detection Using Generative Convolutional Vision Transformer [3.8297637120486496]
We propose a Generative Convolutional Vision Transformer (GenConViT) for deepfake video detection.
Our model combines ConvNeXt and Swin Transformer models for feature extraction.
By learning from the visual artifacts and latent data distribution, GenConViT achieves improved performance in detecting a wide range of deepfake videos.
arXiv Detail & Related papers (2023-07-13T19:27:40Z)
- Deep Convolutional Pooling Transformer for Deepfake Detection [54.10864860009834]
We propose a deep convolutional Transformer to incorporate decisive image features both locally and globally.
Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy.
The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.
arXiv Detail & Related papers (2022-09-12T15:05:41Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- The Effectiveness of Temporal Dependency in Deepfake Video Detection [0.0]
This paper investigates whether temporal information can improve the deepfake detection performance of deep learning models.
We find that temporal dependency produces a statistically significant increase in the model's performance when classifying real images.
arXiv Detail & Related papers (2022-05-13T14:39:25Z)
- Model Attribution of Face-swap Deepfake Videos [39.771800841412414]
We first introduce a new dataset with DeepFakes from Different Models (DFDM) based on several Autoencoder models.
Specifically, five generation models with variations in encoder, decoder, intermediate layer, input resolution, and compression ratio have been used to generate a total of 6,450 Deepfake videos.
We take Deepfakes model attribution as a multiclass classification task and propose a spatial and temporal attention based method to explore the differences among Deepfakes.
arXiv Detail & Related papers (2022-02-25T20:05:18Z)
- Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis [69.09526348527203]
Deep generative models have led to highly realistic media, known as deepfakes, that are often indistinguishable from real media to the human eye.
We propose a novel fake-detection method designed to re-synthesize testing images and extract visual cues for detection.
We demonstrate the improved effectiveness, cross-GAN generalization, and robustness against perturbations of our approach in a variety of detection scenarios.
arXiv Detail & Related papers (2021-05-29T21:22:24Z)
- Improving the Efficiency and Robustness of Deepfakes Detection through Precise Geometric Features [13.033517345182728]
Deepfakes are a family of malicious techniques that transplant a target face onto the original one in videos.
Previous efforts for Deepfakes videos detection mainly focused on appearance features, which have a risk of being bypassed by sophisticated manipulation.
We propose an efficient and robust framework named LRNet for detecting Deepfakes videos through temporal modeling on precise geometric features.
arXiv Detail & Related papers (2021-04-09T16:57:55Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers (a minimal tubelet-tokenization sketch follows this entry).
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
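The spatio-temporal tokens mentioned above are commonly produced by "tubelet embedding": a 3D convolution with kernel equal to stride, so each non-overlapping space-time patch becomes one token. A minimal sketch with illustrative sizes, not the paper's configuration:

```python
# Minimal tubelet-embedding sketch (PyTorch): a 3D conv turns non-overlapping
# space-time patches ("tubelets") into tokens for a transformer encoder.
# Tubelet size, embed dim, and encoder depth are illustrative assumptions.
import torch
import torch.nn as nn

class TubeletTokenizer(nn.Module):
    def __init__(self, embed_dim=192, tubelet=(2, 16, 16)):
        super().__init__()
        # kernel == stride: each tubelet maps to exactly one token
        self.proj = nn.Conv3d(3, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                     # video: (B, 3, T, H, W)
        tokens = self.proj(video)                 # (B, D, T/2, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)  # (B, num_tokens, D)

tokenizer = TubeletTokenizer()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True),
    num_layers=2,
)
video = torch.randn(2, 3, 8, 64, 64)   # 2 clips, 8 frames each
tokens = tokenizer(video)              # (2, 4*4*4 = 64 tokens, 192)
print(encoder(tokens).shape)           # torch.Size([2, 64, 192])
```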
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition which is termed as AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
- Sharp Multiple Instance Learning for DeepFake Video Detection [54.12548421282696]
We introduce a new problem of partial face attack in DeepFake video, where only video-level labels are provided but not all the faces in the fake videos are manipulated.
A sharp MIL (S-MIL) is proposed, which builds a direct mapping from instance embeddings to bag predictions (a minimal pooling sketch follows this entry).
Experiments on FFPMS and widely used DFDC dataset verify that S-MIL is superior to other counterparts for partially attacked DeepFake video detection.
arXiv Detail & Related papers (2020-08-11T08:52:17Z)
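To make the instance-to-bag mapping above concrete, here is a hedged sketch of one way to pool per-face instance logits into a video-level prediction sharply, via temperature-scaled log-sum-exp; the dimensions and pooling choice are illustrative assumptions, not the paper's S-MIL formulation.

```python
# Minimal MIL-pooling sketch (PyTorch): instance embeddings -> bag prediction.
# The sharp log-sum-exp pooling (temperature `tau`) approximates max-pooling,
# so a handful of manipulated faces can drive the video-level score.
# Dimensions and tau are illustrative assumptions, not the paper's S-MIL.
import torch
import torch.nn as nn

class SharpMILHead(nn.Module):
    def __init__(self, dim=128, tau=0.1):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-instance fakeness logit
        self.tau = tau

    def forward(self, instances):       # instances: (B, N, dim) face embeddings
        logits = self.score(instances).squeeze(-1)          # (B, N)
        # Sharp pooling: as tau -> 0 this recovers the max over instances.
        bag_logit = self.tau * torch.logsumexp(logits / self.tau, dim=1)
        return torch.sigmoid(bag_logit)                     # video-level fake prob

head = SharpMILHead()
faces = torch.randn(4, 30, 128)   # 4 videos, 30 face crops each
print(head(faces).shape)          # torch.Size([4])
```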