CAST: Cross-Attentive Spatio-Temporal feature fusion for Deepfake detection
- URL: http://arxiv.org/abs/2506.21711v1
- Date: Thu, 26 Jun 2025 18:51:17 GMT
- Title: CAST: Cross-Attentive Spatio-Temporal feature fusion for Deepfake detection
- Authors: Aryan Thakre, Omkar Nagwekar, Vedang Talekar, Aparna Santra Biswas
- Abstract summary: CNNs are effective at capturing spatial artifacts, and Transformers excel at modeling temporal inconsistencies. We propose a unified CAST model that leverages cross-attention to effectively fuse spatial and temporal features. We evaluate the performance of our model using the FaceForensics++, Celeb-DF, and DeepfakeDetection datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deepfakes have emerged as a significant threat to digital media authenticity, increasing the need for advanced detection techniques that can identify subtle and time-dependent manipulations. CNNs are effective at capturing spatial artifacts, and Transformers excel at modeling temporal inconsistencies. However, many existing CNN-Transformer models process spatial and temporal features independently. In particular, attention-based methods often use separate attention mechanisms for spatial and temporal features and combine them using naive approaches like averaging, addition, or concatenation, which limits the depth of spatio-temporal interaction. To address this challenge, we propose a unified CAST model that leverages cross-attention to effectively fuse spatial and temporal features in a more integrated manner. Our approach allows temporal features to dynamically attend to relevant spatial regions, enhancing the model's ability to detect fine-grained, time-evolving artifacts such as flickering eyes or warped lips. This design enables more precise localization and deeper contextual understanding, leading to improved performance across diverse and challenging scenarios. We evaluate the performance of our model using the FaceForensics++, Celeb-DF, and DeepfakeDetection datasets in both intra- and cross-dataset settings to affirm the superiority of our approach. Our model achieves strong performance with an AUC of 99.49 percent and an accuracy of 97.57 percent in intra-dataset evaluations. In cross-dataset testing, it demonstrates impressive generalization by achieving a 93.31 percent AUC on the unseen DeepfakeDetection dataset. These results highlight the effectiveness of cross-attention-based feature fusion in enhancing the robustness of deepfake video detection.
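The abstract's core mechanism — temporal features attending to relevant spatial regions via cross-attention — can be pictured with a short sketch. The PyTorch module below is a minimal illustration of that fusion pattern, not the authors' implementation; the token layout, dimensions, and the use of nn.MultiheadAttention are all assumptions:

```python
import torch
import torch.nn as nn

class CrossAttentiveFusion(nn.Module):
    """Sketch: temporal tokens (queries) attend to per-frame spatial
    patch tokens (keys/values), fusing the two feature streams."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, 2)  # real vs. fake

    def forward(self, temporal_feats, spatial_feats):
        # temporal_feats: (B, T, dim)   one token per frame, e.g. from a Transformer
        # spatial_feats:  (B, T*P, dim) P patch tokens per frame, e.g. from a CNN
        fused, _ = self.attn(query=temporal_feats,
                             key=spatial_feats,
                             value=spatial_feats)
        fused = self.norm(fused + temporal_feats)   # residual connection
        return self.classifier(fused.mean(dim=1))   # pool over time

# toy usage with hypothetical shapes
model = CrossAttentiveFusion()
t = torch.randn(2, 16, 256)        # 16 frames
s = torch.randn(2, 16 * 49, 256)   # 7x7 patches per frame
logits = model(t, s)               # (2, 2)
```

Because the temporal tokens act as queries, each frame-level feature can weight the spatial patches (eye or lip regions, say) most relevant to its time step, which is the interaction the abstract contrasts with naive averaging or concatenation.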
Related papers
- Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking
Real-time video analysis remains a challenging problem in computer vision.
We present a unified framework that leverages advanced spatial-temporal modeling techniques for simultaneous action recognition and object tracking.
Our method achieves state-of-the-art performance on standard benchmarks while maintaining real-time inference speeds.
arXiv Detail & Related papers (2025-07-30T06:49:11Z)
- Probing Deep into Temporal Profile Makes the Infrared Small Target Detector Much Better
Infrared small target (IRST) detection is challenging in simultaneously achieving precise, universal, robust and efficient performance.
Current learning-based methods attempt to leverage more information from both the spatial and the short-term temporal domains.
We propose an efficient deep temporal probe network (DeepPro) that only performs calculations in the time dimension for IRST detection.
arXiv Detail & Related papers (2025-06-15T08:19:32Z)
- FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes
Face-swap Deepfake videos pose growing risks to digital security, privacy, and media integrity.
FAME is a framework designed to capture subtle artifacts specific to different face-generative models.
Results show that FAME consistently outperforms existing methods in both accuracy and runtime.
arXiv Detail & Related papers (2025-06-13T05:47:09Z)
- A Decade of Deep Learning for Remote Sensing Spatiotemporal Fusion: Advances, Challenges, and Opportunities
This paper presents the first comprehensive survey of deep learning advances in remote sensing spatiotemporal fusion (STF) over the past decade.
We establish a taxonomy of deep learning architectures including CNNs, Transformers, Generative Adversarial Networks (GANs), diffusion models, and sequence models.
We identify five critical challenges: time-space conflicts, generalization across datasets, computational efficiency for large-scale processing, multi-source heterogeneous fusion, and insufficient benchmark diversity.
arXiv Detail & Related papers (2025-04-01T15:30:48Z)
- Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models
Deepfake techniques for facial synthesis and editing built on generative models pose serious risks.
In this paper, we investigate how detection performance varies across model backbones, types, and datasets.
We introduce Contrastive Blur, which enhances performance on facial images, and MINDER, which addresses noise type bias, balancing performance across domains.
arXiv Detail & Related papers (2024-11-28T13:04:45Z)
- YOLO-ELA: Efficient Local Attention Modeling for High-Performance Real-Time Insulator Defect Detection
Existing detection methods for insulator defect identification from unmanned aerial vehicles struggle with complex background scenes and small objects.
This paper proposes a new attention-based foundation architecture, YOLO-ELA, to address this issue.
Experimental results on high-resolution UAV images show that our method achieved a state-of-the-art performance of 96.9% mAP0.5 and a real-time detection speed of 74.63 frames per second.
arXiv Detail & Related papers (2024-10-15T16:00:01Z)
- Spiking Transformer with Spatial-Temporal Attention
Spike-based Transformers present a compelling and energy-efficient alternative to traditional Artificial Neural Network (ANN)-based Transformers.
We propose Spiking Transformer with Spatial-Temporal Attention (STAtten), a simple and straightforward architecture that efficiently integrates both spatial and temporal information in the self-attention mechanism.
Our method can be seamlessly integrated into existing spike-based transformers without architectural overhaul.
arXiv Detail & Related papers (2024-09-29T20:29:39Z)
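The STAtten entry above describes integrating spatial and temporal information inside a single self-attention step rather than in separate passes. A non-spiking simplification of that idea — flattening frames and patches into one token sequence so attention spans space and time jointly — is sketched below; the names and sizes are assumptions, and genuine spike-based attention differs substantially:

```python
import torch
import torch.nn as nn

class SpatialTemporalSelfAttention(nn.Module):
    """Joint attention over all (frame, patch) tokens at once,
    instead of separate spatial and temporal attention stages."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, P, dim) -> one sequence of T*P spatio-temporal tokens
        b, t, p, d = x.shape
        tokens = x.reshape(b, t * p, d)
        out, _ = self.attn(tokens, tokens, tokens)  # every token sees all frames and patches
        return out.reshape(b, t, p, d)

# toy usage: 8 frames, 16 patches each
y = SpatialTemporalSelfAttention()(torch.randn(2, 8, 16, 128))
```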
- UniForensics: Face Forgery Detection via General Facial Representation
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z)
- Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model
We propose a novel side-network-based decoder for generalized video-based Deepfake detection.
We also introduce Facial Component Guidance (FCG) to enhance spatial learning generalizability.
Our approach demonstrates promising generalizability on challenging Deepfake datasets.
arXiv Detail & Related papers (2024-04-08T14:58:52Z)
- Detecting Anomalies in Dynamic Graphs via Memory enhanced Normality
Anomaly detection in dynamic graphs presents a significant challenge due to the temporal evolution of graph structures and attributes.
We introduce a novel spatial-temporal memory-enhanced graph autoencoder (STRIPE).
STRIPE significantly outperforms existing methods, with a 5.8% improvement in AUC scores and 4.62X faster training time.
arXiv Detail & Related papers (2024-03-14T02:26:10Z)
- CrossDF: Improving Cross-Domain Deepfake Detection with Deep Information Decomposition
We propose a Deep Information Decomposition (DID) framework to enhance the performance of Cross-dataset Deepfake Detection (CrossDF).
Unlike most existing deepfake detection methods, our framework prioritizes high-level semantic features over specific visual artifacts.
It adaptively decomposes facial features into deepfake-related and irrelevant information, only using the intrinsic deepfake-related information for real/fake discrimination.
arXiv Detail & Related papers (2023-09-30T12:30:25Z)
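The DID entry above hinges on decomposing facial features into deepfake-related and irrelevant parts and classifying only on the former. The sketch below is one hedged way to express that split; the two projection heads and the decorrelation penalty are illustrative guesses, not the paper's actual components or losses:

```python
import torch
import torch.nn as nn

class InfoDecomposition(nn.Module):
    """Sketch: split a face embedding into a forgery-related part used
    for real/fake classification and a residual part that is discarded."""
    def __init__(self, dim=512):
        super().__init__()
        self.fake_head = nn.Linear(dim, dim)        # deepfake-related projection
        self.irrelevant_head = nn.Linear(dim, dim)  # identity, lighting, etc.
        self.classifier = nn.Linear(dim, 2)

    def forward(self, feats):
        z_fake = self.fake_head(feats)
        z_irr = self.irrelevant_head(feats)
        logits = self.classifier(z_fake)            # only forgery info drives the decision
        decor = (z_fake * z_irr).mean().abs()       # toy penalty pushing the parts apart
        return logits, decor
```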
- STC-IDS: Spatial-Temporal Correlation Feature Analyzing based Intrusion Detection System for Intelligent Connected Vehicles
We present STC-IDS, a novel model for automotive intrusion detection based on spatial-temporal correlation features of in-vehicle communication traffic.
Specifically, the proposed model exploits an encoding-detection architecture. In the encoder part, spatial and temporal relations are encoded simultaneously.
The encoded information is then passed to the detector for generating strong spatial-temporal attention features and enabling anomaly classification.
arXiv Detail & Related papers (2022-04-23T04:22:58Z)
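STC-IDS's encoding-detection pipeline — spatial and temporal relations encoded first, then a detector producing attention features for anomaly classification — might look roughly like the sketch below; the specific layers (a Conv1d over per-step signals, a GRU across time steps) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class STCEncoderDetector(nn.Module):
    """Sketch of an encoding-detection layout for in-vehicle traffic:
    encode signal relations per step, encode relations across steps,
    then classify windows via attention-weighted temporal pooling."""
    def __init__(self, n_signals=64, hidden=128):
        super().__init__()
        self.spatial = nn.Conv1d(1, hidden, kernel_size=3, padding=1)  # per-step signal relations
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)       # relations across time
        self.score = nn.Linear(hidden, 1)                              # attention over time steps
        self.classifier = nn.Linear(hidden, 2)                         # normal vs. intrusion

    def forward(self, x):
        # x: (B, T, n_signals) traffic features per time step
        b, t, s = x.shape
        h = self.spatial(x.reshape(b * t, 1, s)).mean(dim=2)  # (B*T, hidden)
        h, _ = self.temporal(h.reshape(b, t, -1))             # (B, T, hidden)
        w = torch.softmax(self.score(h), dim=1)               # temporal attention weights
        return self.classifier((w * h).sum(dim=1))            # (B, 2)

# toy usage: 4 windows of 32 time steps
logits = STCEncoderDetector()(torch.randn(4, 32, 64))
```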
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance compared to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
- A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction
We propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC).
First, a spatial-temporal attention mechanism is presented to explore the most useful and important information.
Second, we construct a joint feature sequence from sequential and instantaneous state information so that the generated trajectories preserve spatial continuity.
arXiv Detail & Related papers (2020-03-13T04:35:50Z)