Frame-level Temporal Difference Learning for Partial Deepfake Speech Detection
- URL: http://arxiv.org/abs/2507.15101v1
- Date: Sun, 20 Jul 2025 19:46:23 GMT
- Title: Frame-level Temporal Difference Learning for Partial Deepfake Speech Detection
- Authors: Menglu Li, Xiao-Ping Zhang, Lian Zhao,
- Abstract summary: We propose a Temporal Difference Attention Module (TDAM) that redefines partial deepfake detection as identifying unnatural temporal variations.<n>A dual-level hierarchical difference representation captures temporal irregularities at both fine and coarse scales, while adaptive average pooling preserves essential patterns across variable-length inputs to minimize information loss.<n>Our TDAM-AvgPool model achieves state-of-the-art performance, with an EER of 0.59% on the PartialSpoof dataset and 0.03% on the HAD dataset, which significantly outperforms the existing methods without requiring frame-level supervision.
- Score: 16.923285534924116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting partial deepfake speech is essential due to its potential for subtle misinformation. However, existing methods depend on costly frame-level annotations during training, limiting real-world scalability. Also, they focus on detecting transition artifacts between bonafide and deepfake segments. As deepfake generation techniques increasingly smooth these transitions, detection has become more challenging. To address this, our work introduces a new perspective by analyzing frame-level temporal differences and reveals that deepfake speech exhibits erratic directional changes and unnatural local transitions compared to bonafide speech. Based on this finding, we propose a Temporal Difference Attention Module (TDAM) that redefines partial deepfake detection as identifying unnatural temporal variations, without relying on explicit boundary annotations. A dual-level hierarchical difference representation captures temporal irregularities at both fine and coarse scales, while adaptive average pooling preserves essential patterns across variable-length inputs to minimize information loss. Our TDAM-AvgPool model achieves state-of-the-art performance, with an EER of 0.59% on the PartialSpoof dataset and 0.03% on the HAD dataset, which significantly outperforms the existing methods without requiring frame-level supervision.
Related papers
- Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization [60.73623588349311]
We propose a universal context-aware contrastive learning framework (UniCaCLF) for temporal forgery localization.<n>Our approach leverages supervised contrastive learning to discover and identify forged instants by means of anomaly detection.<n>An efficient context-aware contrastive coding is introduced to further push the limit of instant feature distinguishability between genuine and forged instants.
arXiv Detail & Related papers (2025-06-10T06:40:43Z) - Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference.<n>It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps.<n>Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization [13.840950434728533]
DiMoDif is an audio-visual deepfake detection framework.<n>It exploits the inter-modality differences in machine perception of speech.<n>It temporally localizes the deepfake forgery.
arXiv Detail & Related papers (2024-11-15T13:47:33Z) - UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z) - Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization [60.899082019130766]
We introduce a frame-level detection network (FDN) and a proposal refinement network (PRN) for audio temporal forgery detection and localization.
FDN aims to mine informative inconsistency cues between real and fake frames to obtain discriminative features that are beneficial for roughly indicating forgery regions.
PRN is responsible for predicting confidence scores and regression offsets to refine the coarse-grained proposals derived from the FDN.
arXiv Detail & Related papers (2024-07-23T15:07:52Z) - Learning Spatiotemporal Inconsistency via Thumbnail Layout for Face Deepfake Detection [41.35861722481721]
Deepfake threats to society and cybersecurity have provoked significant public apprehension.
This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (TALL)
TALL transforms a video clip into a pre-defined layout to realize the preservation of spatial and temporal dependencies.
arXiv Detail & Related papers (2024-03-15T12:48:44Z) - CrossDF: Improving Cross-Domain Deepfake Detection with Deep Information Decomposition [53.860796916196634]
We propose a Deep Information Decomposition (DID) framework to enhance the performance of Cross-dataset Deepfake Detection (CrossDF)
Unlike most existing deepfake detection methods, our framework prioritizes high-level semantic features over specific visual artifacts.
It adaptively decomposes facial features into deepfake-related and irrelevant information, only using the intrinsic deepfake-related information for real/fake discrimination.
arXiv Detail & Related papers (2023-09-30T12:30:25Z) - Generalizable Deepfake Detection with Phase-Based Motion Analysis [11.042856247812969]
We propose PhaseForensics, a DeepFake (DF) video detection method that leverages a phase-based motion representation of temporal dynamics.
We show improved distortion and adversarial robustness, and state-of-the-art cross-dataset generalization, with 91.2% video-level AUC on the challenging CelebDFv2.
arXiv Detail & Related papers (2022-11-17T06:28:01Z) - Delving into Sequential Patches for Deepfake Detection [64.19468088546743]
Recent advances in face forgery techniques produce nearly untraceable deepfake videos, which could be leveraged with malicious intentions.
Previous studies has identified the importance of local low-level cues and temporal information in pursuit to generalize well across deepfake methods.
We propose the Local- & Temporal-aware Transformer-based Deepfake Detection framework, which adopts a local-to-global learning protocol.
arXiv Detail & Related papers (2022-07-06T16:46:30Z) - Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset
and Multimodal Method for Temporal Forgery Localization [19.490174583625862]
We introduce a content-driven audio-visual deepfake dataset, termed Localized Audio Visual DeepFake (LAV-DF)
Specifically, the content-driven audio-visual manipulations are performed strategically to change the sentiment polarity of the whole video.
Our extensive quantitative and qualitative analysis demonstrates the proposed method's strong performance for temporal forgery localization and deepfake detection tasks.
arXiv Detail & Related papers (2022-04-13T08:02:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.