3-Tracer: A Tri-level Temporal-Aware Framework for Audio Forgery Detection and Localization
- URL: http://arxiv.org/abs/2511.21237v2
- Date: Mon, 01 Dec 2025 07:06:39 GMT
- Title: 3-Tracer: A Tri-level Temporal-Aware Framework for Audio Forgery Detection and Localization
- Authors: Shuhan Xia, Xuannan Liu, Xing Cui, Peipei Li,
- Abstract summary: T3-Tracer is a framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces.<n>FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies.<n>It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appeared on the forged boundaries.
- Score: 15.253944377996477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, We identify three key levels relevant to partial audio forgery detection and present T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3-Tracer consists of two complementary core modules: the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appeared on the forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state-of-the-art performance.
Related papers
- HarmoniAD: Harmonizing Local Structures and Global Semantics for Anomaly Detection [4.679561335065019]
Anomaly detection crucial in industrial product quality inspection.<n>Existing methods face a structure-semantics trade-off.<n>HarmoniAD is a frequency-guided dual-branch framework.
arXiv Detail & Related papers (2026-01-01T12:45:45Z) - Frequency-Domain Decomposition and Recomposition for Robust Audio-Visual Segmentation [60.9960601057956]
We introduce Frequency-Aware Audio-Visualcomposer (FAVS) framework consisting of two key modules.<n>FAVS framework achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2025-09-23T12:33:48Z) - TEn-CATG:Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph [28.536724593429398]
TEn-CATG is a text-enriched AVVP framework that combines semantic calibration with category-aware temporal reasoning.<n>We show that TEn-CATG achieves robustness and superior ability to capture complex temporal and semantic dependencies in weakly supervised AVVP tasks.
arXiv Detail & Related papers (2025-09-04T10:32:40Z) - Dual Semantic-Aware Network for Noise Suppressed Ultrasound Video Segmentation [21.117226880898418]
We propose a novel framework designed to enhance noise robustness in ultrasound video segmentation.<n>The Dual Semantic-Aware Network (DSANet) fosters mutual semantic awareness between local and global features.<n>Our model avoids pixel-level feature dependencies, it achieves significantly higher inference FPS than video-based methods, and even surpasses some image-based models.
arXiv Detail & Related papers (2025-07-10T05:41:17Z) - CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation [24.952907733127223]
We propose a general framework for video deepfake detection via Cross-Modal Alignment and Distillation (CAD)<n>CAD comprises two core components: 1) Cross-modal alignment that identifies inconsistencies in high-level semantic synchronization (e.g., lip-speech mismatches); 2) Cross-modal distillation that mitigates mismatchs while preserving modality-specific forensic traces (e.g., spectral distortions in synthetic audio)
arXiv Detail & Related papers (2025-05-21T08:11:07Z) - Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization [60.899082019130766]
We introduce a frame-level detection network (FDN) and a proposal refinement network (PRN) for audio temporal forgery detection and localization.
FDN aims to mine informative inconsistency cues between real and fake frames to obtain discriminative features that are beneficial for roughly indicating forgery regions.
PRN is responsible for predicting confidence scores and regression offsets to refine the coarse-grained proposals derived from the FDN.
arXiv Detail & Related papers (2024-07-23T15:07:52Z) - Unified Frequency-Assisted Transformer Framework for Detecting and
Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z) - Frequency Perception Network for Camouflaged Object Detection [51.26386921922031]
We propose a novel learnable and separable frequency perception mechanism driven by the semantic hierarchy in the frequency domain.<n>Our entire network adopts a two-stage model, including a frequency-guided coarse localization stage and a detail-preserving fine localization stage.<n>Compared with the currently existing models, our proposed method achieves competitive performance in three popular benchmark datasets.
arXiv Detail & Related papers (2023-08-17T11:30:46Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - TranssionADD: A multi-frame reinforcement based sequence tagging model
for audio deepfake detection [11.27584658526063]
The second Audio Deepfake Detection Challenge (ADD 2023) aims to detect and analyze deepfake speech utterances.
We propose our novel TranssionADD system as a solution to the challenging problem of model robustness and audio segment outliers.
Our best submission achieved 2nd place in Track 2, demonstrating the effectiveness and robustness of our proposed system.
arXiv Detail & Related papers (2023-06-27T05:18:25Z) - Deep Spectro-temporal Artifacts for Detecting Synthesized Speech [57.42110898920759]
This paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection)
In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding features.
We ranked 4th and 5th in track 1 and track 2, respectively.
arXiv Detail & Related papers (2022-10-11T08:31:30Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.