Two Views, One Truth: Spectral and Self-Supervised Features Fusion for Robust Speech Deepfake Detection
- URL: http://arxiv.org/abs/2507.20417v1
- Date: Sun, 27 Jul 2025 21:22:27 GMT
- Title: Two Views, One Truth: Spectral and Self-Supervised Features Fusion for Robust Speech Deepfake Detection
- Authors: Yassine El Kheir, Arnab Das, Enes Erdem Erdogan, Fabian Ritter-Guttierez, Tim Polzehl, Sebastian Möller
- Abstract summary: Recent advances in synthetic speech have made audio deepfakes increasingly realistic, posing significant security risks. Existing detection methods that rely on a single modality, either raw waveform embeddings or spectral-based features, are vulnerable to non-spoof disturbances. We investigate hybrid fusion frameworks that integrate self-supervised learning (SSL) based representations with handcrafted spectral descriptors.
- Score: 11.121265242990166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in synthetic speech have made audio deepfakes increasingly realistic, posing significant security risks. Existing detection methods that rely on a single modality, either raw waveform embeddings or spectral-based features, are vulnerable to non-spoof disturbances and often overfit to known forgery algorithms, resulting in poor generalization to unseen attacks. To address these shortcomings, we investigate hybrid fusion frameworks that integrate self-supervised learning (SSL) based representations with handcrafted spectral descriptors (MFCC, LFCC, CQCC). By aligning and combining complementary information across modalities, these fusion approaches capture subtle artifacts that single-feature approaches typically overlook. We explore several fusion strategies, including simple concatenation, cross attention, mutual cross attention, and a learnable gating mechanism, to optimally blend SSL features with fine-grained spectral cues. We evaluate our approach on four challenging public benchmarks and report generalization performance. All fusion variants consistently outperform an SSL-only baseline, with the cross-attention strategy achieving the best generalization with a 38% relative reduction in equal error rate (EER). These results confirm that joint modeling of waveform and spectral views produces robust, domain-agnostic representations for audio deepfake detection.
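As a rough illustration of the cross-attention fusion strategy (the best-performing variant in the abstract), the sketch below lets SSL frame embeddings attend to spectral frames before pooling and classification. This is a minimal sketch under assumed dimensions and module names (e.g., a 768-dim SSL encoder output and 60-dim LFCC frames), not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion of SSL and spectral views."""

    def __init__(self, d_ssl=768, d_spec=60, d_model=256, n_heads=4):
        super().__init__()
        # Project both views into a shared embedding space.
        self.ssl_proj = nn.Linear(d_ssl, d_model)
        self.spec_proj = nn.Linear(d_spec, d_model)
        # SSL frames act as queries; spectral frames supply keys/values.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, 2)  # bona fide vs. spoof

    def forward(self, ssl_feats, spec_feats):
        # ssl_feats: (B, T_ssl, d_ssl); spec_feats: (B, T_spec, d_spec)
        q = self.ssl_proj(ssl_feats)
        kv = self.spec_proj(spec_feats)
        attended, _ = self.attn(q, kv, kv)
        fused = self.norm(q + attended)  # residual connection
        return self.classifier(fused.mean(dim=1))  # temporal mean pooling

logits = CrossAttentionFusion()(torch.randn(2, 100, 768), torch.randn(2, 120, 60))
```

Under the same assumptions, mutual cross attention would add a second attention pass with the roles of the two views swapped, and the gating variant would replace attention with a learned convex combination of the two projected streams.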
Related papers
- CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection [54.85000884785013]
Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types, and the scarcity of training data. We propose CLIPfusion, a method that leverages both discriminative and generative foundation models. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection.
arXiv Detail & Related papers (2025-06-13T13:30:15Z) - CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation [24.952907733127223]
We propose a general framework for video deepfake detection via Cross-Modal Alignment and Distillation (CAD). CAD comprises two core components: 1) cross-modal alignment that identifies inconsistencies in high-level semantic synchronization (e.g., lip-speech mismatches); 2) cross-modal distillation that mitigates mismatches while preserving modality-specific forensic traces (e.g., spectral distortions in synthetic audio).
arXiv Detail & Related papers (2025-05-21T08:11:07Z) - SpecSphere: Dual-Pass Spectral-Spatial Graph Neural Networks with Certified Robustness [1.7495213911983414]
We introduce SpecSphere, the first dual-pass spectral-spatial GNN that certifies every prediction against both $\ell_0$ edge flips and $\ell_\infty$ feature perturbations. Our model couples a Chebyshev-polynomial spectral branch with an attention-gated spatial branch and fuses their representations through a lightweight module trained in a cooperative-adversarial min-max game.
arXiv Detail & Related papers (2025-05-13T08:00:16Z) - $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR). MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z) - Optimizing Speech Multi-View Feature Fusion through Conditional Computation [51.23624575321469]
Self-supervised learning (SSL) features provide lightweight and versatile multi-view speech representations. SSL features conflict with traditional spectral features like FBanks in terms of update directions. We propose a novel generalized feature fusion framework grounded in conditional computation.
arXiv Detail & Related papers (2025-01-14T12:12:06Z) - Spectrum-oriented Point-supervised Saliency Detector for Hyperspectral Images [13.79887292039637]
We introduce point supervision into hyperspectral salient object detection (HSOD). We incorporate Spectral Saliency, derived from conventional HSOD methods, as a pivotal spectral representation within the framework. We propose a novel pipeline, specifically designed for HSIs, to generate pseudo-labels, effectively mitigating the performance decline associated with the point supervision strategy.
arXiv Detail & Related papers (2024-12-24T02:52:43Z) - Hyperspectral Image Reconstruction via Combinatorial Embedding of Cross-Channel Spatio-Spectral Clues [6.580484964018551]
Existing learning-based hyperspectral reconstruction methods show limitations in fully exploiting the information among the hyperspectral bands.
We propose to investigate the inter-dependencies in their respective hyperspectral space.
These embedded features can be fully exploited by querying the inter-channel correlations.
arXiv Detail & Related papers (2023-12-18T11:37:19Z) - Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms [19.514932118278523]
We propose a novel deep learning method with a spectral fusion-reconstruction strategy, namely S2pecNet, to utilise multi-order spectral patterns for robust audio anti-spoofing representations.
A reconstruction from the fused representation back to the input spectrograms further reduces potential information loss in fusion. Our method achieved state-of-the-art performance with an EER of 0.77% on a widely used dataset.
arXiv Detail & Related papers (2023-08-18T04:51:15Z) - PAIF: Perception-Aware Infrared-Visible Image Fusion for Attack-Tolerant Semantic Segmentation [50.556961575275345]
We propose a perception-aware fusion framework to promote segmentation robustness in adversarial scenes.
We show that our scheme substantially enhances robustness, with gains of 15.3% mIoU compared with advanced competitors.
arXiv Detail & Related papers (2023-08-08T01:55:44Z) - Deep Spectro-temporal Artifacts for Detecting Synthesized Speech [57.42110898920759]
This paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection).
In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding features.
We ranked 4th and 5th in track 1 and track 2, respectively.
arXiv Detail & Related papers (2022-10-11T08:31:30Z) - Attack Agnostic Dataset: Towards Generalization and Stabilization of Audio DeepFake Detection [0.4511923587827302]
Methods for detecting audio DeepFakes should be characterized by good generalization and stability.
We present a thorough analysis of current DeepFake detection methods and consider different audio features (front-ends).
We propose a model based on LCNN with LFCC and mel-spectrogram front-end, which shows improvement over the LFCC-based model; a minimal front-end extraction sketch appears after this list.
arXiv Detail & Related papers (2022-06-27T12:30:44Z) - Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize the high-frequency noises for face forgery detection.
The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales.
The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
arXiv Detail & Related papers (2021-03-23T08:19:21Z)
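Handcrafted spectral front-ends recur throughout this list and in the main abstract (MFCC, LFCC, CQCC, mel-spectrograms). The fragment below is a minimal sketch of extracting MFCC, LFCC, and mel-spectrogram features with torchaudio; all parameter values are illustrative assumptions, and CQCC has no built-in torchaudio transform, so it would require a separate implementation.

```python
import torch
import torchaudio

SR = 16000
waveform = torch.randn(1, SR)  # placeholder 1-second waveform

# 20-coefficient MFCCs over an 80-band mel spectrogram (illustrative settings).
mfcc = torchaudio.transforms.MFCC(
    sample_rate=SR, n_mfcc=20,
    melkwargs={"n_fft": 512, "hop_length": 160, "n_mels": 80},
)(waveform)

# Linear-frequency cepstral coefficients with the same STFT settings.
lfcc = torchaudio.transforms.LFCC(
    sample_rate=SR, n_lfcc=20,
    speckwargs={"n_fft": 512, "hop_length": 160},
)(waveform)

# Plain mel-spectrogram front-end.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=512, hop_length=160, n_mels=80
)(waveform)

print(mfcc.shape, lfcc.shape, mel.shape)  # (1, 20, T), (1, 20, T), (1, 80, T)
```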
This list is automatically generated from the titles and abstracts of the papers on this site.