Generalizable Audio Spoofing Detection using Non-Semantic Representations
- URL: http://arxiv.org/abs/2509.00186v1
- Date: Fri, 29 Aug 2025 18:37:57 GMT
- Title: Generalizable Audio Spoofing Detection using Non-Semantic Representations
- Authors: Arnab Das, Yassine El Kheir, Carlos Franzreb, Tim Herzig, Tim Polzehl, Sebastian Möller
- Abstract summary: Rapid advancements in generative modeling have made synthetic audio generation easy, making speech-based services vulnerable to spoofing attacks. Existing solutions for deepfake detection are often criticized for lacking generalizability and fail drastically when applied to real-world data. This study proposes a novel method for generalizable spoofing detection leveraging non-semantic universal audio representations.
- Score: 12.685819931453045
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Rapid advancements in generative modeling have made synthetic audio generation easy, making speech-based services vulnerable to spoofing attacks. Consequently, there is a dire need for robust countermeasures more than ever. Existing solutions for deepfake detection are often criticized for lacking generalizability and fail drastically when applied to real-world data. This study proposes a novel method for generalizable spoofing detection leveraging non-semantic universal audio representations. Extensive experiments have been performed to find suitable non-semantic features using TRILL and TRILLsson models. The results indicate that the proposed method achieves comparable performance on the in-domain test set while significantly outperforming state-of-the-art approaches on out-of-domain test sets. Notably, it demonstrates superior generalization on public-domain data, surpassing methods based on hand-crafted features, semantic embeddings, and end-to-end architectures.
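The abstract's recipe of training a detector on frozen, non-semantic audio embeddings and reporting equal error rate (EER) can be illustrated with a minimal sketch. The embeddings below are simulated random vectors standing in for TRILL/TRILLsson outputs, and the logistic-regression head is an assumption for illustration, not the paper's actual classifier.

```python
# Minimal sketch of the "frozen non-semantic embeddings + shallow head" recipe.
# Embeddings are simulated; in practice they would come from TRILL or TRILLsson.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
dim = 512  # illustrative embedding dimensionality

# Simulated utterance-level embeddings: bona fide vs. spoofed audio.
bonafide = rng.normal(loc=0.0, scale=1.0, size=(200, dim))
spoofed = rng.normal(loc=0.6, scale=1.0, size=(200, dim))
X = np.vstack([bonafide, spoofed])
y = np.array([0] * 200 + [1] * 200)  # 1 = spoof

# Shallow classifier trained on the frozen embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# EER: the operating point where the false-positive rate equals the
# false-negative rate -- the standard metric in spoofing detection.
fpr, tpr, _ = roc_curve(y, scores)
fnr = 1.0 - tpr
eer = fpr[np.argmin(np.abs(fpr - fnr))]
print(f"EER: {eer:.3f}")
```

Generalization would then be measured by computing the same EER on out-of-domain test sets rather than on the training pool, as the paper does.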
Related papers
- FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning [9.960675988638805]
We propose a novel framework called fake audio detection with evidential learning (FADEL).
FADEL incorporates model uncertainty into its predictions, thereby leading to more robust performance in OOD scenarios.
We demonstrate the validity of uncertainty estimation by analyzing a strong correlation between average uncertainty and equal error rate (EER) across different spoofing algorithms.
arXiv Detail & Related papers (2025-04-22T07:40:35Z)
- Anomaly Detection and Localization for Speech Deepfakes via Feature Pyramid Matching [8.466707742593078]
Speech deepfakes are synthetic audio signals that can imitate target speakers' voices.
Existing methods for detecting speech deepfakes rely on supervised learning.
We introduce a novel interpretable one-class detection framework, which reframes speech deepfake detection as an anomaly detection task.
arXiv Detail & Related papers (2025-03-23T11:15:22Z)
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors.
In this paper, we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- Scalable Ensemble-based Detection Method against Adversarial Attacks for speaker verification [73.30974350776636]
This paper comprehensively compares mainstream purification techniques in a unified framework.
We propose an easy-to-follow ensemble approach that integrates advanced purification modules for detection.
arXiv Detail & Related papers (2023-12-14T03:04:05Z)
- Characterizing the temporal dynamics of universal speech representations for generalizable deepfake detection [14.449940985934388]
Existing deepfake speech detection systems lack generalizability to unseen attacks.
Recent studies have explored the use of universal speech representations to tackle this issue.
We argue that characterizing the long-term temporal dynamics of these representations is crucial for generalizability.
arXiv Detail & Related papers (2023-09-15T01:37:45Z)
- Deepfake audio detection by speaker verification [79.99653758293277]
We propose a new detection approach that leverages only the biometric characteristics of the speaker, with no reference to specific manipulations.
The proposed approach can be implemented based on off-the-shelf speaker verification tools.
We test several such solutions on three popular test sets, obtaining good performance, high generalization ability, and high robustness to audio impairments.
arXiv Detail & Related papers (2022-09-28T13:46:29Z)
- Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection [41.27496491339225]
This work addresses the generalizable deepfake detection from a simple principle.
We propose to enrich the "diversity" of forgeries by synthesizing augmented forgeries with a pool of forgery configurations.
We also propose an adversarial training strategy that dynamically synthesizes the forgeries most challenging to the current model.
arXiv Detail & Related papers (2022-03-23T05:52:23Z)
- Enhancing the Generalization for Intent Classification and Out-of-Domain Detection in SLU [70.44344060176952]
Intent classification is a major task in spoken language understanding (SLU).
Recent works have shown that using extra data and labels can improve the OOD detection performance.
This paper proposes to train a model with only IND data while supporting both IND intent classification and OOD detection.
arXiv Detail & Related papers (2021-06-28T08:27:38Z)
- Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection [60.88952532574564]
This paper conducts a thorough comparison of out-of-domain intent detection methods.
We evaluate multiple contextual encoders and methods proven to be efficient on three standard datasets for intent classification.
Our main findings show that fine-tuning Transformer-based encoders on in-domain data leads to superior results.
arXiv Detail & Related papers (2021-01-11T09:10:58Z)
- Better Fine-Tuning by Reducing Representational Collapse [77.44854918334232]
Existing approaches for fine-tuning pre-trained language models have been shown to be unstable.
We present a method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise.
We show it is less prone to representational collapse: the pre-trained models maintain more generalizable representations every time they are fine-tuned.
arXiv Detail & Related papers (2020-08-06T02:13:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.