Generalizable Audio Spoofing Detection using Non-Semantic Representations
- URL: http://arxiv.org/abs/2509.00186v1
- Date: Fri, 29 Aug 2025 18:37:57 GMT
- Title: Generalizable Audio Spoofing Detection using Non-Semantic Representations
- Authors: Arnab Das, Yassine El Kheir, Carlos Franzreb, Tim Herzig, Tim Polzehl, Sebastian Möller
- Abstract summary: Rapid advancements in generative modeling have made synthetic audio generation easy, making speech-based services vulnerable to spoofing attacks. Existing solutions for deepfake detection are often criticized for lacking generalizability and fail drastically when applied to real-world data. This study proposes a novel method for generalizable spoofing detection leveraging non-semantic universal audio representations.
- Score: 12.685819931453045
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Rapid advancements in generative modeling have made synthetic audio generation easy, making speech-based services vulnerable to spoofing attacks. Consequently, there is a dire need for robust countermeasures more than ever. Existing solutions for deepfake detection are often criticized for lacking generalizability and fail drastically when applied to real-world data. This study proposes a novel method for generalizable spoofing detection leveraging non-semantic universal audio representations. Extensive experiments have been performed to find suitable non-semantic features using TRILL and TRILLsson models. The results indicate that the proposed method achieves comparable performance on the in-domain test set while significantly outperforming state-of-the-art approaches on out-of-domain test sets. Notably, it demonstrates superior generalization on public-domain data, surpassing methods based on hand-crafted features, semantic embeddings, and end-to-end architectures.
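The abstract's recipe of training a detector on frozen, non-semantic audio embeddings and reporting equal error rate (EER) can be illustrated with a minimal sketch. The embeddings below are simulated random vectors standing in for TRILL/TRILLsson outputs, and the logistic-regression head is an assumption for illustration, not the paper's actual classifier.

```python
# Minimal sketch of the "frozen non-semantic embeddings + shallow head" recipe.
# Embeddings are simulated; in practice they would come from TRILL or TRILLsson.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
dim = 512  # illustrative embedding dimensionality

# Simulated utterance-level embeddings: bona fide vs. spoofed audio.
bonafide = rng.normal(loc=0.0, scale=1.0, size=(200, dim))
spoofed = rng.normal(loc=0.6, scale=1.0, size=(200, dim))
X = np.vstack([bonafide, spoofed])
y = np.array([0] * 200 + [1] * 200)  # 1 = spoof

# Shallow classifier trained on the frozen embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# EER: the operating point where the false-positive rate equals the
# false-negative rate -- the standard metric in spoofing detection.
fpr, tpr, _ = roc_curve(y, scores)
fnr = 1.0 - tpr
eer = fpr[np.argmin(np.abs(fpr - fnr))]
print(f"EER: {eer:.3f}")
```

Generalization would then be measured by computing the same EER on out-of-domain test sets rather than on the training pool, as the paper does.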
Related papers
- FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning [9.960675988638805]
We propose a novel framework called fake audio detection with evidential learning (FADEL).
FADEL incorporates model uncertainty into its predictions, thereby leading to more robust performance in OOD scenarios.
We demonstrate the validity of uncertainty estimation by analyzing a strong correlation between average uncertainty and equal error rate (EER) across different spoofing algorithms.
arXiv Detail & Related papers (2025-04-22T07:40:35Z)
- Anomaly Detection and Localization for Speech Deepfakes via Feature Pyramid Matching [8.466707742593078]
Speech deepfakes are synthetic audio signals that can imitate target speakers' voices.
Existing methods for detecting speech deepfakes rely on supervised learning.
We introduce a novel interpretable one-class detection framework, which reframes speech deepfake detection as an anomaly detection task.
arXiv Detail & Related papers (2025-03-23T11:15:22Z)
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors.
In this paper, we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- Scalable Ensemble-based Detection Method against Adversarial Attacks for speaker verification [73.30974350776636]
This paper comprehensively compares mainstream purification techniques in a unified framework.
We propose an easy-to-follow ensemble approach that integrates advanced purification modules for detection.
arXiv Detail & Related papers (2023-12-14T03:04:05Z)
- Characterizing the temporal dynamics of universal speech representations for generalizable deepfake detection [14.449940985934388]
Existing deepfake speech detection systems lack generalizability to unseen attacks.
Recent studies have explored the use of universal speech representations to tackle this issue.
We argue that characterizing the long-term temporal dynamics of these representations is crucial for generalizability.
arXiv Detail & Related papers (2023-09-15T01:37:45Z)
- Deepfake audio detection by speaker verification [79.99653758293277]
We propose a new detection approach that leverages only the biometric characteristics of the speaker, with no reference to specific manipulations.
The proposed approach can be implemented based on off-the-shelf speaker verification tools.
We test several such solutions on three popular test sets, obtaining good performance, high generalization ability, and high robustness to audio impairments.
arXiv Detail & Related papers (2022-09-28T13:46:29Z)
- Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection [41.27496491339225]
This work addresses the generalizable deepfake detection from a simple principle.
We propose to enrich the "diversity" of forgeries by synthesizing augmented forgeries with a pool of forgery configurations.
We also propose an adversarial training strategy that dynamically synthesizes the forgeries most challenging to the current model.
arXiv Detail & Related papers (2022-03-23T05:52:23Z)
- Enhancing the Generalization for Intent Classification and Out-of-Domain Detection in SLU [70.44344060176952]
Intent classification is a major task in spoken language understanding (SLU).
Recent works have shown that using extra data and labels can improve the OOD detection performance.
This paper proposes to train a model with only IND data while supporting both IND intent classification and OOD detection.
arXiv Detail & Related papers (2021-06-28T08:27:38Z)
- Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection [60.88952532574564]
This paper conducts a thorough comparison of out-of-domain intent detection methods.
We evaluate multiple contextual encoders and methods proven to be efficient on three standard datasets for intent classification.
Our main findings show that fine-tuning Transformer-based encoders on in-domain data leads to superior results.
arXiv Detail & Related papers (2021-01-11T09:10:58Z)
- Better Fine-Tuning by Reducing Representational Collapse [77.44854918334232]
Existing approaches for fine-tuning pre-trained language models have been shown to be unstable.
We present a method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise.
We show it is less prone to representational collapse: the pre-trained models maintain more generalizable representations every time they are fine-tuned.
arXiv Detail & Related papers (2020-08-06T02:13:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.