Related papers: A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

URL: http://arxiv.org/abs/2603.01482v1
Date: Mon, 02 Mar 2026 05:45:55 GMT
Title: A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection
Authors: Hashim Ali, Nithin Sai Adupa, Surya Subramani, Hafiz Malik,
Abstract summary: Spoof-SUPERB is a benchmark for audio deepfake detection.<n>We evaluate 20 SSL models spanning generative, discriminative, and spectrogram-based architectures.
Score: 2.432576583937997
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite it's security-critical importance, Audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluated these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.

Related papers

FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning [9.960675988638805]
We propose a novel framework called fake audio detection with evidential learning (FADEL)<n>FADEL incorporates model uncertainty into its predictions, thereby leading to more robust performance in OOD scenarios.<n>We demonstrate the validity of uncertainty estimation by analyzing a strong correlation between average uncertainty and equal error rate (EER) across different spoofing algorithms.
arXiv Detail & Related papers (2025-04-22T07:40:35Z)
$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called the Mask-And-Recover (MAR)<n>MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.<n>To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.<n>It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.<n>It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition [3.4355593397388597]
Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models. We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models. We find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER.
arXiv Detail & Related papers (2024-08-14T23:33:10Z)
Robustness of Speech Separation Models for Similar-pitch Speakers [14.941946672578863]
Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments. This paper investigates the robustness of state-of-the-art Neural Network models in scenarios where the pitch differences between speakers are minimal.
arXiv Detail & Related papers (2024-07-22T15:55:08Z)
Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations [16.269123889392343]
This work proposes Audio Mamba, a selective state space model for learning general-purpose audio representations. Empirical results on ten diverse audio recognition downstream tasks show that the proposed models consistently outperform comparable self-supervised audio spectrogram transformer baselines.
arXiv Detail & Related papers (2024-06-04T10:19:14Z)
Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition. We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective. Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z)
Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system. We show that these results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL) ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a textitgap between clean data training and real-world inference. We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedding into similar vector space. Experiments on the widely-used dataset, Snips, and large scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on real-world (noisy) corpus but also enhances the robustness, that is, it produces high-quality results under a noisy environment.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments. We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.