Multi-level SSL Feature Gating for Audio Deepfake Detection
- URL: http://arxiv.org/abs/2509.03409v1
- Date: Wed, 03 Sep 2025 15:37:52 GMT
- Title: Multi-level SSL Feature Gating for Audio Deepfake Detection
- Authors: Hoan My Tran, Damien Lolive, Aghilas Sini, Arnaud Delhay, Pierre-François Marteau, David Guennec
- Abstract summary: Recent advancements in generative AI, particularly in speech synthesis, have enabled the generation of highly natural-sounding synthetic speech. These innovations pose significant risks, including misuse for fraudulent activities, identity theft, and security threats. Current research on spoofing detection countermeasures remains limited by generalization to unseen deepfake attacks and languages. We propose a gating mechanism that extracts relevant features from the speech foundation model XLS-R, used as a front-end feature extractor.
- Score: 4.053610356853999
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advancements in generative AI, particularly in speech synthesis, have enabled the generation of highly natural-sounding synthetic speech that closely mimics human voices. While these innovations hold promise for applications like assistive technologies, they also pose significant risks, including misuse for fraudulent activities, identity theft, and security threats. Current research on spoofing detection countermeasures remains limited by generalization to unseen deepfake attacks and languages. To address this, we propose a gating mechanism that extracts relevant features from the speech foundation model XLS-R, used as a front-end feature extractor. For the downstream back-end classifier, we employ Multi-kernel gated Convolution (MultiConv) to capture both local and global speech artifacts. Additionally, we introduce Centered Kernel Alignment (CKA) as a similarity metric to enforce diversity in the features learned across different MultiConv layers. By integrating CKA with our gating mechanism, we hypothesize that each component helps improve the learning of distinct synthetic speech patterns. Experimental results demonstrate that our approach achieves state-of-the-art performance on in-domain benchmarks while generalizing robustly to out-of-domain datasets, including multilingual speech samples. This underscores its potential as a versatile solution for detecting evolving speech deepfake threats.
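The method the abstract sketches has three moving parts: a gate that weights the hidden layers of the frozen XLS-R front-end, MultiConv blocks that mix short and long convolution kernels to catch local and global artifacts, and a linear CKA penalty that keeps the blocks' learned features diverse. The PyTorch sketch below is a minimal illustration of those three ideas under stated assumptions, not the authors' implementation: the layer count, the kernel sizes (3/7/15), the gating shapes, and the loss combination are all choices made for the example.

```python
import torch
import torch.nn as nn


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear Centered Kernel Alignment between (n_samples, dim) matrices:
    CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) after centering."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    num = (y.T @ x).norm(p="fro") ** 2
    den = (x.T @ x).norm(p="fro") * (y.T @ y).norm(p="fro")
    return num / (den + 1e-8)


class LayerGate(nn.Module):
    """Soft gate over the stacked hidden states of a frozen SSL front-end
    (e.g. the transformer layer outputs of XLS-R): one learned weight per layer."""

    def __init__(self, num_layers: int = 25):  # 25 layers is an assumption
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, dim)
        weights = torch.softmax(self.logits, dim=0).view(-1, 1, 1, 1)
        return (weights * hidden_states).sum(dim=0)  # (batch, time, dim)


class MultiConvBlock(nn.Module):
    """Gated multi-kernel 1-D convolutions: small kernels see local artifacts,
    large kernels see longer-range ones; a 1x1 gate fuses the branches."""

    def __init__(self, dim: int, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )
        self.proj = nn.Conv1d(dim * len(kernel_sizes), dim, 1)
        self.gate = nn.Conv1d(dim * len(kernel_sizes), dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, time); concatenate all kernel views, then gate them
        mixed = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.proj(mixed) * torch.sigmoid(self.gate(mixed))


def cka_diversity_loss(block_outputs: list[torch.Tensor]) -> torch.Tensor:
    """Diversity penalty: discourage successive MultiConv blocks from learning
    redundant features by minimizing their pairwise linear CKA."""
    flat = [out.flatten(start_dim=1) for out in block_outputs]  # (batch, feats)
    return torch.stack([linear_cka(a, b) for a, b in zip(flat, flat[1:])]).sum()
```

In training, a plausible objective is the spoof/bona-fide classification loss plus `lambda_cka * cka_diversity_loss(block_outputs)` with a tuned weight `lambda_cka`, so the network is pushed both to fit the labels and to spread successive blocks apart in representation space.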
Related papers
- Multi-Speaker Conversational Audio Deepfake: Taxonomy, Dataset and Pilot Study [6.567506441691872]
We introduce a new Multi-speaker Conversational Audio Deepfakes dataset (MsCADD) of 2,830 audio clips containing real and fully synthetic two-speaker conversations. We benchmark three neural baseline models, LFCC-LCNN, RawNet2, and Wav2Vec 2.0, on this dataset and report performance in terms of F1 score, accuracy, true positive rate (TPR), and true negative rate (TNR); a sketch of these metrics appears after this list. Results show that the baselines provide a useful benchmark, but they also highlight a significant gap in multi-speaker deepfake research: synthetic voices are not yet reliably detected.
arXiv Detail & Related papers (2026-01-30T20:38:10Z) - Backdoor Attacks Against Speech Language Models [63.07317091368079]
We present the first systematic study of audio backdoor attacks against speech language models. We demonstrate the attack's effectiveness across four speech encoders and three datasets, covering four tasks. We propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.
arXiv Detail & Related papers (2025-10-01T17:45:04Z) - Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints [11.703509488782345]
We introduce a training-free yet effective approach for detecting AI-generated speech. We tackle three key tasks: (1) single-model attribution in an open-world setting, (2) multi-model attribution in a closed-world setting, and (3) detection of synthetic versus real speech.
arXiv Detail & Related papers (2024-11-21T10:55:49Z) - Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark. It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution combines Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) and deep neural network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - All-for-One and One-For-All: Deep learning-based feature fusion for Synthetic Speech Detection [18.429817510387473]
Recent advances in deep learning and computer vision have made the synthesis and counterfeiting of multimedia content more accessible than ever.
In this paper, we consider three different feature sets proposed in the literature for the synthetic speech detection task and present a model that fuses them.
The system was tested on different scenarios and datasets to prove its robustness to anti-forensic attacks and its generalization capabilities.
arXiv Detail & Related papers (2023-07-28T13:50:25Z) - NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection [50.33525966541906]
Existing multimodal detection methods capture audio-visual inconsistencies to expose Deepfake videos.
We propose a novel Deepfake detection method to mine the correlation between Non-critical Phonemes and Visemes, termed NPVForensics.
Our model can be easily adapted to the downstream Deepfake datasets with fine-tuning.
arXiv Detail & Related papers (2023-06-12T06:06:05Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Deepfake audio detection by speaker verification [79.99653758293277]
We propose a new detection approach that leverages only the biometric characteristics of the speaker, with no reference to specific manipulations.
The proposed approach can be implemented based on off-the-shelf speaker verification tools.
We test several such solutions on three popular test sets, obtaining good performance, high generalization ability, and high robustness to audio impairment.
arXiv Detail & Related papers (2022-09-28T13:46:29Z) - DeepSafety: Multi-level Audio-Text Feature Extraction and Fusion Approach for Violence Detection in Conversations [2.8038382295783943]
The choice of words and vocal cues in conversations presents an underexplored rich source of natural language data for personal safety and crime prevention.
We introduce a new information fusion approach that extracts and fuses multi-level features including verbal, vocal, and text as heterogeneous sources of information to detect the extent of violent behaviours in conversations.
arXiv Detail & Related papers (2022-06-23T16:45:50Z) - A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z) - Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis [18.812696623555855]
We present a novel few-shot multi-speaker speech synthesis approach (FSM-SS).
Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few shot manner.
We demonstrate how the affine parameters of normalization help in capturing the prosodic features such as energy and fundamental frequency.
arXiv Detail & Related papers (2020-12-14T04:37:07Z)
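Several of the benchmarks above, e.g. the MsCADD pilot study, report F1 score, accuracy, TPR, and TNR. As a quick reference for how those four numbers fall out of one binary confusion matrix, here is a small dependency-free Python sketch; the convention that label 1 means "deepfake" (the positive class) is an assumption of the example, not something the papers fix.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """F1, accuracy, TPR (recall/sensitivity), and TNR (specificity) from
    binary labels, with 1 = deepfake taken as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    tnr = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"f1": f1, "accuracy": (tp + tn) / len(y_true), "tpr": tpr, "tnr": tnr}
```

Any standard metrics library (e.g. scikit-learn's `precision_recall_fscore_support`) computes the same quantities; the sketch only makes the definitions explicit.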
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.