Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024
- URL: http://arxiv.org/abs/2409.02302v1
- Date: Tue, 3 Sep 2024 21:28:45 GMT
- Title: Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024
- Authors: Anmol Guragain, Tianchi Liu, Zihan Pan, Hardik B. Sailor, Qiongqiong Wang,
- Abstract summary: This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD) challenge.
The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices.
The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task.
- Score: 8.940008511570207
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD). The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore ensemble methods, utilizing speech foundation models to develop robust singing voice anti-spoofing systems. We also introduce a novel Squeeze-and-Excitation Aggregation (SEA) method, which efficiently and effectively integrates representation features from the speech foundation models, surpassing the performance of our other individual systems. Evaluation results confirm the efficacy of our approach in detecting deepfake singing voices. The code can be accessed at https://github.com/Anmol2059/SVDD2024.
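The abstract does not spell out the SEA architecture, but squeeze-and-excitation aggregation over a foundation model's layer outputs can be sketched as: pool each layer's feature map to a scalar, pass the scalars through a small gating network, and use the resulting weights to combine the layers. A minimal NumPy sketch; the function name `sea_aggregate`, the weight shapes, and the reduction ratio `r` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sea_aggregate(layer_feats, w1, w2):
    """Squeeze-and-Excitation style aggregation over L layer features.

    layer_feats: (L, T, D) hidden states from a speech foundation model.
    w1: (L, L//r), w2: (L//r, L) excitation weights (hypothetical shapes).
    Returns a single (T, D) feature map: a gated, weighted sum over layers.
    """
    # Squeeze: global average pool each layer's (T, D) map to one scalar.
    squeezed = layer_feats.mean(axis=(1, 2))            # (L,)
    # Excitation: small bottleneck MLP followed by a sigmoid gate.
    hidden = np.maximum(squeezed @ w1, 0.0)             # ReLU, (L//r,)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))        # sigmoid, (L,)
    # Reweight and sum the layer features.
    return np.tensordot(gates, layer_feats, axes=1)     # (T, D)

rng = np.random.default_rng(0)
L, T, D, r = 12, 50, 768, 4
feats = rng.standard_normal((L, T, D))
w1 = rng.standard_normal((L, L // r)) * 0.1
w2 = rng.standard_normal((L // r, L)) * 0.1
out = sea_aggregate(feats, w1, w2)
print(out.shape)  # (50, 768)
```

In this reading, the gates let the model emphasize whichever layers carry the most spoof-discriminative information while the output keeps the original time-by-feature shape.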
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- VoiceWukong: Benchmarking Deepfake Voice Detection [6.8595368524357285]
We present VoiceWukong, a benchmark designed to evaluate the performance of deepfake voice detectors.
To build the dataset, we first collected deepfake voices generated by 19 commercial tools and 15 open-source tools.
We then created 38 data variants covering six types of manipulations, constructing the evaluation dataset for deepfake voice detection.
arXiv Detail & Related papers (2024-09-10T09:07:12Z)
- Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection [0.0]
ASVspoof 5 Challenge Track 1: Speech Deepfake Detection - Open Condition consists of a stand-alone speech deepfake (bonafide vs spoof) detection task.
We leverage a pre-trained WavLM as a front-end model and pool its representations with different back-end techniques.
Our fused system achieves 0.0937 minDCF, 3.42% EER, 0.1927 Cllr, and 0.1375 actDCF.
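The EER reported here (and in the headline paper above) is the operating point at which the false-rejection rate on bonafide speech equals the false-acceptance rate on spoofed speech. A minimal NumPy sketch with toy scores; the helper name and score values are illustrative:

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """EER: threshold where the false-rejection rate (bonafide rejected)
    equals the false-acceptance rate (spoof accepted). Higher score = bonafide."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

bona = np.array([0.9, 0.8, 0.75, 0.6, 0.55])
spoof = np.array([0.4, 0.3, 0.35, 0.65, 0.2])
print(f"EER = {equal_error_rate(bona, spoof):.1%}")  # EER = 20.0%
```

Lower is better: the 1.79% pooled EER above means the two error rates balance at under 2% on the CtrSVDD evaluation set.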
arXiv Detail & Related papers (2024-09-08T08:54:36Z)
- OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [88.59397418187226]
We propose a novel unified open-vocabulary detection method called OV-DINO.
It is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks.
arXiv Detail & Related papers (2024-07-10T17:05:49Z)
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- Retrieval-Augmented Audio Deepfake Detection [27.13059118273849]
We propose a retrieval-augmented detection framework that augments test samples with similar retrieved samples for enhanced detection.
Experiments show the superior performance of the proposed RAD framework over baseline methods.
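The retrieval step above can be sketched as a nearest-neighbour lookup in an embedding store. The cosine-similarity search below is an illustrative assumption, not the RAD implementation; the store, query, and function name are hypothetical.

```python
import numpy as np

def retrieve_topk(query, store, k=3):
    """Return indices of the k store embeddings most cosine-similar to query,
    plus the full similarity vector."""
    q = query / np.linalg.norm(query)
    s = store / np.linalg.norm(store, axis=1, keepdims=True)
    sims = s @ q
    return np.argsort(sims)[::-1][:k], sims

rng = np.random.default_rng(1)
store = rng.standard_normal((100, 64))              # hypothetical embedding store
query = store[7] + 0.01 * rng.standard_normal(64)   # test sample near item 7
topk, sims = retrieve_topk(query, store, k=3)
print(topk[0])  # 7 — the nearest neighbour is the perturbed source item
```

A detector can then condition on the retrieved neighbours (e.g. concatenate or attend over them) rather than scoring the test sample in isolation.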
arXiv Detail & Related papers (2024-04-22T05:46:40Z)
- PITCH: AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response [14.604998731837595]
PITCH is a robust challenge-response method to detect and tag interactive deepfake audio calls.
We developed a comprehensive taxonomy of audio challenges based on the human auditory system, linguistics, and environmental factors.
Our solution gave users maximum control and boosted detection accuracy to 84.5%.
arXiv Detail & Related papers (2024-02-28T06:17:55Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Exploring linguistic feature and model combination for speech recognition based automatic AD detection [61.91708957996086]
Speech based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques.
Scarcity of specialist data leads to uncertainty in both model selection and feature learning when developing such systems.
This paper investigates the use of feature and model combination approaches to improve the robustness of domain fine-tuning of BERT and Roberta pre-trained text encoders.
arXiv Detail & Related papers (2022-06-28T05:09:01Z)
- On the pragmatism of using binary classifiers over data intensive neural network classifiers for detection of COVID-19 from voice [34.553128768223615]
We show that detecting COVID-19 from voice does not require custom-made non-standard features or complicated neural network classifiers.
We demonstrate this from a human-curated dataset collected and calibrated in clinical settings.
arXiv Detail & Related papers (2022-04-11T00:19:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.