VoiceWukong: Benchmarking Deepfake Voice Detection
- URL: http://arxiv.org/abs/2409.06348v1
- Date: Tue, 10 Sep 2024 09:07:12 GMT
- Title: VoiceWukong: Benchmarking Deepfake Voice Detection
- Authors: Ziwei Yan, Yanjie Zhao, Haoyu Wang
- Abstract summary: We present VoiceWukong, a benchmark designed to evaluate the performance of deepfake voice detectors.
To build the dataset, we first collected deepfake voices generated by 19 commercial tools and 15 open-source tools.
We then created 38 data variants covering six types of manipulations, constructing the evaluation dataset for deepfake voice detection.
- Score: 6.8595368524357285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of technologies like text-to-speech (TTS) and voice conversion (VC), detecting deepfake voices has become increasingly crucial. However, both academia and industry lack a comprehensive and intuitive benchmark for evaluating detectors. Existing datasets are limited in language diversity and lack many manipulations encountered in real-world production environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate the performance of deepfake voice detectors. To build the dataset, we first collected deepfake voices generated by 19 advanced and widely recognized commercial tools and 15 open-source tools. We then created 38 data variants covering six types of manipulations, constructing the evaluation dataset for deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200 Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12 state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of 13.50%, while all others exceeded 20%. Our findings reveal that these detectors face significant challenges in real-world applications, where their performance declines dramatically. In addition, we conducted a user study with more than 300 participants. The results are compared with the performance of the 12 detectors and a multimodal large language model (MLLM), i.e., Qwen2-Audio: different detectors and humans exhibit varying identification capabilities for deepfake voices at different deception levels, while the MLLM demonstrates no detection ability at all. Furthermore, we provide a leaderboard for deepfake voice detection, publicly available at https://voicewukong.github.io.
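Since results here are reported as equal error rate (EER), a minimal sketch of how EER is conventionally computed from detector scores may help in reading the numbers above; the scores below are synthetic placeholders, not VoiceWukong data.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: the operating point where the false acceptance
    rate (spoofs accepted as real) equals the false rejection rate
    (bona fide rejected). Higher scores mean 'more likely real'."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([np.mean(spoof_scores >= t) for t in thresholds])
    frr = np.array([np.mean(bonafide_scores < t) for t in thresholds])
    idx = np.argmin(np.abs(far - frr))   # closest FAR/FRR crossing
    return (far[idx] + frr[idx]) / 2

# Toy demonstration with synthetic detector scores.
rng = np.random.default_rng(0)
real = rng.normal(1.0, 1.0, 1000)    # scores on bona fide utterances
fake = rng.normal(-1.0, 1.0, 1000)   # scores on deepfake utterances
print(f"EER: {compute_eer(real, fake):.2%}")
```

By this definition, AASIST2's 13.50% EER means that at the EER threshold about 13.50% of deepfakes are accepted as real and 13.50% of real voices are rejected.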
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - SafeEar: Content Privacy-Preserving Audio Deepfake Detection [17.859275594843965]
We propose SafeEar, a novel framework that aims to detect deepfake audios without relying on accessing the speech content within.
Our key idea is to turn a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples.
In this way, no semantic content will be exposed to the detector.
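The decoupling model is not specified in this summary, so the following is a purely hypothetical sketch of the stated idea: a codec-style encoder splits an utterance into semantic and acoustic streams, and only the acoustic stream ever reaches the detector. All module names, shapes, and layers are illustrative assumptions, not SafeEar's architecture.

```python
import torch
import torch.nn as nn

class DecoupledCodec(nn.Module):
    """Illustrative stand-in for a decoupling codec (hypothetical, not SafeEar)."""
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=10, stride=5)
        self.semantic_head = nn.Conv1d(dim, dim, 1)   # content (kept private)
        self.acoustic_head = nn.Conv1d(dim, dim, 1)   # timbre/prosody traits

    def forward(self, wav):                           # wav: (batch, 1, samples)
        z = self.encoder(wav)
        return self.semantic_head(z), self.acoustic_head(z)

codec = DecoupledCodec()
detector = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                         nn.Linear(128, 2))           # real-vs-fake logits

wav = torch.randn(4, 1, 16000)                        # one second at 16 kHz
_semantic, acoustic = codec(wav)                      # semantic stream discarded
logits = detector(acoustic)                           # detector sees acoustics only
print(logits.shape)                                   # torch.Size([4, 2])
```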
arXiv Detail & Related papers (2024-09-14T02:45:09Z) - Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024 [8.940008511570207]
The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices.
The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task.
This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER).
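The ensemble recipe itself is not detailed in this summary; a common approach, sketched below under that assumption, is score-level fusion, where each speech foundation model backbone yields a countermeasure score per utterance and the fused score is a weighted average.

```python
import numpy as np

def fuse_scores(model_scores, weights=None):
    """Score-level fusion: model_scores is (n_models, n_utterances);
    returns one fused score per utterance."""
    model_scores = np.asarray(model_scores, dtype=float)
    if weights is None:                                # default: plain average
        weights = np.full(model_scores.shape[0], 1.0 / model_scores.shape[0])
    return np.asarray(weights) @ model_scores          # weighted average

# Three hypothetical backbones scoring four utterances (higher = more real).
scores = [[0.9, 0.2, 0.7, 0.1],
          [0.8, 0.3, 0.6, 0.2],
          [0.7, 0.4, 0.9, 0.1]]
print(fuse_scores(scores))                             # [0.8 0.3 0.733 0.133]
```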
arXiv Detail & Related papers (2024-09-03T21:28:45Z) - Deepfake audio as a data augmentation technique for training automatic
speech to text transcription models [55.2480439325792]
We propose a framework for data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
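As a rough illustration of the augmentation idea, the sketch below extends a corpus with cloned audio for existing transcripts; `synthesize` is a hypothetical stand-in for whatever TTS or voice-cloning system is actually used, and none of the paper's data is reproduced.

```python
import random

def synthesize(text: str, speaker_id: int) -> bytes:
    """Hypothetical TTS stand-in: return cloned audio for `text`."""
    return f"<fake-audio:speaker{speaker_id}:{text}>".encode()

def augment_corpus(utterances, speakers, ratio=0.5, seed=0):
    """Extend (audio, transcript) pairs with synthetic copies of existing
    transcripts, adding acoustic variety without new annotation effort."""
    rng = random.Random(seed)
    augmented = list(utterances)
    for _audio, text in rng.sample(utterances, k=int(len(utterances) * ratio)):
        augmented.append((synthesize(text, rng.choice(speakers)), text))
    return augmented

corpus = [(b"<real-audio-1>", "turn on the lights"),
          (b"<real-audio-2>", "play some music")]
print(len(augment_corpus(corpus, speakers=[0, 1, 2], ratio=1.0)))  # 4
```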
arXiv Detail & Related papers (2023-09-22T11:33:03Z) - ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining our pipeline with Crubadan, a large n-gram database of endangered languages.
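A toy sketch of how the three components might compose, under heavy simplification: the acoustic model proposes scored phone sequences, a pronunciation lexicon maps them to words, and a unigram stand-in for a Crubadan-style n-gram model rescores. Everything below is illustrative, not the ASR2K pipeline.

```python
# All three "models" below are trivially small placeholders.
lexicon = {("h", "aj"): "hi", ("b", "aj"): "bye"}      # pronunciation model
lm = {"hi": 0.6, "bye": 0.4}                           # n-gram LM stand-in

def decode(phone_hypotheses):
    """Combine acoustic scores with lexicon lookup and LM rescoring,
    returning the best-scoring word hypothesis."""
    best = None
    for phones, am_score in phone_hypotheses:          # from the acoustic model
        word = lexicon.get(tuple(phones))
        if word is None:
            continue                                   # no pronunciation match
        score = am_score * lm[word]
        if best is None or score > best[1]:
            best = (word, score)
    return best

# The acoustic model proposes scored phone sequences for one utterance.
print(decode([(["h", "aj"], 0.7), (["b", "aj"], 0.6)]))  # ('hi', ~0.42)
```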
arXiv Detail & Related papers (2022-09-06T22:48:29Z) - System Fingerprint Recognition for Deepfake Audio: An Initial Dataset and Investigation [51.06875680387692]
We present the first deepfake audio dataset for system fingerprint recognition (SFR).
We collected the dataset from the speech synthesis systems of seven Chinese vendors that use the latest state-of-the-art deep learning technologies.
arXiv Detail & Related papers (2022-08-21T05:15:40Z) - On the pragmatism of using binary classifiers over data intensive neural network classifiers for detection of COVID-19 from voice [34.553128768223615]
We show that detecting COVID-19 from voice does not require custom-made non-standard features or complicated neural network classifiers.
We demonstrate this from a human-curated dataset collected and calibrated in clinical settings.
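The claim is that standard features and a simple binary classifier suffice; a generic sketch under that assumption pairs placeholder acoustic features with scikit-learn logistic regression (the features and labels below are synthetic, not the clinical dataset).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder features: in practice these would be standard acoustic
# descriptors (e.g., MFCC statistics) computed per voice recording.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))                            # 200 recordings, 13 features
y = (X[:, 0] + rng.normal(0, 0.5, 200) > 0).astype(int)   # synthetic labels

clf = LogisticRegression(max_iter=1000)                   # plain binary classifier
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```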
arXiv Detail & Related papers (2022-04-11T00:19:14Z) - Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
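A generic sketch of a contrastive objective over paired face/audio embeddings, in the spirit of the identity-embedding idea above; the encoders are omitted and the embeddings are random placeholders, so this shows only the loss, not the paper's method.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(face_emb, audio_emb, temperature=0.07):
    """InfoNCE-style loss: the matching face/audio pair from the same
    clip is pulled together; all other pairings in the batch are pushed apart."""
    face_emb = F.normalize(face_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = face_emb @ audio_emb.T / temperature      # pairwise similarities
    targets = torch.arange(face_emb.size(0))           # diagonal = true pairs
    return F.cross_entropy(logits, targets)

face = torch.randn(8, 256)     # placeholder face-encoder embeddings
audio = torch.randn(8, 256)    # placeholder audio-encoder embeddings
print(contrastive_loss(face, audio).item())
```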
arXiv Detail & Related papers (2022-04-06T20:51:40Z) - Does Audio Deepfake Detection Generalize? [6.415366195115544]
We systematize audio spoofing detection by re-implementing and uniformly evaluating architectures from related work.
We publish a new dataset consisting of 37.9 hours of found audio recordings of celebrities and politicians, of which 17.2 hours are deepfakes.
On this in-the-wild data, detection performance degrades markedly, which may suggest that the community has tailored its solutions too closely to the prevailing ASVSpoof benchmark and that deepfakes are much harder to detect outside the lab than previously thought.
arXiv Detail & Related papers (2022-03-30T12:48:22Z) - CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z) - Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors [18.862258543488355]
Deepfakes can cause security and privacy issues.
A new domain of cloning human voices using deep-learning technologies is also emerging.
A good deepfake detector therefore needs to handle deepfakes across multiple modalities.
arXiv Detail & Related papers (2021-09-07T11:00:20Z)