SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
- URL: http://arxiv.org/abs/2409.17285v1
- Date: Wed, 18 Sep 2024 23:17:02 GMT
- Title: SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
- Authors: Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe
- Abstract summary: SpoofCeleb is a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV).
We utilize source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems that were themselves trained on the same real-world data.
SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions.
- Score: 76.71096751337888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require training on speech recorded in varied acoustic environments with different levels of noise. However, existing datasets typically include clean, high-quality recordings (bona fide data) due to the requirements of TTS training; studio-quality or well-recorded read speech is typically necessary to train TTS models. Existing SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. We present SpoofCeleb, which leverages a fully automated pipeline that processes the VoxCeleb1 dataset, transforming it into a form suitable for TTS training. We subsequently train 23 contemporary TTS systems. The resulting SpoofCeleb dataset comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We provide baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at https://jungjee.github.io/spoofceleb.
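SDD baselines of this kind are conventionally scored with the equal error rate (EER), the operating point where the false acceptance rate on spoofed trials equals the false rejection rate on bona fide trials. Below is a minimal sketch of such an evaluation; the toy scores stand in for a detector's outputs and do not reflect the dataset's actual protocol files or baseline numbers.

```python
# Minimal EER computation for a speech deepfake detection (SDD) baseline.
# Convention assumed here: higher scores mean "more likely bona fide".
import numpy as np

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Return the equal error rate (EER) given detection scores."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(bonafide_scores),
                             np.zeros_like(spoof_scores)])
    order = np.argsort(scores)          # sweep thresholds over sorted scores
    labels = labels[order]
    # False rejection rate: bona fide trials at or below each threshold.
    frr = np.cumsum(labels) / labels.sum()
    # False acceptance rate: spoof trials above each threshold.
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(frr - far))  # point where the two rates cross
    return float((frr[idx] + far[idx]) / 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bona = rng.normal(1.0, 1.0, 1000)    # toy bona fide scores
    spoof = rng.normal(-1.0, 1.0, 1000)  # toy spoof scores
    print(f"EER: {compute_eer(bona, spoof):.2%}")
```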
Related papers
- Text-To-Speech Synthesis In The Wild [76.71096751337888]
Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms.
We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, applied to the VoxCeleb1 dataset commonly used for speaker recognition.
We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard.
arXiv Detail & Related papers (2024-09-13T10:58:55Z)
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) synthesis.
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
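For intuition, the two-pass decomposition can be pictured as chaining an S2TT model with a TTS model, as in the sketch below. The interfaces are hypothetical stand-ins; ComSpeech itself integrates the pretrained components into a single direct model rather than running a naive cascade.

```python
# Sketch of the two-pass S2ST decomposition: source speech -> target text
# (S2TT), then target text -> target speech (TTS). All interfaces here are
# illustrative assumptions, not ComSpeech's actual architecture.
from typing import Protocol

class S2TTModel(Protocol):
    def translate(self, source_speech: bytes) -> str: ...

class TTSModel(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class TwoPassS2ST:
    """Chains a pretrained S2TT model with a pretrained TTS model."""
    def __init__(self, s2tt: S2TTModel, tts: TTSModel):
        self.s2tt = s2tt
        self.tts = tts

    def translate_speech(self, source_speech: bytes) -> bytes:
        target_text = self.s2tt.translate(source_speech)  # first pass
        return self.tts.synthesize(target_text)           # second pass
```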
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- Multi-speaker Text-to-speech Training with Speaker Anonymized Data [40.70515431989197]
We investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA).
Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset.
We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS model trained using those data.
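For a rough feel of what a signal-processing-based SA method does, the sketch below applies a fixed, speaker-specific pitch shift to every utterance of a speaker. This is an illustrative stand-in, not one of the paper's actual anonymization methods, and it assumes librosa and soundfile are installed.

```python
# Illustrative signal-processing-style speaker anonymization: shift every
# utterance of a speaker by the same random (but speaker-deterministic)
# number of semitones. A stand-in for intuition, not the paper's method.
import os
import zlib
import numpy as np
import librosa
import soundfile as sf

def anonymize_speaker(wav_paths, speaker_id: str, out_dir: str) -> None:
    # Stable per-speaker seed so the same speaker always gets the same shift.
    rng = np.random.default_rng(zlib.crc32(speaker_id.encode()))
    n_steps = float(rng.uniform(-4.0, 4.0))  # shift in semitones
    for path in wav_paths:
        y, sr = librosa.load(path, sr=None)
        y_anon = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
        out_path = os.path.join(out_dir, f"{speaker_id}_{os.path.basename(path)}")
        sf.write(out_path, y_anon, sr)
```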
arXiv Detail & Related papers (2024-05-20T03:55:44Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)