Synth2Aug: Cross-domain speaker recognition with TTS synthesized speech
- URL: http://arxiv.org/abs/2011.11818v1
- Date: Tue, 24 Nov 2020 00:48:54 GMT
- Title: Synth2Aug: Cross-domain speaker recognition with TTS synthesized speech
- Authors: Yiling Huang, Yutian Chen, Jason Pelecanos, Quan Wang
- Abstract summary: We investigate the use of a multi-speaker Text-To-Speech system to synthesize speech in support of speaker recognition.
We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance.
We also explore the effectiveness of different types of text transcripts used for TTS synthesis.
- Score: 8.465993273653554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, Text-To-Speech (TTS) has been used as a data augmentation
technique for speech recognition to help complement inadequacies in the
training data. Correspondingly, we investigate the use of a multi-speaker TTS
system to synthesize speech in support of speaker recognition. In this study we
focus the analysis on tasks where a relatively small number of speakers is
available for training. We observe on our datasets that TTS synthesized speech
improves cross-domain speaker recognition performance and can be combined
effectively with multi-style training. Additionally, we explore the
effectiveness of different types of text transcripts used for TTS synthesis.
Results suggest that matching the textual content of the target domain is a
good practice, and if that is not feasible, a transcript with a sufficiently
large vocabulary is recommended.
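To make the augmentation recipe concrete, here is a minimal Python sketch of the general idea: utterances synthesized by a multi-speaker TTS system are added as extra training speakers alongside the real data. All names here, including the `synthesize` stand-in, are hypothetical; this is an illustration of the technique, not the paper's implementation.

```python
# Minimal sketch: augment a speaker-recognition training set with TTS speech.
# `synthesize` is a placeholder for a real multi-speaker TTS system; here it
# returns random audio so the example runs end to end.
import random

import numpy as np

SAMPLE_RATE = 16000

def synthesize(text: str, speaker_id: int, seconds: float = 3.0) -> np.ndarray:
    """Hypothetical stand-in for a multi-speaker TTS model."""
    rng = np.random.default_rng(hash((text, speaker_id)) % 2**32)
    return rng.standard_normal(int(seconds * SAMPLE_RATE)).astype(np.float32)

def build_augmented_set(real_utts, transcripts, synth_speakers, per_speaker=5):
    """Mix real utterances with TTS utterances from extra (synthetic) speakers.

    real_utts:      list of (waveform, speaker_label) pairs from the source domain
    transcripts:    text pool; per the paper, text matched to the target domain
                    works best, else a large-vocabulary pool is the fallback
    synth_speakers: TTS voice ids treated as additional training speakers
    """
    augmented = list(real_utts)
    next_label = 1 + max(label for _, label in real_utts)
    for voice in synth_speakers:
        label, next_label = next_label, next_label + 1
        for text in random.sample(transcripts, per_speaker):
            augmented.append((synthesize(text, voice), label))
    return augmented

# Toy usage: 2 real speakers, 3 synthetic voices, 2 transcripts per voice.
real = [(np.zeros(SAMPLE_RATE, dtype=np.float32), s) for s in (0, 1)]
texts = ["call my agent", "set a timer", "play some jazz", "what time is it"]
data = build_augmented_set(real, texts, synth_speakers=[101, 102, 103], per_speaker=2)
print(len(data), "training utterances")  # 2 real + 6 synthetic
```

Note the transcript pool: per the abstract, text matched to the target domain is preferable, and a sufficiently large vocabulary is the fallback when matched text is unavailable.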
Related papers
- Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT [29.167336994990542]
Cross-dialect text-to-speech (CD-TTS) is the task of synthesizing learned speakers' voices in non-native dialects.
We present a novel TTS model comprising three sub-modules that performs competitively on this task.
arXiv Detail & Related papers (2024-09-11T13:40:27Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR; a minimal sketch of this contrastive pairing appears after this list.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Towards Selection of Text-to-speech Data to Augment ASR Training [20.115236045164355]
We train a neural network to measure the similarity of synthetic data to real speech.
We find that incorporating synthetic samples with considerable dissimilarity to real speech is crucial for boosting recognition performance.
arXiv Detail & Related papers (2023-05-30T17:24:28Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems on several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits applying speech synthesis and voice conversion to improve ASR systems, and that promising ASR training results can be obtained with this data augmentation method using only a single real speaker in the target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale, text-labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to better understand why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system can achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
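Of the entries above, the CTAP summary describes the most concrete mechanism: two encoders that pull paired phoneme and speech representations into one joint space. The sketch below shows a symmetric InfoNCE-style contrastive loss of the kind such a setup typically uses; it is a hypothetical PyTorch illustration under that assumption, not the authors' code.

```python
# Sketch of an InfoNCE-style contrastive loss pairing phoneme and speech
# embeddings in a shared space, in the spirit of CTAP (not the paper's code).
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, phone_emb, temperature=0.07):
    """speech_emb, phone_emb: (batch, dim) outputs of two separate encoders.

    Row i of each tensor comes from the same utterance (a positive pair);
    all other rows in the batch serve as negatives.
    """
    s = F.normalize(speech_emb, dim=-1)
    p = F.normalize(phone_emb, dim=-1)
    logits = s @ p.t() / temperature       # cosine-similarity matrix
    targets = torch.arange(s.size(0))      # positives lie on the diagonal
    # Symmetric loss: match speech -> phoneme and phoneme -> speech.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random "encoder outputs" for a batch of 8 utterances.
speech = torch.randn(8, 256)   # stand-in for a speech-encoder output
phones = torch.randn(8, 256)   # stand-in for a phoneme-encoder output
print(contrastive_loss(speech, phones).item())
```

The diagonal of the similarity matrix holds the matched phoneme-speech pairs, so each modality learns to rank its own pair above every other example in the batch.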