A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
- URL: http://arxiv.org/abs/2303.02719v2
- Date: Mon, 10 Jul 2023 15:15:47 GMT
- Title: A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
- Authors: Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely
- Abstract summary: We show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS.
Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS.
- Score: 12.53269106994881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has explored using self-supervised learning (SSL) speech
representations such as wav2vec2.0 as the representation medium in standard
two-stage TTS, in place of conventionally used mel-spectrograms. It is however
unclear which speech SSL is the better fit for TTS, and whether or not the
performance differs between read and spontaneous TTS, the latter of which is
arguably more challenging. This study aims at addressing these questions by
testing several speech SSLs, including different layers of the same SSL, in
two-stage TTS on both read and spontaneous corpora, while maintaining constant
TTS model architecture and training settings. Results from listening tests show
that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other
tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work
sheds light on both how speech SSL can readily improve current TTS systems, and
how SSLs compare in the challenging generative task of TTS. Audio examples can
be found at https://www.speech.kth.se/tts-demos/ssr_tts
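To make the two-stage setup concrete, below is a minimal sketch of extracting 9th-layer features from a 12-layer, ASR-finetuned wav2vec2.0 with the HuggingFace transformers library. The checkpoint name is an assumption standing in for whichever model the paper used, and the downstream acoustic model and vocoder are only indicated in comments.

```python
# Minimal sketch (assumed checkpoint): frame-level features from transformer
# layer 9 of a 12-layer, ASR-finetuned wav2vec2.0, used in place of the
# mel-spectrogram as the intermediate representation in two-stage TTS.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CKPT = "facebook/wav2vec2-base-960h"  # assumed 12-layer ASR-finetuned model
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
model = Wav2Vec2Model.from_pretrained(CKPT).eval()

def layer9_features(waveform_16khz: torch.Tensor) -> torch.Tensor:
    """Return (batch, frames, 768) features from transformer layer 9."""
    inputs = extractor(waveform_16khz.numpy(), sampling_rate=16_000,
                       return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the CNN/embedding output, so index 9 is the
    # output of the 9th transformer layer.
    return out.hidden_states[9]

# The acoustic model is then trained to predict these features from text,
# and a vocoder (e.g. HiFi-GAN retrained on them) maps features to audio.
```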
Related papers
- SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS [18.701864254184308]
Self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS.
In this study, we introduce SSL-TTS, a lightweight and efficient zero-shot TTS framework trained on transcribed speech from a single speaker.
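The kNN retrieval named in the title is not detailed in this summary. As a hedged illustration only, the sketch below shows the generic frame-level kNN idea over SSL features (familiar from kNN-VC-style methods): each output frame is replaced by the mean of its k nearest frames from a target speaker's feature pool. All names are illustrative, and SSL-TTS's exact formulation may differ.

```python
# Illustrative sketch of frame-level kNN retrieval over SSL features;
# not SSL-TTS's exact method.
import torch
import torch.nn.functional as F

def knn_convert(source: torch.Tensor, target_pool: torch.Tensor,
                k: int = 4) -> torch.Tensor:
    """source: (T, D) SSL frames; target_pool: (N, D) target-speaker frames."""
    # Cosine similarity between every source frame and every pool frame.
    sims = F.normalize(source, dim=-1) @ F.normalize(target_pool, dim=-1).T
    idx = sims.topk(k, dim=-1).indices      # (T, k) nearest pool frames
    return target_pool[idx].mean(dim=1)     # (T, D) converted frames
```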
arXiv Detail & Related papers (2024-08-20T12:09:58Z)
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS).
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis [12.53269106994881]
Self-supervised learning (SSL) speech representations learned from large amounts of diverse, mixed-quality speech data without transcriptions are gaining ground in many speech technology applications.
We show that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech.
We extend the scope of comparison for SSL in spontaneous TTS to 6 different SSLs and 3 layers within each SSL.
arXiv Detail & Related papers (2023-07-11T09:22:10Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Enhancing Speech-to-Speech Translation with Multiple TTS Targets [62.18395387305803]
We analyze the effect of changing synthesized target speech for direct S2ST models.
We propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems.
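As a rough illustration of the multi-target idea, the sketch below sums one loss term per TTS-synthesized target. The L1 loss, the per-target weighting, and the assumption that every target is time-aligned with the prediction are simplifications for illustration, not the paper's exact recipe.

```python
# Hedged sketch: one loss term per synthesized target from a different
# TTS system, combined into a single multi-task objective.
import torch
import torch.nn.functional as F

def multi_target_loss(prediction: torch.Tensor,
                      targets: list[torch.Tensor],
                      weights: list[float]) -> torch.Tensor:
    """Weighted sum of per-target L1 losses on acoustic features."""
    return sum(w * F.l1_loss(prediction, t)
               for w, t in zip(weights, targets))
```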
arXiv Detail & Related papers (2023-04-10T14:33:33Z)
- The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934]
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, though they may fail on certain types of datasets.
arXiv Detail & Related papers (2022-09-26T15:21:06Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
During adaptation, we use untranscribed speech data for speech reconstruction and fine-tune only the TTS decoder.
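A minimal sketch of that adaptation step, assuming a model object with hypothetical `mel_encoder` and `decoder` attributes: every parameter is frozen except the decoder, and the model is trained to reconstruct mel-spectrograms from untranscribed speech.

```python
# Hedged sketch of decoder-only adaptation on untranscribed speech;
# `tts_model.mel_encoder` and `tts_model.decoder` are hypothetical names.
import torch
import torch.nn.functional as F

def adaptation_step(tts_model, mels: torch.Tensor,
                    optimizer: torch.optim.Optimizer) -> torch.Tensor:
    # Freeze everything, then unfreeze only the decoder.
    for p in tts_model.parameters():
        p.requires_grad = False
    for p in tts_model.decoder.parameters():
        p.requires_grad = True

    hidden = tts_model.mel_encoder(mels)   # speech in, no transcript needed
    recon = tts_model.decoder(hidden)      # reconstruct the mel-spectrogram
    loss = F.l1_loss(recon, mels)
    loss.backward()
    optimizer.step()                       # optimizer built over decoder params
    optimizer.zero_grad()
    return loss.detach()
```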
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
- Learning Speaker Embedding from Text-to-Speech [59.80309164404974]
We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion.
We investigated training TTS from either manual or ASR-generated transcripts.
Unsupervised TTS embeddings improved EER by 2.06% absolute compared to i-vectors on the LibriTTS dataset.
arXiv Detail & Related papers (2020-10-21T18:03:16Z)