A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
- URL: http://arxiv.org/abs/2303.02719v2
- Date: Mon, 10 Jul 2023 15:15:47 GMT
- Title: A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
- Authors: Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely
- Abstract summary: We show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS.
Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS.
- Score: 12.53269106994881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has explored using self-supervised learning (SSL) speech
representations such as wav2vec2.0 as the representation medium in standard
two-stage TTS, in place of conventionally used mel-spectrograms. It is however
unclear which speech SSL is the better fit for TTS, and whether or not the
performance differs between read and spontaneous TTS, the latter of which is
arguably more challenging. This study aims at addressing these questions by
testing several speech SSLs, including different layers of the same SSL, in
two-stage TTS on both read and spontaneous corpora, while maintaining constant
TTS model architecture and training settings. Results from listening tests show
that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other
tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work
sheds light on both how speech SSL can readily improve current TTS systems, and
how SSLs compare in the challenging generative task of TTS. Audio examples can
be found at https://www.speech.kth.se/tts-demos/ssr_tts
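To make the two-stage setup concrete, below is a minimal sketch of extracting 9th-layer features from a 12-layer, ASR-finetuned wav2vec2.0 with the HuggingFace transformers library. The checkpoint name is an assumption standing in for whichever model the paper used, and the downstream acoustic model and vocoder are only indicated in comments.

```python
# Minimal sketch (assumed checkpoint): frame-level features from transformer
# layer 9 of a 12-layer, ASR-finetuned wav2vec2.0, used in place of the
# mel-spectrogram as the intermediate representation in two-stage TTS.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CKPT = "facebook/wav2vec2-base-960h"  # assumed 12-layer ASR-finetuned model
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
model = Wav2Vec2Model.from_pretrained(CKPT).eval()

def layer9_features(waveform_16khz: torch.Tensor) -> torch.Tensor:
    """Return (batch, frames, 768) features from transformer layer 9."""
    inputs = extractor(waveform_16khz.numpy(), sampling_rate=16_000,
                       return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the CNN/embedding output, so index 9 is the
    # output of the 9th transformer layer.
    return out.hidden_states[9]

# The acoustic model is then trained to predict these features from text,
# and a vocoder (e.g. HiFi-GAN retrained on them) maps features to audio.
```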
Related papers
- SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS [18.701864254184308]
Self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS.
In this study, we introduce SSL-TTS, a lightweight and efficient zero-shot TTS framework trained on transcribed speech from a single speaker.
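The kNN retrieval named in the title is not detailed in this summary. As a hedged illustration only, the sketch below shows the generic frame-level kNN idea over SSL features (familiar from kNN-VC-style methods): each output frame is replaced by the mean of its k nearest frames from a target speaker's feature pool. All names are illustrative, and SSL-TTS's exact formulation may differ.

```python
# Illustrative sketch of frame-level kNN retrieval over SSL features;
# not SSL-TTS's exact method.
import torch
import torch.nn.functional as F

def knn_convert(source: torch.Tensor, target_pool: torch.Tensor,
                k: int = 4) -> torch.Tensor:
    """source: (T, D) SSL frames; target_pool: (N, D) target-speaker frames."""
    # Cosine similarity between every source frame and every pool frame.
    sims = F.normalize(source, dim=-1) @ F.normalize(target_pool, dim=-1).T
    idx = sims.topk(k, dim=-1).indices      # (T, k) nearest pool frames
    return target_pool[idx].mean(dim=1)     # (T, D) converted frames
```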
arXiv Detail & Related papers (2024-08-20T12:09:58Z)
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS).
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis [12.53269106994881]
Self-supervised learning (SSL) speech representations learned from large amounts of diverse, mixed-quality speech data without transcriptions are gaining ground in many speech technology applications.
We show that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech.
We extend the scope of comparison for SSL in spontaneous TTS to 6 different SSLs and 3 layers within each SSL.
arXiv Detail & Related papers (2023-07-11T09:22:10Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Enhancing Speech-to-Speech Translation with Multiple TTS Targets [62.18395387305803]
We analyze the effect of changing synthesized target speech for direct S2ST models.
We propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems.
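As a rough illustration of the multi-target idea, the sketch below sums one loss term per TTS-synthesized target. The L1 loss, the per-target weighting, and the assumption that every target is time-aligned with the prediction are simplifications for illustration, not the paper's exact recipe.

```python
# Hedged sketch: one loss term per synthesized target from a different
# TTS system, combined into a single multi-task objective.
import torch
import torch.nn.functional as F

def multi_target_loss(prediction: torch.Tensor,
                      targets: list[torch.Tensor],
                      weights: list[float]) -> torch.Tensor:
    """Weighted sum of per-target L1 losses on acoustic features."""
    return sum(w * F.l1_loss(prediction, t)
               for w, t in zip(weights, targets))
```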
arXiv Detail & Related papers (2023-04-10T14:33:33Z)
- The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934]
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, though they may fail on certain types of datasets.
arXiv Detail & Related papers (2022-09-26T15:21:06Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
During adaptation, we use untranscribed speech data for speech reconstruction and fine-tune only the TTS decoder.
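A minimal sketch of that adaptation step, assuming a model object with hypothetical `mel_encoder` and `decoder` attributes: every parameter is frozen except the decoder, and the model is trained to reconstruct mel-spectrograms from untranscribed speech.

```python
# Hedged sketch of decoder-only adaptation on untranscribed speech;
# `tts_model.mel_encoder` and `tts_model.decoder` are hypothetical names.
import torch
import torch.nn.functional as F

def adaptation_step(tts_model, mels: torch.Tensor,
                    optimizer: torch.optim.Optimizer) -> torch.Tensor:
    # Freeze everything, then unfreeze only the decoder.
    for p in tts_model.parameters():
        p.requires_grad = False
    for p in tts_model.decoder.parameters():
        p.requires_grad = True

    hidden = tts_model.mel_encoder(mels)   # speech in, no transcript needed
    recon = tts_model.decoder(hidden)      # reconstruct the mel-spectrogram
    loss = F.l1_loss(recon, mels)
    loss.backward()
    optimizer.step()                       # optimizer built over decoder params
    optimizer.zero_grad()
    return loss.detach()
```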
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
- Learning Speaker Embedding from Text-to-Speech [59.80309164404974]
We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion.
We investigated training TTS from either manual or ASR-generated transcripts.
Unsupervised TTS embeddings improved EER by 2.06% absolute compared to i-vectors on the LibriTTS dataset.
arXiv Detail & Related papers (2020-10-21T18:03:16Z)