On the Use of Self-Supervised Speech Representations in Spontaneous
Speech Synthesis
- URL: http://arxiv.org/abs/2307.05132v1
- Date: Tue, 11 Jul 2023 09:22:10 GMT
- Authors: Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely
- Abstract summary: Self-supervised learning (SSL) speech representations learned from large amounts of diverse, mixed-quality speech data without transcriptions are gaining ground in many speech technology applications.
Prior work has shown that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech.
We extend the scope of comparison for SSL in spontaneous TTS to 6 different SSLs and 3 layers within each SSL.
- Score: 12.53269106994881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) speech representations learned from large
amounts of diverse, mixed-quality speech data without transcriptions are
gaining ground in many speech technology applications. Prior work has shown
that SSL is an effective intermediate representation in two-stage
text-to-speech (TTS) for both read and spontaneous speech. However, it is still
not clear which SSL and which layer from each SSL model is most suited for
spontaneous TTS. We address this shortcoming by extending the scope of
comparison for SSL in spontaneous TTS to 6 different SSLs and 3 layers within
each SSL. Furthermore, SSL has also shown potential in predicting the mean
opinion scores (MOS) of synthesized speech, but this has only been done in
read-speech MOS prediction. We extend an SSL-based MOS prediction framework
previously developed for scoring read speech synthesis and evaluate its
performance on synthesized spontaneous speech. All experiments are conducted
twice on two different spontaneous corpora in order to find generalizable
trends. Overall, we present comprehensive experimental results on the use of
SSL in spontaneous TTS and MOS prediction to further quantify and understand
how SSL can be used in spontaneous TTS. Audio samples:
https://www.speech.kth.se/tts-demos/sp_ssl_tts
Related papers
- SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS [18.701864254184308]
Self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS.
In this study, we introduce SSL-TTS, a lightweight and efficient zero-shot TTS framework trained on transcribed speech from a single speaker.
(2024-08-20T12:09:58Z)
- What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis [44.93152068353389]
Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations.
Speaker SSL models adopt utterance-level training objectives primarily for speaker representation.
(2024-01-31T07:23:22Z)
- SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge? [45.901645659694935]
Self-supervised learning (SSL) for speech representation has been successfully applied in various downstream tasks.
In this paper, we aim to clarify if speech SSL techniques can well capture linguistic knowledge.
(2023-06-14T09:04:29Z)
- A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS [12.53269106994881]
We show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS.
Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS.
(2023-03-05T17:20:10Z)
- The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934]
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models could extract meaningful features of a wide range of non-speech audio, while they may also fail on certain types of datasets.
(2022-09-26T15:21:06Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedup up to 21.4x than autoregressive technique.
(2022-05-25T06:34:14Z)
- Audio Self-supervised Learning: A Survey [60.41768569891083]
Self-Supervised Learning (SSL) targets at discovering general representations from large-scale data without requiring human annotations.
Its success in the fields of computer vision and natural language processing have prompted its recent adoption into the field of audio and speech processing.
(2022-03-02T15:58:29Z)
- Sound and Visual Representation Learning with Multiple Pretraining Tasks [104.11800812671953]
Self-supervised tasks (SSL) reveal different features from the data.
This work aims to combine Multiple SSL tasks (Multi-SSL) that generalizes well for all downstream tasks.
Experiments on sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models.
(2022-01-04T09:09:38Z)
- LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech [63.84741259993937]
Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing.
Recent works also investigated SSL from speech.
We propose LeBenchmark: a reproducible framework for assessing SSL from speech.
(2021-04-23T08:27:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.