Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using
Prosodic and Linguistic Features
- URL: http://arxiv.org/abs/2211.00342v2
- Date: Sun, 7 May 2023 13:43:50 GMT
- Authors: Alexandra Vioni, Georgia Maniati, Nikolaos Ellinas, June Sig Sung,
Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis
- Abstract summary: Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models.
We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems.
All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
- Score: 54.48824266041105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current state-of-the-art methods for automatic synthetic speech evaluation
are based on MOS prediction neural models. Such MOS prediction models include
MOSNet and LDNet that use spectral features as input, and SSL-MOS that relies
on a pretrained self-supervised learning model that directly uses the speech
signal as input. In modern high-quality neural TTS systems, prosodic
appropriateness with regard to the spoken content is a decisive factor for
speech naturalness. For this reason, we propose to include prosodic and
linguistic features as additional inputs in MOS prediction systems, and
evaluate their impact on the prediction outcome. We consider phoneme level F0
and duration features as prosodic inputs, as well as Tacotron encoder outputs,
POS tags and BERT embeddings as higher-level linguistic inputs. All MOS
prediction systems are trained on SOMOS, a neural TTS-only dataset with
crowdsourced naturalness MOS evaluations. Results show that the proposed
additional features are beneficial in the MOS prediction task, by improving the
predicted MOS scores' correlation with the ground truths, both at
utterance-level and system-level predictions.
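The two evaluation granularities mentioned at the end of the abstract can be illustrated with a small sketch. The abstract does not say which correlation coefficient is used, so Pearson correlation is an assumption here, and the record fields (`system`, `true_mos`, `pred_mos`) and the numbers are made up for illustration; they are not SOMOS data. Utterance-level correlation compares scores utterance by utterance, while system-level correlation first averages the scores of each TTS system and then correlates the per-system means.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def utterance_level_correlation(records):
    """Correlate predicted and ground-truth MOS over individual utterances."""
    true = [r["true_mos"] for r in records]
    pred = [r["pred_mos"] for r in records]
    return pearson(true, pred)


def system_level_correlation(records):
    """Average the scores per TTS system, then correlate the system means."""
    by_system = defaultdict(lambda: {"true": [], "pred": []})
    for r in records:
        by_system[r["system"]]["true"].append(r["true_mos"])
        by_system[r["system"]]["pred"].append(r["pred_mos"])
    true_means = [sum(v["true"]) / len(v["true"]) for v in by_system.values()]
    pred_means = [sum(v["pred"]) / len(v["pred"]) for v in by_system.values()]
    return pearson(true_means, pred_means)


# Hypothetical records: three systems with two utterances each.
records = [
    {"system": "A", "true_mos": 3.0, "pred_mos": 3.3},
    {"system": "A", "true_mos": 3.4, "pred_mos": 3.1},
    {"system": "B", "true_mos": 4.0, "pred_mos": 4.1},
    {"system": "B", "true_mos": 4.2, "pred_mos": 3.9},
    {"system": "C", "true_mos": 2.0, "pred_mos": 2.5},
    {"system": "C", "true_mos": 2.4, "pred_mos": 2.3},
]

utt_corr = utterance_level_correlation(records)
sys_corr = system_level_correlation(records)
print(f"utterance-level: {utt_corr:.3f}, system-level: {sys_corr:.3f}")
```

In this toy data the predictor ranks the systems well but gets the ordering of utterances within each system wrong, so the system-level correlation comes out higher than the utterance-level one. Averaging per system smooths out per-utterance errors, which is one reason the two levels are reported separately.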
Related papers
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages [1.1852406625172218]
We train a MOS prediction model based on wav2vec 2.0 using the open-access data sets BVCC and SOMOS.
Our test with neural TTS data in the low-resource language (LRL) West Frisian shows that pre-training on BVCC before fine-tuning on SOMOS leads to the best accuracy for both fine-tuned and zero-shot prediction.
arXiv Detail & Related papers (2023-05-30T20:19:56Z)
- Improving Self-Supervised Learning-based MOS Prediction Networks [0.0]
The present work introduces data-, training- and post-training specific improvements to a previous self-supervised learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers.
The methods are evaluated using the shared synthetic speech dataset of the first Voice MOS challenge.
arXiv Detail & Related papers (2022-04-23T09:19:16Z)
- DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores [64.37977826069105]
Mean opinion score (MOS) is a typical subjective evaluation metric for speech synthesis systems.
In this work, we propose DDOS, a novel MOS prediction model.
DDOS utilizes domain adaptive pre-training to further pre-train self-supervised learning models on synthetic speech.
arXiv Detail & Related papers (2022-04-07T05:04:10Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech [15.796199345773873]
We propose a novel method to improve speech quality by training a TTS model under the supervision of perceptual loss.
We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech.
The proposed method can be applied independently regardless of the TTS model architecture or the cause of speech quality degradation.
arXiv Detail & Related papers (2020-11-02T18:13:48Z)
- Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning With Spoofing Detection and Spoofing Type Classification [16.43844160498413]
We propose a multi-task learning (MTL) method to improve the performance of a MOS prediction model.
Experiments using the Voice Conversion Challenge 2018 show that proposed MTL with two auxiliary tasks improves MOS prediction.
arXiv Detail & Related papers (2020-07-16T11:38:08Z)