Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages
- URL: http://arxiv.org/abs/2305.19396v1
- Date: Tue, 30 May 2023 20:19:56 GMT
- Title: Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages
- Authors: Phat Do, Matt Coler, Jelske Dijkstra, Esther Klabbers
- Abstract summary: We train a MOS prediction model based on wav2vec 2.0 using the open-access data sets BVCC and SOMOS.
Our test with neural TTS data in the low-resource language (LRL) West Frisian shows that pre-training on BVCC before fine-tuning on SOMOS leads to the best accuracy for both fine-tuned and zero-shot prediction.
- Score: 1.1852406625172218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We train a MOS prediction model based on wav2vec 2.0 using the open-access
data sets BVCC and SOMOS. Our test with neural TTS data in the low-resource
language (LRL) West Frisian shows that pre-training on BVCC before fine-tuning
on SOMOS leads to the best accuracy for both fine-tuned and zero-shot
prediction. Further fine-tuning experiments show that using more than 30
percent of the total data does not lead to significant improvements. In
addition, fine-tuning with data from a single listener shows promising
system-level accuracy, supporting the viability of one-participant pilot tests.
These findings can all assist the resource-conscious development of TTS for
LRLs by progressing towards better zero-shot MOS prediction and informing the
design of listening tests, especially in early-stage evaluation.
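As a rough illustration of this recipe, the sketch below wires a wav2vec 2.0 encoder to a scalar regression head (mean-pooled frame features and a linear layer are assumptions, not the paper's exact head) and trains it in two stages, first on BVCC and then on SOMOS. The checkpoint name, loss, and hyperparameters are placeholders.

```python
# Minimal sketch: wav2vec 2.0 features + linear head for utterance-level MOS regression,
# pre-trained on one MOS dataset (BVCC) and then fine-tuned on another (SOMOS).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MOSPredictor(nn.Module):
    def __init__(self, checkpoint="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, wav):                                # wav: (batch, samples) at 16 kHz
        frames = self.encoder(wav).last_hidden_state       # (batch, time, hidden)
        return self.head(frames.mean(dim=1)).squeeze(-1)   # one MOS estimate per utterance

def train_stage(model, loader, epochs=5, lr=1e-5):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(epochs):
        for wav, mos in loader:                            # loader yields (audio, MOS label)
            opt.zero_grad()
            loss_fn(model(wav), mos).backward()
            opt.step()

model = MOSPredictor()
# Stage 1: pre-train on BVCC; stage 2: fine-tune on SOMOS (loaders are placeholders):
# train_stage(model, bvcc_loader); train_stage(model, somos_loader)
```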
Related papers
- Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction [40.51248841706311]
This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings.
We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning models, such as wav2vec, correlate with VoiceMOS scores.
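One plausible instantiation of such a measure (the paper's exact uncertainty scores may differ) is the mean per-frame entropy of a pretrained wav2vec 2.0 CTC model's output distribution, computed with no MOS training at all:

```python
# Hedged sketch: frame-level predictive entropy from an off-the-shelf wav2vec 2.0
# CTC model as a zero-shot uncertainty proxy; higher entropy = less certain.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def utterance_uncertainty(wav: torch.Tensor) -> float:
    """Mean entropy of the per-frame CTC output distribution for a 16 kHz waveform."""
    with torch.no_grad():
        logits = model(wav.unsqueeze(0)).logits                   # (1, time, vocab)
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # (1, time)
    return entropy.mean().item()
```

Under the paper's finding, higher uncertainty should track lower VoiceMOS scores, so such a proxy can rank systems without any labeled MOS data.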
arXiv Detail & Related papers (2023-12-25T05:35:28Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
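The two baseline mechanics being combined can be sketched generically (this is not the paper's ASPEST algorithm, only the ingredients it unifies): abstain below a confidence threshold, and spend the labeling budget on the least-confident target-domain samples.

```python
# Generic sketch of selective prediction plus active querying (illustrative only).
import torch

def predict_or_abstain(model, x, threshold=0.9):
    probs = model(x).softmax(dim=-1)       # (batch, classes)
    conf, pred = probs.max(dim=-1)
    pred[conf < threshold] = -1            # -1 marks an abstention
    return pred

def select_queries(model, unlabeled_x, budget=32):
    with torch.no_grad():
        conf = model(unlabeled_x).softmax(dim=-1).max(dim=-1).values
    return conf.argsort()[:budget]         # least-confident indices to send for labeling
```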
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features [54.48824266041105]
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models.
We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems.
All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
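A minimal sketch of the fusion step, assuming an utterance-level SSL embedding and a small illustrative prosodic/linguistic feature vector (the paper's exact feature set is not reproduced here):

```python
# Sketch: concatenate an SSL speech embedding with utterance-level prosodic and
# linguistic features before the MOS regression head. Sizes are assumptions.
import torch
import torch.nn as nn

class ContentAwareMOSHead(nn.Module):
    def __init__(self, ssl_dim=768, aux_dim=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(ssl_dim + aux_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, ssl_embedding, aux_features):
        # aux_features: e.g. F0 mean/std, speaking rate, pause ratio, word count ...
        return self.mlp(torch.cat([ssl_embedding, aux_features], dim=-1)).squeeze(-1)
```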
arXiv Detail & Related papers (2022-11-01T09:18:50Z)
- Improving Self-Supervised Learning-based MOS Prediction Networks [0.0]
The present work introduces data-, training- and post-training-specific improvements to a previous self-supervised learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers.
The methods are evaluated using the shared synthetic speech dataset of the first Voice MOS challenge.
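The described stack maps onto a few lines of PyTorch; layer sizes below are assumptions, and "facebook/wav2vec2-base" stands in for the LibriSpeech-pretrained checkpoint:

```python
# Sketch of the stated architecture: wav2vec 2.0 frame features -> LSTM ->
# non-linear dense layers -> scalar MOS. Dimensions are illustrative.
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLMOSNet(nn.Module):
    def __init__(self, checkpoint="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        self.lstm = nn.LSTM(hidden, 128, batch_first=True, bidirectional=True)
        self.dense = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, wav):                               # wav: (batch, samples)
        frames = self.encoder(wav).last_hidden_state      # (batch, time, hidden)
        out, _ = self.lstm(frames)                        # (batch, time, 2*128)
        return self.dense(out.mean(dim=1)).squeeze(-1)    # utterance-level MOS
```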
arXiv Detail & Related papers (2022-04-23T09:19:16Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a voice built from the public-domain LJ Speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- Improving Neural Machine Translation by Denoising Training [95.96569884410137]
We present a simple and effective pretraining strategy, Denoising Training (DoT), for neural machine translation.
We update the model parameters with source- and target-side denoising tasks at the early stage and then tune the model normally.
Experiments show that DoT consistently improves neural machine translation performance across 12 bilingual and 16 multilingual directions.
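The schedule can be written schematically as below; `noise` and the two loss methods are hypothetical placeholders for the denoising objective (reconstruct clean text from a corrupted copy) and the translation objective, not a real library API.

```python
# Schematic DoT training loop: denoising updates early, normal tuning afterwards.
def train_dot(model, optimizer, batches, noise, switch_step=10_000):
    for step, (src, tgt) in enumerate(batches):
        if step < switch_step:
            # Early stage: reconstruct clean source/target text from noised copies.
            loss = (model.denoise_loss(noise(src), src)
                    + model.denoise_loss(noise(tgt), tgt))
        else:
            # Later stage: tune normally with the standard translation objective.
            loss = model.translation_loss(src, tgt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```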
arXiv Detail & Related papers (2022-01-19T00:11:38Z)
- Improving Neural Machine Translation by Bidirectional Training [85.64797317290349]
We present a simple and effective pretraining strategy, bidirectional training (BiT), for neural machine translation.
Specifically, we bidirectionally update the model parameters at the early stage and then tune the model normally.
Experimental results show that BiT pushes state-of-the-art neural machine translation performance significantly higher across 15 translation tasks on 8 language pairs.
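The bidirectional stage amounts to doubling the parallel data with direction-swapped pairs, as in this small sketch (a simplification of the paper's setup):

```python
# Sketch: build a bidirectional training set by adding direction-swapped pairs.
def bidirectional_pairs(pairs):
    """pairs: list of (src, tgt) sentence pairs -> data covering both directions."""
    return pairs + [(tgt, src) for src, tgt in pairs]

# Early stage: train on bidirectional_pairs(data); later stage: tune on data as usual.
```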
arXiv Detail & Related papers (2021-09-16T07:58:33Z)
- Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech [15.796199345773873]
We propose a novel method to improve speech quality by training a TTS model under the supervision of a perceptual loss.
We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech.
The proposed method can be applied independently regardless of the TTS model architecture or the cause of speech quality degradation.
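A hedged sketch of one training step under this objective, with `tts_model` and `mos_model` standing in for any differentiable TTS system and pre-trained MOS predictor:

```python
# Sketch: freeze the MOS predictor and use negated predicted MOS as a perceptual loss.
def perceptual_step(tts_model, mos_model, optimizer, text_batch):
    for p in mos_model.parameters():
        p.requires_grad_(False)        # the MOS predictor stays frozen
    wav = tts_model(text_batch)        # synthesized speech, differentiable w.r.t. TTS params
    loss = -mos_model(wav).mean()      # minimizing this maximizes predicted MOS
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```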
arXiv Detail & Related papers (2020-11-02T18:13:48Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
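The augmentation recipe reduces to mixing real and synthesized utterances, sketched below with `tts`, `real_data`, and `extra_texts` as placeholders:

```python
# Sketch: extend an ASR training set with TTS-synthesized speech for extra text.
def build_training_set(real_data, extra_texts, tts):
    synthetic = [(tts.synthesize(text), text) for text in extra_texts]
    return real_data + synthetic       # train the ASR model on the combined set
```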
arXiv Detail & Related papers (2020-05-14T17:24:57Z)