Improving Self-Supervised Learning-based MOS Prediction Networks
- URL: http://arxiv.org/abs/2204.11030v1
- Date: Sat, 23 Apr 2022 09:19:16 GMT
- Title: Improving Self-Supervised Learning-based MOS Prediction Networks
- Authors: Bálint Gyires-Tóth, Csaba Zainkó
- Abstract summary: The present work introduces data-, training- and post-training specific improvements to a previous self-supervised learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers.
The methods are evaluated using the shared synthetic speech dataset of the first Voice MOS challenge.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: MOS (Mean Opinion Score) is a subjective method used for the evaluation of a
system's quality. Telecommunications (for voice and video) and speech
synthesis systems (for generated speech) are a few of the many applications of
the method. While MOS tests are widely accepted, they are time-consuming and
costly since human input is required. In addition, since the systems and
subjects of the tests differ, the results are not directly comparable across
tests. On the other hand, a large number of previous tests allow us to train
machine learning models that are capable of predicting MOS values. By
automatically predicting MOS values, both of the aforementioned issues can be
resolved.
The present work introduces data-, training- and post-training specific
improvements to a previous self-supervised learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and
non-linear dense layers. We introduced transfer learning, target data
preprocessing, a two- and three-phase training method with different batch
formulations, dropout accumulation (for larger batch sizes), and quantization
of the predictions.
The methods are evaluated using the shared synthetic speech dataset of the
first Voice MOS challenge.
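The abstract describes the architecture only at a high level: a wav2vec 2.0 encoder pre-trained on LibriSpeech, followed by LSTM and non-linear dense layers, with the predictions quantized afterwards. The following is a minimal sketch of such a model, assuming PyTorch and the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name, layer sizes, bidirectional LSTM, mean pooling, dropout rate, and 0.5-point quantization grid are illustrative assumptions rather than the authors' configuration, and the transfer-learning, multi-phase training, and dropout-accumulation steps are not shown.

```python
# Minimal sketch: wav2vec 2.0 encoder + LSTM + non-linear dense head for MOS prediction.
# Assumptions (not from the paper): checkpoint name, layer sizes, bidirectional LSTM,
# mean pooling, dropout rate, and the 0.5-step quantization grid.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class MOSPredictor(nn.Module):
    def __init__(self, ssl_name="facebook/wav2vec2-base", lstm_hidden=128):
        super().__init__()
        # wav2vec 2.0 pre-trained on LibriSpeech provides frame-level features.
        self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)
        feat_dim = self.ssl.config.hidden_size
        # Recurrent layer over the SSL feature sequence.
        self.lstm = nn.LSTM(feat_dim, lstm_hidden, batch_first=True, bidirectional=True)
        # Non-linear dense layers mapping to a single MOS estimate.
        self.head = nn.Sequential(
            nn.Linear(2 * lstm_hidden, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 1),
        )

    def forward(self, waveform):                        # waveform: (batch, samples), 16 kHz mono
        feats = self.ssl(waveform).last_hidden_state    # (batch, frames, feat_dim)
        seq, _ = self.lstm(feats)                       # (batch, frames, 2*lstm_hidden)
        pooled = seq.mean(dim=1)                        # utterance-level embedding
        return self.head(pooled).squeeze(-1)            # (batch,) continuous MOS estimates


def quantize_mos(pred, step=0.5, lo=1.0, hi=5.0):
    """Snap continuous predictions to a fixed MOS grid (the step size is a placeholder)."""
    return torch.clamp(torch.round(pred / step) * step, lo, hi)


if __name__ == "__main__":
    model = MOSPredictor()
    audio = torch.randn(2, 16000)          # two dummy 1-second clips
    with torch.no_grad():
        print(quantize_mos(model(audio)))
```

In a multi-phase setup along these lines, one would typically first train the LSTM and dense head with the SSL encoder frozen and then fine-tune end to end; the paper's exact phase and batch formulations are not reproduced here.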
Related papers
- Uncertainty as a Predictor: Leveraging Self-Supervised Learning for
Zero-Shot MOS Prediction [40.51248841706311]
This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings.
We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning models, such as wav2vec, correlate with VoiceMOS scores.
arXiv Detail & Related papers (2023-12-25T05:35:28Z)
- Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model [28.32514067707762]
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net.
MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning.
The MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
arXiv Detail & Related papers (2023-08-18T02:36:21Z)
- Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features [54.48824266041105]
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models.
We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems.
All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
arXiv Detail & Related papers (2022-11-01T09:18:50Z)
- DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores [64.37977826069105]
Mean opinion score (MOS) is a typical subjective evaluation metric for speech synthesis systems.
In this work, we propose DDOS, a novel MOS prediction model.
DDOS utilizes domain adaptive pre-training to further pre-train self-supervised learning models on synthetic speech.
arXiv Detail & Related papers (2022-04-07T05:04:10Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model MEmoBERT for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
- Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning With Spoofing Detection and Spoofing Type Classification [16.43844160498413]
We propose a multi-task learning (MTL) method to improve the performance of a MOS prediction model.
Experiments on the Voice Conversion Challenge 2018 show that the proposed MTL with two auxiliary tasks improves MOS prediction (see the sketch after this list).
arXiv Detail & Related papers (2020-07-16T11:38:08Z)
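The last entry above names its technique (multi-task learning with spoofing detection and spoofing type classification as auxiliary tasks) but not its details. The sketch below, referenced from that entry, only illustrates the general pattern of a shared encoder with one regression head and two classification heads trained with a weighted loss; the encoder type, layer sizes, number of spoofing types, and loss weights are placeholders rather than the cited paper's setup.

```python
# Generic multi-task setup for MOS prediction with two auxiliary classification
# tasks (spoofing detection, spoofing type). All sizes and weights below are
# illustrative placeholders, not taken from the cited paper.
import torch
import torch.nn as nn


class MultiTaskMOS(nn.Module):
    def __init__(self, feat_dim=768, hidden=128, n_spoof_types=6):
        super().__init__()
        # Shared utterance encoder over precomputed frame-level features.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.mos_head = nn.Linear(hidden, 1)               # regression: MOS
        self.spoof_head = nn.Linear(hidden, 2)              # binary: spoofed / bona fide
        self.type_head = nn.Linear(hidden, n_spoof_types)   # spoofing type classes

    def forward(self, feats):                   # feats: (batch, frames, feat_dim)
        _, h = self.encoder(feats)              # h: (1, batch, hidden)
        h = h.squeeze(0)
        return self.mos_head(h).squeeze(-1), self.spoof_head(h), self.type_head(h)


def multitask_loss(outputs, targets, w_mos=1.0, w_spoof=0.5, w_type=0.5):
    """Weighted sum of the main regression loss and the two auxiliary losses."""
    mos_pred, spoof_logits, type_logits = outputs
    mos_true, spoof_true, type_true = targets
    return (w_mos * nn.functional.mse_loss(mos_pred, mos_true)
            + w_spoof * nn.functional.cross_entropy(spoof_logits, spoof_true)
            + w_type * nn.functional.cross_entropy(type_logits, type_true))
```

The auxiliary heads are used only during training to shape the shared representation; at inference time, only the MOS head's output is needed.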