DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training
and Distribution of Opinion Scores
- URL: http://arxiv.org/abs/2204.03219v1
- Date: Thu, 7 Apr 2022 05:04:10 GMT
- Title: DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training
and Distribution of Opinion Scores
- Authors: Wei-Cheng Tseng, Wei-Tsung Kao, Hung-yi Lee
- Abstract summary: Mean opinion score (MOS) is a typical subjective evaluation metric for speech synthesis systems.
In this work, we propose DDOS, a novel MOS prediction model.
DDOS utilizes domain adaptive pre-training to further pre-train self-supervised learning models on synthetic speech.
- Score: 64.37977826069105
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Mean opinion score (MOS) is a typical subjective evaluation metric for speech
synthesis systems. Since collecting MOS is time-consuming, accurate MOS
prediction models for automatic evaluation are desirable.
In this work, we propose DDOS, a novel MOS prediction model. DDOS utilizes
domain adaptive pre-training to further pre-train self-supervised learning
models on synthetic speech. A proposed module is also added to model the opinion
score distribution of each utterance. With the proposed components, DDOS
outperforms previous works on the BVCC dataset, and the zero-shot transfer result
on the BC2019 dataset is significantly improved. DDOS also won second place in
the Interspeech 2022 VoiceMOS Challenge in terms of system-level score.
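The quantities named in the abstract (per-utterance MOS, the distribution of opinion scores for an utterance, and a system-level score) can be illustrated with a minimal sketch. This is not the DDOS model: the ratings, the system and utterance identifiers, and the function names below are all hypothetical, and the system-level score is assumed here to be the average of a system's utterance-level scores, which is a common convention.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical listener ratings on a 1-5 scale, keyed by (system_id, utterance_id).
ratings = {
    ("sysA", "utt1"): [4, 5, 4, 3],
    ("sysA", "utt2"): [3, 3, 4, 4],
    ("sysB", "utt3"): [2, 1, 2, 3],
}


def utterance_mos(scores):
    """MOS of one utterance: the mean of its listener ratings."""
    return mean(scores)


def score_distribution(scores, levels=(1, 2, 3, 4, 5)):
    """Fraction of ratings at each level (the opinion score distribution)."""
    total = len(scores)
    return [scores.count(level) / total for level in levels]


def system_level_mos(all_ratings):
    """Assumed convention: average the utterance-level MOS within each system."""
    per_system = defaultdict(list)
    for (system_id, _utterance_id), scores in all_ratings.items():
        per_system[system_id].append(utterance_mos(scores))
    return {system_id: mean(values) for system_id, values in per_system.items()}


if __name__ == "__main__":
    for key, scores in ratings.items():
        print(key, utterance_mos(scores), score_distribution(scores))
    print(system_level_mos(ratings))
```

Two utterances with the same MOS can have very different rating distributions, which is the information a distribution-modeling module has access to and a plain mean discards.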
Related papers
- MOSPC: MOS Prediction Based on Pairwise Comparison [32.55704173124071]
Mean opinion score (MOS) is a subjective metric to evaluate the quality of synthesized speech.
We propose a general framework for MOS prediction based on pairwise comparison (MOSPC).
Our framework surpasses the strong baseline in ranking accuracy on each fine-grained segment.
arXiv Detail & Related papers (2023-06-18T07:38:17Z)
- Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using
Prosodic and Linguistic Features [54.48824266041105]
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models.
We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems.
All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
arXiv Detail & Related papers (2022-11-01T09:18:50Z)
- Improving Self-Supervised Learning-based MOS Prediction Networks [0.0]
The present work introduces data-, training- and post-training specific improvements to a previous self-supervised learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers.
The methods are evaluated using the shared synthetic speech dataset of the first Voice MOS challenge.
arXiv Detail & Related papers (2022-04-23T09:19:16Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural
Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
- Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning
With Spoofing Detection and Spoofing Type Classification [16.43844160498413]
We propose a multi-task learning (MTL) method to improve the performance of a MOS prediction model.
Experiments using the Voice Conversion Challenge 2018 show that the proposed MTL with two auxiliary tasks improves MOS prediction.
arXiv Detail & Related papers (2020-07-16T11:38:08Z)
- Semi-Supervised Models via Data Augmentation for Classifying Interactive
Affective Responses [85.04362095899656]
We present semi-supervised models with data augmentation (SMDA), a semi-supervised text classification system to classify interactive affective responses.
For labeled sentences, we performed data augmentation to make the label distribution uniform and computed a supervised loss during training.
For unlabeled sentences, we explored self-training by regarding low-entropy predictions over unlabeled sentences as pseudo labels.
arXiv Detail & Related papers (2020-04-23T05:02:31Z)