Related papers: DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores

DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores

URL: http://arxiv.org/abs/2204.03219v1
Date: Thu, 7 Apr 2022 05:04:10 GMT
Title: DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores
Authors: Wei-Cheng Tseng, Wei-Tsung Kao, Hung-yi Lee
Abstract summary: Mean opinion score (MOS) is a typical subjective evaluation metric for speech synthesis systems. In this work, we propose DDOS, a novel MOS prediction model. DDOS utilizes domain adaptive pre-training to further pre-train self-supervised learning models on synthetic speech.
Score: 64.37977826069105
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Mean opinion score (MOS) is a typical subjective evaluation metric for speech synthesis systems. Since collecting MOS is time-consuming, it would be desirable if there are accurate MOS prediction models for automatic evaluation. In this work, we propose DDOS, a novel MOS prediction model. DDOS utilizes domain adaptive pre-training to further pre-train self-supervised learning models on synthetic speech. And a proposed module is added to model the opinion score distribution of each utterance. With the proposed components, DDOS outperforms previous works on BVCC dataset. And the zero shot transfer result on BC2019 dataset is significantly improved. DDOS also wins second place in Interspeech 2022 VoiceMOS challenge in terms of system-level score.

Related papers

From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling [66.22134521383909]
We introduce a unified benchmark that reformulates diverse MOS datasets into a preference-comparison setting.<n>Building on MOS-RMBench, we systematically construct and evaluate three paradigms for reward modeling.<n>Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) most models perform considerably worse on synthetic speech than on human speech; and (3) all models struggle on pairs with very small MOS differences.<n> Experimental results show that the MOS-aware GRM significantly improves fine-grained quality discrimination and narrows the gap with scalar models on the most challenging cases.
arXiv Detail & Related papers (2025-10-01T10:27:51Z)
MOSPC: MOS Prediction Based on Pairwise Comparison [32.55704173124071]
Mean opinion score (MOS) is a subjective metric to evaluate the quality of synthesized speech. We propose a general framework for MOS prediction based on pair comparison (MOSPC) Our framework surpasses the strong baseline in ranking accuracy on each fine-grained segment.
arXiv Detail & Related papers (2023-06-18T07:38:17Z)
Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features [54.48824266041105]
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
arXiv Detail & Related papers (2022-11-01T09:18:50Z)
Improving Self-Supervised Learning-based MOS Prediction Networks [0.0]
The present work introduces data-, training- and post-training specific improvements to a previous self-supervised learning-based MOS prediction model. We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers. The methods are evaluated using the shared synthetic speech dataset of the first Voice MOS challenge.
arXiv Detail & Related papers (2022-04-23T09:19:16Z)
SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples. It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction. We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data. Subjective tests, on the other hand, do not need any additional clean or parallel data and correlates better to human perception. We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning With Spoofing Detection and Spoofing Type Classification [16.43844160498413]
We propose a multi-task learning (MTL) method to improve the performance of a MOS prediction model. Experiments using the Voice Conversion Challenge 2018 show that proposed MTL with two auxiliary tasks improves MOS prediction.
arXiv Detail & Related papers (2020-07-16T11:38:08Z)
Semi-Supervised Models via Data Augmentationfor Classifying Interactive Affective Responses [85.04362095899656]
We present semi-supervised models with data augmentation (SMDA), a semi-supervised text classification system to classify interactive affective responses. For labeled sentences, we performed data augmentations to uniform the label distributions and computed supervised loss during training process. For unlabeled sentences, we explored self-training by regarding low-entropy predictions over unlabeled sentences as pseudo labels.
arXiv Detail & Related papers (2020-04-23T05:02:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.