MOSPC: MOS Prediction Based on Pairwise Comparison
- URL: http://arxiv.org/abs/2306.10493v1
- Date: Sun, 18 Jun 2023 07:38:17 GMT
- Title: MOSPC: MOS Prediction Based on Pairwise Comparison
- Authors: Kexin Wang, Yunlong Zhao, Qianqian Dong, Tom Ko, Mingxuan Wang
- Abstract summary: Mean opinion score (MOS) is a subjective metric to evaluate the quality of synthesized speech.
We propose a general framework for MOS prediction based on pair comparison (MOSPC).
Our framework surpasses the strong baseline in ranking accuracy on each fine-grained segment.
- Score: 32.55704173124071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a subjective metric to evaluate the quality of synthesized speech, mean
opinion score (MOS) usually requires multiple annotators to score the same
speech. Such an annotation approach requires a lot of manpower and is also
time-consuming. A MOS prediction model for automatic evaluation can significantly
reduce labor cost. In previous works, it is difficult to accurately rank the
quality of speech when the MOS scores are close. However, in practical
applications, it is more important to correctly rank the quality of synthesis
systems or sentences than simply predicting MOS scores. Meanwhile, as each
annotator scores multiple audio samples during annotation, each score is probably a
relative value based on the first score or the first few scores given by the
annotator. Motivated by the above two points, we propose a general framework
for MOS prediction based on pair comparison (MOSPC), and we utilize the C-Mixup
algorithm to enhance the generalization performance of MOSPC. The experiments
on BVCC and VCC2018 show that our framework outperforms the baselines on most
of the correlation coefficient metrics, especially on the metric KTAU related
to quality ranking. Our framework also surpasses the strong baseline in
ranking accuracy on each fine-grained segment. These results indicate that our
framework contributes to improving the ranking accuracy of speech quality.
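The abstract evaluates quality ranking through pairwise ordering and the KTAU (Kendall's tau) metric. The sketch below is not the authors' code; it only illustrates how pairwise comparison labels, a simple Kendall tau-a, and a pairwise ranking accuracy can be derived from MOS values. The function names and the toy scores are hypothetical, and the paper may use a different tau variant (for example, one with tie correction).

```python
from itertools import combinations


def pairwise_labels(scores):
    """Map each index pair (i, j), i < j, to +1, -1, or 0.

    +1 means item i is scored higher, -1 means item j is scored higher,
    and 0 marks a tie.
    """
    labels = {}
    for i, j in combinations(range(len(scores)), 2):
        diff = scores[i] - scores[j]
        labels[(i, j)] = (diff > 0) - (diff < 0)
    return labels


def kendall_tau_a(true_scores, pred_scores):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    true_l = pairwise_labels(true_scores)
    pred_l = pairwise_labels(pred_scores)
    if not true_l:
        raise ValueError("need at least two items to form a pair")
    # The product is +1 for a concordant pair, -1 for a discordant pair,
    # and 0 when either side is tied.
    return sum(true_l[p] * pred_l[p] for p in true_l) / len(true_l)


def ranking_accuracy(true_scores, pred_scores):
    """Fraction of non-tied ground-truth pairs ordered correctly."""
    true_l = pairwise_labels(true_scores)
    pred_l = pairwise_labels(pred_scores)
    decided = [p for p in true_l if true_l[p] != 0]
    if not decided:
        raise ValueError("no non-tied ground-truth pairs")
    return sum(true_l[p] == pred_l[p] for p in decided) / len(decided)


# Hypothetical toy data: items 1 and 2 have close MOS values and are
# ranked in the wrong order by the predictor.
true_mos = [4.5, 3.0, 3.1, 1.8]
pred_mos = [4.2, 3.3, 3.2, 2.0]

tau = kendall_tau_a(true_mos, pred_mos)
acc = ranking_accuracy(true_mos, pred_mos)
print(f"tau-a = {tau:.4f}, pairwise accuracy = {acc:.4f}")
```

In this toy case the only misordered pair is the one whose ground-truth scores are close, which is the situation the abstract identifies as difficult for earlier MOS predictors.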
Related papers
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction [40.51248841706311]
This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings.
We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning models, such as wav2vec, correlate with VoiceMOS scores.
arXiv Detail & Related papers (2023-12-25T05:35:28Z)
- Learning with Noisy Low-Cost MOS for Image Quality Assessment via Dual-Bias Calibration [20.671990508960906]
In view of the subjective bias of individual annotators, the labor-abundant mean opinion score (LA-MOS) typically requires a large collection of opinion scores from multiple annotators for each image.
In this paper, we aim to learn robust IQA models from low-cost MOS, which only requires very few opinion scores or even a single opinion score for each image.
To the best of our knowledge, this is the first exploration of robust IQA model learning from noisy low-cost labels.
arXiv Detail & Related papers (2023-11-27T14:11:54Z)
- Speech MOS multi-task learning and rater bias correction [10.123346550775471]
Mean opinion score (MOS) is standardized for the perceptual evaluation of speech quality and is obtained by asking listeners to rate the quality of a speech sample.
Here we propose a multi-task framework to include additional labels and data in training to improve the performance of a blind MOS estimation model.
arXiv Detail & Related papers (2022-12-04T20:06:27Z)
- Alibaba-Translate China's Submission for WMT 2022 Quality Estimation Shared Task [80.22825549235556]
We present our submission to the sentence-level MQM benchmark at Quality Estimation Shared Task, named UniTE.
Specifically, our systems employ the framework of UniTE, which combines three types of input formats during training with a pre-trained language model.
Results show that our models reach 1st overall ranking in the Multilingual and English-Russian settings, and 2nd overall ranking in English-German and Chinese-English settings.
arXiv Detail & Related papers (2022-10-18T08:55:27Z)
- Comparison of Speech Representations for the MOS Prediction System [1.2949520455740093]
We conduct experiments on a large-scale listening test corpus collected from past Blizzard and Voice Conversion Challenges.
We find that the wav2vec feature set showed the best generalization even though the given ground-truth was not always reliable.
arXiv Detail & Related papers (2022-06-28T08:18:18Z)
- DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores [64.37977826069105]
Mean opinion score (MOS) is a typical subjective evaluation metric for speech synthesis systems.
In this work, we propose DDOS, a novel MOS prediction model.
DDOS utilizes domain adaptive pre-training to further pre-train self-supervised learning models on synthetic speech.
arXiv Detail & Related papers (2022-04-07T05:04:10Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)