Speech MOS multi-task learning and rater bias correction
- URL: http://arxiv.org/abs/2212.01911v1
- Date: Sun, 4 Dec 2022 20:06:27 GMT
- Title: Speech MOS multi-task learning and rater bias correction
- Authors: Haleh Akrami, Hannes Gamper
- Abstract summary: Mean opinion score (MOS) is standardized for the perceptual evaluation of speech quality and is obtained by asking listeners to rate the quality of a speech sample.
Here we propose a multi-task framework to include additional labels and data in training to improve the performance of a blind MOS estimation model.
- Score: 10.123346550775471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Perceptual speech quality is an important performance metric for
teleconferencing applications. The mean opinion score (MOS) is standardized for
the perceptual evaluation of speech quality and is obtained by asking listeners
to rate the quality of a speech sample. Recently, there has been increasing
research interest in developing models for estimating MOS blindly. Here we
propose a multi-task framework to include additional labels and data in
training to improve the performance of a blind MOS estimation model.
Experimental results indicate that the proposed model can be trained to jointly
estimate MOS, reverberation time (T60), and clarity (C50) by combining two
disjoint data sets in training, one containing only MOS labels and the other
containing only T60 and C50 labels. Furthermore, we use a semi-supervised
framework to combine two MOS data sets in training, one containing only MOS
labels (per ITU-T Recommendation P.808), and the other containing separate
scores for speech signal, background noise, and overall quality (per ITU-T
Recommendation P.835). Finally, we present preliminary results for addressing
individual rater bias in the MOS labels.
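The key mechanism the abstract describes is combining two disjoint data sets (one with only MOS labels, one with only T60/C50 labels) in a single training run. A minimal sketch of the standard loss-masking approach that makes this possible follows; the function names, loss weights, and toy numbers are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch: multi-task training over disjoint label sets.
# Each task's loss ignores samples that lack a label for that task,
# so MOS-only and T60/C50-only samples can share one batch.
import numpy as np

def masked_mse(pred, target, mask):
    """MSE computed only over samples where mask == 1."""
    mask = mask.astype(float)
    n = max(mask.sum(), 1.0)  # guard against a batch with no labels for a task
    return float(((pred - target) ** 2 * mask).sum() / n)

def joint_loss(preds, labels, masks, weights=None):
    """Weighted sum of per-task masked losses (e.g. MOS, T60, C50)."""
    weights = weights or {t: 1.0 for t in preds}
    return sum(weights[t] * masked_mse(preds[t], labels[t], masks[t])
               for t in preds)

# A batch of 4 samples: the first two come from the MOS-labeled set,
# the last two from the T60/C50-labeled set (zeros mark missing labels).
preds  = {"mos": np.array([3.1, 4.2, 2.0, 3.3]),
          "t60": np.array([0.4, 0.6, 0.5, 0.7]),
          "c50": np.array([10.0, 12.0, 8.0, 9.0])}
labels = {"mos": np.array([3.0, 4.0, 0.0, 0.0]),
          "t60": np.array([0.0, 0.0, 0.5, 0.8]),
          "c50": np.array([0.0, 0.0, 9.0, 9.5])}
masks  = {"mos": np.array([1, 1, 0, 0]),
          "t60": np.array([0, 0, 1, 1]),
          "c50": np.array([0, 0, 1, 1])}

loss = joint_loss(preds, labels, masks)
```

Because each masked loss normalizes by its own label count, the gradient scale for a task does not collapse when a batch happens to contain few samples from that task's data set.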
Related papers
- Uncertainty as a Predictor: Leveraging Self-Supervised Learning for
Zero-Shot MOS Prediction [40.51248841706311]
This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings.
We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning models, such as wav2vec, correlate with VoiceMOS scores.
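The paper's specific uncertainty measures for wav2vec-style models are not detailed in this summary; as a generic, hedged illustration of the "uncertainty as a quality proxy" idea, one common choice is the mean entropy of a model's frame-wise posteriors.

```python
# Hedged sketch: utterance-level uncertainty as mean per-frame entropy
# of a pretrained model's posteriors. This is a generic proxy, not the
# paper's exact measure.
import numpy as np

def mean_frame_entropy(posteriors):
    """posteriors: (frames, classes) array, rows summing to 1.
    Higher entropy means the model is less certain about each frame,
    which can correlate with lower perceived speech quality."""
    eps = 1e-12  # avoid log(0)
    ent = -(posteriors * np.log(posteriors + eps)).sum(axis=1)
    return float(ent.mean())

# A confident model (peaked posteriors) vs. an uncertain one (flat).
confident = np.array([[0.98, 0.01, 0.01]] * 10)
uncertain = np.array([[1 / 3, 1 / 3, 1 / 3]] * 10)
```

A zero-shot predictor would then rank utterances by this score without any MOS training labels.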
arXiv Detail & Related papers (2023-12-25T05:35:28Z)
- Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model [28.32514067707762]
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net.
MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning.
The MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
arXiv Detail & Related papers (2023-08-18T02:36:21Z)
- MOSPC: MOS Prediction Based on Pairwise Comparison [32.55704173124071]
Mean opinion score (MOS) is a subjective metric to evaluate the quality of synthesized speech.
We propose MOSPC, a general framework for MOS prediction based on pairwise comparison.
Our framework surpasses the strong baseline in ranking accuracy on each fine-grained segment.
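The generic mechanism behind pairwise-comparison MOS training can be sketched with a standard logistic (RankNet-style) ranking loss; MOSPC's actual loss and architecture may differ, so treat this as an illustration of the idea only.

```python
# Hedged sketch: logistic pairwise ranking loss. The model outputs a
# scalar quality score per sample; training pushes the preferred sample
# of each pair to score higher.
import numpy as np

def pairwise_rank_loss(score_a, score_b, a_preferred):
    """Cross-entropy on P(A beats B) = sigmoid(score_a - score_b).
    a_preferred is 1 if listeners rated sample A higher, else 0."""
    p = 1.0 / (1.0 + np.exp(-(score_a - score_b)))
    eps = 1e-12  # numerical guard for log
    return float(-(a_preferred * np.log(p + eps)
                   + (1 - a_preferred) * np.log(1 - p + eps)))
```

The loss shrinks as the score gap agrees with the listener preference, which is why pairwise training tends to sharpen ranking accuracy on fine-grained quality segments.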
arXiv Detail & Related papers (2023-06-18T07:38:17Z)
- Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features [54.48824266041105]
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models.
We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems.
All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
arXiv Detail & Related papers (2022-11-01T09:18:50Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- Improving Self-Supervised Learning-based MOS Prediction Networks [0.0]
The present work introduces data-, training-, and post-training-specific improvements to a previous self-supervised learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers.
The methods are evaluated using the shared synthetic speech dataset of the first Voice MOS challenge.
arXiv Detail & Related papers (2022-04-23T09:19:16Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion score (MOS) dataset consisting solely of neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose MEmoBERT, a pre-training model for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
- Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning With Spoofing Detection and Spoofing Type Classification [16.43844160498413]
We propose a multi-task learning (MTL) method to improve the performance of a MOS prediction model.
Experiments using the Voice Conversion Challenge 2018 show that the proposed MTL with two auxiliary tasks improves MOS prediction.
arXiv Detail & Related papers (2020-07-16T11:38:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all listed content) and is not responsible for any consequences arising from its use.