Speech MOS multi-task learning and rater bias correction
- URL: http://arxiv.org/abs/2212.01911v1
- Date: Sun, 4 Dec 2022 20:06:27 GMT
- Title: Speech MOS multi-task learning and rater bias correction
- Authors: Haleh Akrami, Hannes Gamper
- Abstract summary: Mean opinion score (MOS) is standardized for the perceptual evaluation of speech quality and is obtained by asking listeners to rate the quality of a speech sample.
Here we propose a multi-task framework to include additional labels and data in training to improve the performance of a blind MOS estimation model.
- Score: 10.123346550775471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Perceptual speech quality is an important performance metric for
teleconferencing applications. The mean opinion score (MOS) is standardized for
the perceptual evaluation of speech quality and is obtained by asking listeners
to rate the quality of a speech sample. Recently, there has been increasing
research interest in developing models for estimating MOS blindly. Here we
propose a multi-task framework to include additional labels and data in
training to improve the performance of a blind MOS estimation model.
Experimental results indicate that the proposed model can be trained to jointly
estimate MOS, reverberation time (T60), and clarity (C50) by combining two
disjoint data sets in training, one containing only MOS labels and the other
containing only T60 and C50 labels. Furthermore, we use a semi-supervised
framework to combine two MOS data sets in training, one containing only MOS
labels (per ITU-T Recommendation P.808), and the other containing separate
scores for speech signal, background noise, and overall quality (per ITU-T
Recommendation P.835). Finally, we present preliminary results for addressing
individual rater bias in the MOS labels.
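The key mechanism the abstract describes is combining two disjoint data sets (one with only MOS labels, one with only T60/C50 labels) in a single training run. A minimal sketch of the standard loss-masking approach that makes this possible follows; the function names, loss weights, and toy numbers are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch: multi-task training over disjoint label sets.
# Each task's loss ignores samples that lack a label for that task,
# so MOS-only and T60/C50-only samples can share one batch.
import numpy as np

def masked_mse(pred, target, mask):
    """MSE computed only over samples where mask == 1."""
    mask = mask.astype(float)
    n = max(mask.sum(), 1.0)  # guard against a batch with no labels for a task
    return float(((pred - target) ** 2 * mask).sum() / n)

def joint_loss(preds, labels, masks, weights=None):
    """Weighted sum of per-task masked losses (e.g. MOS, T60, C50)."""
    weights = weights or {t: 1.0 for t in preds}
    return sum(weights[t] * masked_mse(preds[t], labels[t], masks[t])
               for t in preds)

# A batch of 4 samples: the first two come from the MOS-labeled set,
# the last two from the T60/C50-labeled set (zeros mark missing labels).
preds  = {"mos": np.array([3.1, 4.2, 2.0, 3.3]),
          "t60": np.array([0.4, 0.6, 0.5, 0.7]),
          "c50": np.array([10.0, 12.0, 8.0, 9.0])}
labels = {"mos": np.array([3.0, 4.0, 0.0, 0.0]),
          "t60": np.array([0.0, 0.0, 0.5, 0.8]),
          "c50": np.array([0.0, 0.0, 9.0, 9.5])}
masks  = {"mos": np.array([1, 1, 0, 0]),
          "t60": np.array([0, 0, 1, 1]),
          "c50": np.array([0, 0, 1, 1])}

loss = joint_loss(preds, labels, masks)
```

Because each masked loss normalizes by its own label count, the gradient scale for a task does not collapse when a batch happens to contain few samples from that task's data set.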
Related papers
- Uncertainty as a Predictor: Leveraging Self-Supervised Learning for
Zero-Shot MOS Prediction [40.51248841706311]
This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings.
We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning models, such as wav2vec, correlate with VoiceMOS scores.
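The paper's specific uncertainty measures for wav2vec-style models are not detailed in this summary; as a generic, hedged illustration of the "uncertainty as a quality proxy" idea, one common choice is the mean entropy of a model's frame-wise posteriors.

```python
# Hedged sketch: utterance-level uncertainty as mean per-frame entropy
# of a pretrained model's posteriors. This is a generic proxy, not the
# paper's exact measure.
import numpy as np

def mean_frame_entropy(posteriors):
    """posteriors: (frames, classes) array, rows summing to 1.
    Higher entropy means the model is less certain about each frame,
    which can correlate with lower perceived speech quality."""
    eps = 1e-12  # avoid log(0)
    ent = -(posteriors * np.log(posteriors + eps)).sum(axis=1)
    return float(ent.mean())

# A confident model (peaked posteriors) vs. an uncertain one (flat).
confident = np.array([[0.98, 0.01, 0.01]] * 10)
uncertain = np.array([[1 / 3, 1 / 3, 1 / 3]] * 10)
```

A zero-shot predictor would then rank utterances by this score without any MOS training labels.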
arXiv Detail & Related papers (2023-12-25T05:35:28Z)
- Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model [28.32514067707762]
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net.
MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning.
The MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
arXiv Detail & Related papers (2023-08-18T02:36:21Z)
- MOSPC: MOS Prediction Based on Pairwise Comparison [32.55704173124071]
Mean opinion score (MOS) is a subjective metric to evaluate the quality of synthesized speech.
We propose MOSPC, a general framework for MOS prediction based on pairwise comparison.
Our framework surpasses the strong baseline in ranking accuracy on each fine-grained segment.
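The generic mechanism behind pairwise-comparison MOS training can be sketched with a standard logistic (RankNet-style) ranking loss; MOSPC's actual loss and architecture may differ, so treat this as an illustration of the idea only.

```python
# Hedged sketch: logistic pairwise ranking loss. The model outputs a
# scalar quality score per sample; training pushes the preferred sample
# of each pair to score higher.
import numpy as np

def pairwise_rank_loss(score_a, score_b, a_preferred):
    """Cross-entropy on P(A beats B) = sigmoid(score_a - score_b).
    a_preferred is 1 if listeners rated sample A higher, else 0."""
    p = 1.0 / (1.0 + np.exp(-(score_a - score_b)))
    eps = 1e-12  # numerical guard for log
    return float(-(a_preferred * np.log(p + eps)
                   + (1 - a_preferred) * np.log(1 - p + eps)))
```

The loss shrinks as the score gap agrees with the listener preference, which is why pairwise training tends to sharpen ranking accuracy on fine-grained quality segments.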
arXiv Detail & Related papers (2023-06-18T07:38:17Z)
- Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features [54.48824266041105]
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models.
We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems.
All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
arXiv Detail & Related papers (2022-11-01T09:18:50Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- Improving Self-Supervised Learning-based MOS Prediction Networks [0.0]
The present work introduces data-, training-, and post-training-specific improvements to a previous self-supervised learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers.
The methods are evaluated using the shared synthetic speech dataset of the first Voice MOS challenge.
arXiv Detail & Related papers (2022-04-23T09:19:16Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion score (MOS) dataset consisting solely of neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose MEmoBERT, a pre-training model for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
- Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning With Spoofing Detection and Spoofing Type Classification [16.43844160498413]
We propose a multi-task learning (MTL) method to improve the performance of a MOS prediction model.
Experiments using the Voice Conversion Challenge 2018 show that the proposed MTL with two auxiliary tasks improves MOS prediction.
arXiv Detail & Related papers (2020-07-16T11:38:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all listed content) and is not responsible for any consequences arising from its use.