Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality
Assessment Model
- URL: http://arxiv.org/abs/2308.09262v3
- Date: Wed, 13 Mar 2024 13:15:11 GMT
- Title: Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality
Assessment Model
- Authors: Ryandhimas E. Zezario, Bo-Ren Brian Bai, Chiou-Shann Fuh, Hsin-Min
Wang and Yu Tsao
- Abstract summary: This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net.
MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning.
The MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
- Score: 28.32514067707762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study proposes a multi-task pseudo-label learning (MPL)-based
non-intrusive speech quality assessment model called MTQ-Net. MPL consists of
two stages: obtaining pseudo-label scores from a pretrained model and
performing multi-task learning. The 3QUEST metrics, namely Speech-MOS (S-MOS),
Noise-MOS (N-MOS), and General-MOS (G-MOS), are the assessment targets. The
pretrained MOSA-Net model is utilized to estimate three pseudo labels:
perceptual evaluation of speech quality (PESQ), short-time objective
intelligibility (STOI), and speech distortion index (SDI). Multi-task learning
is then employed to train MTQ-Net by combining a supervised loss (derived from
the difference between the estimated score and the ground-truth label) and a
semi-supervised loss (derived from the difference between the estimated score
and the pseudo label), where the Huber loss is employed as the loss function.
Experimental results first demonstrate the advantages of MPL compared to
training a model from scratch and using a direct knowledge transfer mechanism.
Second, the benefit of the Huber loss for improving the predictive ability of
MTQ-Net is verified. Finally, the MTQ-Net with the MPL approach exhibits higher
overall predictive power compared to other SSL-based speech assessment models.
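
To make the training objective concrete, below is a minimal sketch of the two-term loss described in the abstract: a supervised Huber loss against the ground-truth 3QUEST labels and a semi-supervised Huber loss against the MOSA-Net pseudo labels. The helper names, the equal per-target weighting, the `alpha` weight, and the toy scores are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the MPL objective: supervised Huber loss on
# ground-truth 3QUEST labels (S-MOS, N-MOS, G-MOS) plus semi-supervised
# Huber loss on pseudo labels (PESQ, STOI, SDI) from pretrained MOSA-Net.
# Names, weighting, and scores are assumptions for illustration only.

def huber(pred: float, target: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic for small errors, linear for large ones."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

def mpl_loss(estimates, ground_truth, pseudo_labels, alpha=1.0):
    """Sum the supervised and semi-supervised Huber terms.

    estimates     -- model outputs keyed by target name
    ground_truth  -- labeled 3QUEST scores (supervised targets)
    pseudo_labels -- MOSA-Net pseudo-label scores (semi-supervised targets)
    alpha         -- weight on the semi-supervised term (assumed, not stated)
    """
    supervised = sum(huber(estimates[k], v) for k, v in ground_truth.items())
    semi = sum(huber(estimates[k], v) for k, v in pseudo_labels.items())
    return supervised + alpha * semi

# Toy usage with made-up scores:
est = {"S-MOS": 3.9, "N-MOS": 3.5, "G-MOS": 3.7,
       "PESQ": 2.8, "STOI": 0.91, "SDI": 0.12}
truth = {"S-MOS": 4.1, "N-MOS": 3.4, "G-MOS": 3.8}
pseudo = {"PESQ": 2.6, "STOI": 0.93, "SDI": 0.10}
print(mpl_loss(est, truth, pseudo))
```

Compared with a squared-error loss, the Huber loss caps the influence of large errors, which is consistent with the abstract's finding that it improves the predictive ability of MTQ-Net.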
Related papers
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels [95.44077384918725]
We propose to teach large multi-modality models (LMMs) with text-defined rating levels instead of scores.
The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA) and video quality assessment (VQA) tasks.
arXiv Detail & Related papers (2023-12-28T16:10:25Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- Pre-training Language Model as a Multi-perspective Course Learner [103.17674402415582]
This study proposes a multi-perspective course learning (MCL) method for sample-efficient pre-training.
Three self-supervision courses are designed to alleviate the inherent flaws of the "tug-of-war" dynamics.
Our method significantly improves ELECTRA's average performance by 2.8% and 3.2% absolute points on the GLUE and SQuAD 2.0 benchmarks, respectively.
arXiv Detail & Related papers (2023-05-06T09:02:10Z)
- Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features [54.48824266041105]
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models.
We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems.
All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
arXiv Detail & Related papers (2022-11-01T09:18:50Z)
- Improving Self-Supervised Learning-based MOS Prediction Networks [0.0]
The present work introduces data-, training-, and post-training-specific improvements to a previous self-supervised learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers.
The methods are evaluated using the shared synthetic speech dataset of the first Voice MOS challenge.
arXiv Detail & Related papers (2022-04-23T09:19:16Z)
- MTI-Net: A Multi-Target Speech Intelligibility Prediction Model [25.124218779681875]
This study proposes a multi-task speech intelligibility prediction model, called MTI-Net, for simultaneously predicting human and machine intelligibility measures.
Specifically, given a speech utterance, MTI-Net is designed to predict subjective listening test results and word error rate (WER) scores.
arXiv Detail & Related papers (2022-04-07T09:17:04Z)
- Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features [30.57631206882462]
The MOSA-Net is designed to estimate speech quality, intelligibility, and distortion assessment scores based on a test speech signal as input.
We show that the MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on both noisy and enhanced speech utterances.
arXiv Detail & Related papers (2021-11-03T17:30:43Z)
- Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition [55.362258027878966]
We present momentum pseudo-labeling (MPL) as a simple yet effective strategy for semi-supervised speech recognition.
MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method.
The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios.
arXiv Detail & Related papers (2021-06-16T16:24:55Z)
- On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer [40.63693071222628]
We study the minimum word error rate (MWER) training of the Hybrid Autoregressive Transducer (HAT).
From experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models.
arXiv Detail & Related papers (2020-10-23T21:16:30Z)
- Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning With Spoofing Detection and Spoofing Type Classification [16.43844160498413]
We propose a multi-task learning (MTL) method to improve the performance of a MOS prediction model.
Experiments using the Voice Conversion Challenge 2018 show that the proposed MTL with two auxiliary tasks improves MOS prediction.
arXiv Detail & Related papers (2020-07-16T11:38:08Z)