Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning
With Spoofing Detection and Spoofing Type Classification
- URL: http://arxiv.org/abs/2007.08267v2
- Date: Wed, 2 Dec 2020 07:56:32 GMT
- Title: Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning
With Spoofing Detection and Spoofing Type Classification
- Authors: Yeunju Choi, Youngmoon Jung, Hoirin Kim
- Abstract summary: We propose a multi-task learning (MTL) method to improve the performance of a MOS prediction model.
Experiments using the Voice Conversion Challenge 2018 show that the proposed MTL with two auxiliary tasks improves MOS prediction.
- Score: 16.43844160498413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several studies have proposed deep-learning-based models to predict the mean
opinion score (MOS) of synthesized speech, showing the possibility of replacing
human raters. However, inter- and intra-rater variability in MOSs makes it hard
to ensure the high performance of the models. In this paper, we propose a
multi-task learning (MTL) method to improve the performance of a MOS prediction
model using the following two auxiliary tasks: spoofing detection (SD) and
spoofing type classification (STC). In addition, we use the focal loss to maximize
the synergy between SD and STC for MOS prediction. Experiments using the MOS
evaluation results of the Voice Conversion Challenge 2018 show that the proposed
MTL with two auxiliary tasks improves MOS prediction. Our proposed model
achieves up to 11.6% relative improvement in performance over the baseline
model.
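To make the setup concrete, below is a minimal PyTorch sketch of the kind of multi-task model the abstract describes: a shared encoder feeding a MOS regression head plus two auxiliary heads for spoofing detection (SD) and spoofing type classification (STC). The encoder choice (a BLSTM over mel features), the hidden sizes, the mean pooling, the number of spoofing types, the loss weights w_sd / w_stc, and the decision to apply the focal loss only to the two auxiliary classification heads are all illustrative assumptions, not the authors' exact configuration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLoss(nn.Module):
    """Multi-class focal loss: (1 - p_t)^gamma down-weights easy examples."""

    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        log_p = F.log_softmax(logits, dim=-1)                  # (batch, classes)
        log_p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
        p_t = log_p_t.exp()
        return (-(1.0 - p_t) ** self.gamma * log_p_t).mean()


class MTLMOSPredictor(nn.Module):
    """Shared encoder with a MOS regression head and two auxiliary heads (SD, STC)."""

    def __init__(self, n_mels=80, hidden=128, n_spoof_types=6):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.mos_head = nn.Linear(2 * hidden, 1)               # MOS regression
        self.sd_head = nn.Linear(2 * hidden, 2)                # genuine vs. spoofed
        self.stc_head = nn.Linear(2 * hidden, n_spoof_types)   # spoofing type

    def forward(self, feats):                                  # (batch, frames, n_mels)
        h, _ = self.encoder(feats)
        pooled = h.mean(dim=1)                                 # utterance-level embedding
        return (self.mos_head(pooled).squeeze(-1),
                self.sd_head(pooled),
                self.stc_head(pooled))


def mtl_loss(model_out, mos_true, sd_true, stc_true, w_sd=0.5, w_stc=0.5):
    """MOS regression loss plus focal-loss-weighted auxiliary losses."""
    mos_pred, sd_logits, stc_logits = model_out
    focal = FocalLoss(gamma=2.0)
    return (F.mse_loss(mos_pred, mos_true)
            + w_sd * focal(sd_logits, sd_true)
            + w_stc * focal(stc_logits, stc_true))
```
In this sketch a single backward pass through mtl_loss updates the shared encoder with gradients from all three tasks; the intent, as in the abstract, is that the spoofing-related signals regularize the representation used for MOS regression.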
Related papers
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models [105.70889434492143]
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling.
We show that we can convert AR models ranging from 127M to 7B parameters into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training.
Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts.
arXiv Detail & Related papers (2024-10-23T14:04:22Z)
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models [57.582219834039506]
We introduce the training methodologies implemented in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts.
It is based on the pre-existing dense checkpoints of our Skywork-13B model.
arXiv Detail & Related papers (2024-06-03T03:58:41Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
Transferring the pretrained models to downstream tasks may nonetheless encounter task discrepancy, because pretraining is formulated as an image classification or object discrimination task.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
- Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model [28.32514067707762]
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net.
MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning.
The MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
arXiv Detail & Related papers (2023-08-18T02:36:21Z)
- Speech MOS multi-task learning and rater bias correction [10.123346550775471]
Mean opinion score (MOS) is standardized for the perceptual evaluation of speech quality and is obtained by asking listeners to rate the quality of a speech sample.
Here we propose a multi-task framework to include additional labels and data in training to improve the performance of a blind MOS estimation model.
arXiv Detail & Related papers (2022-12-04T20:06:27Z)
- Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features [54.48824266041105]
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models.
We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems.
All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
arXiv Detail & Related papers (2022-11-01T09:18:50Z)
- Improving Self-Supervised Learning-based MOS Prediction Networks [0.0]
The present work introduces data-, training-, and post-training-specific improvements to a previous self-supervised-learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers.
The methods are evaluated using the shared synthetic speech dataset of the first Voice MOS challenge.
arXiv Detail & Related papers (2022-04-23T09:19:16Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion score (MOS) dataset consisting solely of neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech [15.796199345773873]
We propose a novel method to improve speech quality by training a TTS model under the supervision of perceptual loss.
We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech.
The proposed method can be applied independently regardless of the TTS model architecture or the cause of speech quality degradation.
arXiv Detail & Related papers (2020-11-02T18:13:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.