Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling
- URL: http://arxiv.org/abs/2008.03710v1
- Date: Sun, 9 Aug 2020 11:14:19 GMT
- Title: Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling
- Authors: Yeunju Choi, Youngmoon Jung, Hoirin Kim
- Abstract summary: Several recent papers have proposed deep-learning-based assessment models.
We propose three models using cluster-based modeling methods.
We show that the GQT layer helps to predict human assessment better by automatically learning the useful quality tokens for the task.
- Score: 16.43844160498413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While deep learning has made impressive progress in speech synthesis and
voice conversion, the assessment of the synthesized speech is still carried out
by human participants. Several recent papers have proposed deep-learning-based
assessment models and shown the potential to automate the speech quality
assessment. To improve the previously proposed assessment model, MOSNet, we
propose three models using cluster-based modeling methods: using a global
quality token (GQT) layer, using an Encoding Layer, and using both of them. We
perform experiments using the evaluation results of the Voice Conversion
Challenge 2018 to predict the mean opinion score of synthesized speech and
similarity score between synthesized speech and reference speech. The results
show that the GQT layer helps to predict human assessment better by
automatically learning the useful quality tokens for the task and that the
Encoding Layer helps to utilize frame-level scores more precisely.
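As a rough illustration of the cluster-based idea, a GQT layer can be viewed as a learnable bank of token embeddings over which frame-level features attend, in the spirit of global style tokens. The sketch below is a minimal stand-in: the token count, dimensions, and single-head attention are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalQualityTokens(nn.Module):
    """Minimal sketch of a GQT-style layer (all dimensions are illustrative)."""
    def __init__(self, num_tokens=10, token_dim=256, feat_dim=256):
        super().__init__()
        # Learnable bank of quality tokens shared across all utterances.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query = nn.Linear(feat_dim, token_dim)

    def forward(self, frame_feats):                    # (batch, time, feat_dim)
        q = self.query(frame_feats)                    # (batch, time, token_dim)
        attn = F.softmax(q @ self.tokens.t(), dim=-1)  # (batch, time, num_tokens)
        # Each frame is re-expressed as a soft mixture of quality tokens,
        # which is what lets the layer learn quality clusters automatically.
        return attn @ self.tokens                      # (batch, time, token_dim)
```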
Related papers
- Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech [50.95292368372455]
We propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized-variational autoencoder (VQ-VAE)
The training of the VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted.
We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training.
arXiv Detail & Related papers (2024-02-26T06:01:38Z)
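The quantization-error idea above can be sketched as the distance between encoder outputs and their nearest codebook entries; the encoder and codebook here are placeholders, and the published VQScore metric differs in its exact normalization and aggregation.

```python
import torch

def vq_score(z, codebook):
    """Placeholder sketch of the VQScore idea.
    z: (time, dim) encoder outputs; codebook: (K, dim) trained on clean speech."""
    dists = torch.cdist(z, codebook) ** 2    # (time, K) squared distances
    quant_err = dists.min(dim=-1).values     # nearest-code error per frame
    # Distorted speech falls far from the clean-speech codebook, so a larger
    # mean quantization error signals lower quality; negate for a quality score.
    return -quant_err.mean()
```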
- Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer [49.01417720472321]
We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody.
Experiments are performed on the NUS and NHSS datasets which consist of parallel data of speech and singing voice.
arXiv Detail & Related papers (2022-08-26T02:54:57Z)
- DDKtor: Automatic Diadochokinetic Speech Analysis [13.68342426889044]
This paper presents two deep neural network models that automatically segment consonants and vowels from unannotated, untranscribed speech.
Results on a dataset of young, healthy individuals show that our LSTM model outperforms the current state-of-the-art systems.
The LSTM model also achieves results comparable to trained human annotators when evaluated on an unseen dataset of older individuals with Parkinson's Disease.
arXiv Detail & Related papers (2022-06-29T13:34:03Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis [13.676243543864347]
We propose an end-to-end method that is able to generate high-quality speech and better similarity for both seen and unseen speakers.
The method consists of three separately trained components: a speaker encoder based on the state-of-the-art TDNN-based ECAPA-TDNN, a FastSpeech2 based synthesizer, and a HiFi-GAN vocoder.
arXiv Detail & Related papers (2022-03-20T07:04:26Z)
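The wiring of such a three-stage zero-shot multi-speaker TTS pipeline can be sketched with tiny stand-in modules in place of the real ECAPA-TDNN, FastSpeech2, and HiFi-GAN; every class, shape, and default below is an illustrative placeholder.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Stand-in for ECAPA-TDNN: reference mel -> fixed speaker embedding."""
    def __init__(self, n_mels=80, emb_dim=192):
        super().__init__()
        self.proj = nn.Linear(n_mels, emb_dim)

    def forward(self, ref_mel):                  # (batch, time, n_mels)
        return self.proj(ref_mel).mean(dim=1)    # (batch, emb_dim)

class Synthesizer(nn.Module):
    """Stand-in for FastSpeech2: text + speaker embedding -> mel frames."""
    def __init__(self, emb_dim=192, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(emb_dim, n_mels)

    def forward(self, text_ids, spk_emb):        # (batch, len), (batch, emb_dim)
        frame = self.proj(spk_emb).unsqueeze(1)          # (batch, 1, n_mels)
        return frame.expand(-1, text_ids.size(1), -1)    # (batch, len, n_mels)

class Vocoder(nn.Module):
    """Stand-in for HiFi-GAN: mel -> waveform samples."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, hop)

    def forward(self, mel):                      # (batch, time, n_mels)
        return self.proj(mel).flatten(1)         # (batch, time * hop)

def synthesize(text_ids, ref_mel, enc, syn, voc):
    # The three components are trained separately; because the speaker
    # encoder comes from speaker verification, unseen speakers also work.
    spk_emb = enc(ref_mel)
    return voc(syn(text_ids, spk_emb))
```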
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By feeding the predicted discrete symbol sequence into the synthesis model, each target speech signal can be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
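The separation-by-synthesis idea can be caricatured as: predict one discrete symbol sequence per target speaker from the mixture, then decode each sequence back to a waveform. The symbol predictor, codebook, and decoder below are placeholder stubs, not the paper's model.

```python
import torch
import torch.nn as nn

class SymbolSeparator(nn.Module):
    """Placeholder: predicts one discrete symbol sequence per target speaker."""
    def __init__(self, feat_dim=80, vocab=512, num_spk=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, vocab) for _ in range(num_spk))

    def forward(self, mixture_feats):  # (batch, time, feat_dim)
        # argmax yields each target's discrete symbol sequence: (batch, time)
        return [head(mixture_feats).argmax(dim=-1) for head in self.heads]

def resynthesize(symbol_ids, codebook, decoder):
    # Map symbols back to continuous features (codebook is an nn.Embedding),
    # then decode to a waveform. Distortions in the mixture never reach the
    # output because the signal is rebuilt from learned clean units rather
    # than masked out of the noisy mixture.
    feats = codebook(symbol_ids)       # (batch, time, dim)
    return decoder(feats)
```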
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL)
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
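The intermediate-layer supervision above can be sketched as applying the same SSL loss at chosen intermediate layers and summing; the layer indices and the loss callable here are placeholders.

```python
def ils_ssl_loss(layer_outputs, targets, ssl_loss, intermediate_layers=(4, 8)):
    """Placeholder sketch of ILS-SSL-style training.
    layer_outputs: list of (batch, time, dim) hidden states, one per layer."""
    # Standard SSL loss on the top layer (as in HuBERT-style training).
    total = ssl_loss(layer_outputs[-1], targets)
    # Extra copies of the same loss on intermediate layers push content
    # information (useful for ASR) into the lower half of the network.
    for i in intermediate_layers:
        total = total + ssl_loss(layer_outputs[i], targets)
    return total
```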
- Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features [30.57631206882462]
The MOSA-Net is designed to estimate speech quality, intelligibility, and distortion assessment scores based on a test speech signal as input.
We show that the MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on both noisy and enhanced speech utterances.
arXiv Detail & Related papers (2021-11-03T17:30:43Z)
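A multi-objective assessment head in the spirit of MOSA-Net can be sketched as one shared encoder with a separate regression head per target metric; the encoder choice, pooling, and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiObjectiveAssessor(nn.Module):
    """Placeholder sketch: shared encoder, one regression head per metric."""
    def __init__(self, feat_dim=257, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.heads = nn.ModuleDict({
            "pesq": nn.Linear(2 * hidden, 1),
            "stoi": nn.Linear(2 * hidden, 1),
            "sdi":  nn.Linear(2 * hidden, 1),
        })

    def forward(self, feats):              # feats: (batch, time, feat_dim)
        h, _ = self.encoder(feats)         # (batch, time, 2*hidden)
        utt = h.mean(dim=1)                # pool frames to utterance level
        return {name: head(utt).squeeze(-1) for name, head in self.heads.items()}
```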
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
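Listener-dependent MOS modeling can be sketched as conditioning the predictor on a listener embedding; averaging predictions over all known listener IDs at inference is one simple stabilizing option, shown here as an illustrative stand-in for LDNet's actual inference methods. All sizes are placeholders.

```python
import torch
import torch.nn as nn

class ListenerDependentMOS(nn.Module):
    """Placeholder sketch of listener-conditioned MOS prediction."""
    def __init__(self, feat_dim=256, num_listeners=270, emb_dim=64):
        super().__init__()
        self.listener_emb = nn.Embedding(num_listeners, emb_dim)
        self.head = nn.Linear(feat_dim + emb_dim, 1)

    def forward(self, utt_feats, listener_ids):  # (batch, feat_dim), (batch,)
        e = self.listener_emb(listener_ids)
        return self.head(torch.cat([utt_feats, e], dim=-1)).squeeze(-1)

    @torch.no_grad()
    def infer_mean_over_listeners(self, utt_feats):
        batch = utt_feats.size(0)
        preds = []
        for i in range(self.listener_emb.num_embeddings):
            ids = torch.full((batch,), i, dtype=torch.long)
            preds.append(self(utt_feats, ids))
        # Averaging over all listener identities gives a stable,
        # listener-independent score.
        return torch.stack(preds).mean(dim=0)
```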
- Deep Learning Based Assessment of Synthetic Speech Naturalness [14.463987018380468]
We present a new objective prediction model for synthetic speech naturalness.
It can be used to evaluate Text-To-Speech or Voice Conversion systems.
arXiv Detail & Related papers (2021-04-23T16:05:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.