LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech
- URL: http://arxiv.org/abs/2110.09103v1
- Date: Mon, 18 Oct 2021 08:52:31 GMT
- Title: LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech
- Authors: Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda
- Abstract summary: We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
- Score: 67.88748572167309
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An effective approach to automatically predict the subjective rating for
synthetic speech is to train on a listening test dataset with human-annotated
scores. Although each speech sample in the dataset is rated by several
listeners, most previous works only used the mean score as the training target.
In this work, we present LDNet, a unified framework for mean opinion score
(MOS) prediction that predicts the listener-wise perceived quality given the
input speech and the listener identity. We reflect recent advances in listener-dependent (LD)
modeling, including design choices of the model architecture, and propose two
inference methods that provide more stable results and efficient computation.
We conduct systematic experiments on the Voice Conversion Challenge (VCC) 2018
benchmark and a newly collected large-scale MOS dataset, providing an in-depth
analysis of the proposed framework. Results show that the mean listener
inference method is a better way to utilize the mean scores, whose
effectiveness is more obvious when having more ratings per sample.
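The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the neural network is replaced by a toy additive model (sample quality plus listener bias), and the names `MEAN_LISTENER`, `build_training_triples`, `fit_additive_model`, `all_listeners_inference`, and `mean_listener_inference` are hypothetical. The abstract does not name the second inference method, so averaging predictions over all listeners is used here as an assumed comparison. The sketch shows how per-listener ratings become (sample, listener, score) training targets, and how a virtual "mean listener" lets the mean score be predicted in a single pass.

```python
from collections import defaultdict
from statistics import mean

MEAN_LISTENER = "mean_listener"  # hypothetical ID for the virtual listener


def build_training_triples(ratings):
    """Expand {sample: {listener: score}} into (sample, listener, score) triples.

    One extra triple per sample pairs the virtual mean listener with the
    sample's mean score, so mean scores become ordinary LD training targets.
    """
    triples = []
    for sample_id, by_listener in ratings.items():
        for listener_id, score in by_listener.items():
            triples.append((sample_id, listener_id, float(score)))
        sample_mean = float(mean(by_listener.values()))
        triples.append((sample_id, MEAN_LISTENER, sample_mean))
    return triples


def fit_additive_model(triples):
    """Toy stand-in for a neural LD model: score ~ quality[sample] + bias[listener]."""
    by_sample = defaultdict(list)
    for sample_id, _, score in triples:
        by_sample[sample_id].append(score)
    quality = {s: mean(v) for s, v in by_sample.items()}

    residuals = defaultdict(list)
    for sample_id, listener_id, score in triples:
        residuals[listener_id].append(score - quality[sample_id])
    bias = {listener: mean(v) for listener, v in residuals.items()}
    return quality, bias


def predict(model, sample_id, listener_id):
    """Listener-wise score for one (sample, listener) pair."""
    quality, bias = model
    return quality[sample_id] + bias[listener_id]


def all_listeners_inference(model, sample_id):
    """Average the predictions over every real listener (one pass per listener)."""
    _, bias = model
    real_listeners = [l for l in bias if l != MEAN_LISTENER]
    return mean([predict(model, sample_id, l) for l in real_listeners])


def mean_listener_inference(model, sample_id):
    """Single pass using only the virtual mean listener's ID."""
    return predict(model, sample_id, MEAN_LISTENER)


if __name__ == "__main__":
    # Made-up ratings for illustration only.
    ratings = {"utt1": {"A": 4, "B": 2}, "utt2": {"A": 5, "B": 3}}
    model = fit_additive_model(build_training_triples(ratings))
    print(all_listeners_inference(model, "utt1"))
    print(mean_listener_inference(model, "utt1"))
```

In this sketch, mean listener inference needs one prediction per sample regardless of how many listeners took part in the test, whereas averaging over listeners costs one prediction per listener.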
Related papers
- A Large-Scale Evaluation of Speech Foundation Models [110.95827399522204]
We establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the foundation model paradigm for speech.
We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads.
arXiv Detail & Related papers (2024-04-15T00:03:16Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"Influence scores" are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z)
- How to Estimate Model Transferability of Pre-Trained Speech Models? [84.11085139766108]
We introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs).
We leverage two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates.
Our framework efficiently computes transferability scores without actual fine-tuning of candidate models or layers.
arXiv Detail & Related papers (2023-06-01T04:52:26Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids [22.736703635666164]
We propose a multi-branched speech intelligibility prediction model (MBI-Net) for predicting subjective intelligibility scores of hearing aid (HA) users.
The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores.
arXiv Detail & Related papers (2022-04-07T09:13:44Z)
- Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features [30.57631206882462]
The MOSA-Net is designed to estimate speech quality, intelligibility, and distortion assessment scores based on a test speech signal as input.
We show that the MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on both noisy and enhanced speech utterances.
arXiv Detail & Related papers (2021-11-03T17:30:43Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Deep Learning Based Assessment of Synthetic Speech Naturalness [14.463987018380468]
We present a new objective prediction model for synthetic speech naturalness.
It can be used to evaluate Text-To-Speech or Voice Conversion systems.
arXiv Detail & Related papers (2021-04-23T16:05:20Z)
- Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling [16.43844160498413]
Several recent papers have proposed deep-learning-based assessment models.
We propose three models using cluster-based modeling methods.
We show that the global quality token (GQT) layer helps predict human assessments better by automatically learning quality tokens that are useful for the task.
arXiv Detail & Related papers (2020-08-09T11:14:19Z)