Predicting pairwise preferences between TTS audio stimuli using parallel
ratings data and anti-symmetric twin neural networks
- URL: http://arxiv.org/abs/2209.11003v1
- Date: Thu, 22 Sep 2022 13:34:22 GMT
- Authors: Cassia Valentini-Botinhao, Manuel Sam Ribeiro, Oliver Watts, Korin
Richmond, Gustav Eje Henter
- Abstract summary: We propose a model based on anti-symmetric twin neural networks, trained on pairs of waveforms and their corresponding preference scores.
To obtain a large training set we convert listeners' ratings from MUSHRA tests to values that reflect how often one stimulus in the pair was rated higher than the other.
Our results compare favourably to a state-of-the-art model trained to predict MOS scores.
- Score: 24.331098975217596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically predicting the outcome of subjective listening tests is a
challenging task. Ratings may vary from person to person even if preferences
are consistent across listeners. While previous work has focused on predicting
listeners' ratings (mean opinion scores) of individual stimuli, we focus on the
simpler task of predicting subjective preference given two speech stimuli for
the same text. We propose a model based on anti-symmetric twin neural networks,
trained on pairs of waveforms and their corresponding preference scores. We
explore both attention and recurrent neural nets to account for the fact that
stimuli in a pair are not time aligned. To obtain a large training set we
convert listeners' ratings from MUSHRA tests to values that reflect how often
one stimulus in the pair was rated higher than the other. Specifically, we
evaluate performance on data obtained from twelve MUSHRA evaluations conducted
over five years, containing different TTS systems, built from data of different
speakers. Our results compare favourably to a state-of-the-art model trained to
predict MOS scores.
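The two core ideas in the abstract can be sketched briefly. This is a minimal illustration, not the authors' implementation: the random linear encoder stands in for the attention/recurrent encoders the paper explores, the tie-splitting rule for the MUSHRA conversion is an assumption, and all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- MUSHRA ratings -> pairwise preference targets -------------------------
# Each listener rates both stimuli on a 0-100 MUSHRA scale. The training
# target for the pair (A, B) reflects how often A was rated above B.
def preference_target(ratings_a, ratings_b):
    a = np.asarray(ratings_a, dtype=float)
    b = np.asarray(ratings_b, dtype=float)
    wins = np.mean(a > b)          # fraction of listeners preferring A
    ties = np.mean(a == b)
    return wins + 0.5 * ties       # ties split evenly (an assumption here)

# Example: five listeners' scores for two TTS stimuli of the same text.
target = preference_target([80, 65, 70, 90, 55], [60, 70, 70, 85, 50])
# -> 0.7 (three wins for A, one tie, one loss)

# --- Anti-symmetric twin scoring --------------------------------------------
# A shared encoder maps each waveform to an embedding; scoring the
# *difference* of the two embeddings makes the preference score
# anti-symmetric by construction: s(a, b) = -s(b, a), and s(a, a) = 0.
W_enc = rng.normal(size=(8, 100))   # toy stand-in for the shared encoder
w_out = rng.normal(size=8)

def encode(waveform):
    x = np.resize(waveform, 100)    # crop/pad so the toy encoder applies
    return np.tanh(W_enc @ x)

def preference_score(wav_a, wav_b):
    return float(w_out @ (encode(wav_a) - encode(wav_b)))

a = rng.normal(size=120)
b = rng.normal(size=90)
s_ab = preference_score(a, b)
s_ba = preference_score(b, a)
# s_ab == -s_ba, so the model cannot give contradictory answers
# depending on the order in which the pair is presented.
```

The anti-symmetric construction means the model only ever has to learn one function of the pair; presentation order is handled structurally rather than by data augmentation.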
Related papers
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence [11.217656140423207]
ASAC aims to evaluate the overall speaking proficiency of an L2 speaker in a setting where an interlocutor interacts with one or more candidates.
We propose a hierarchical graph model that aptly incorporates both broad inter-response interactions and nuanced semantic information.
Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy.
arXiv Detail & Related papers (2024-09-11T07:24:07Z)
- MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility
Prediction Model for Hearing Aids [22.736703635666164]
We propose a multi-branched speech intelligibility prediction model (MBI-Net) for predicting subjective intelligibility scores of hearing aid (HA) users.
The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores.
arXiv Detail & Related papers (2022-04-07T09:13:44Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- The Performance Evaluation of Attention-Based Neural ASR under Mixed
Speech Input [1.776746672434207]
We present mixtures of speech signals to a popular attention-based neural ASR system known as Listen, Attend, and Spell (LAS).
In particular, we investigate in detail which phoneme is predicted when two phonemes are mixed.
Our results show that the model, when presented with mixed phoneme signals, tends to predict the phonemes that have higher accuracies.
arXiv Detail & Related papers (2021-08-03T02:08:22Z)
- Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2 based models on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline by a large margin on predicting Reddit feedback.
arXiv Detail & Related papers (2020-09-15T10:50:05Z)
- Score-informed Networks for Music Performance Assessment [64.12728872707446]
Deep neural network-based methods incorporating score information into MPA models have not yet been investigated.
We introduce three different models capable of score-informed performance assessment.
arXiv Detail & Related papers (2020-08-01T07:46:24Z)
- Comparison of Speech Representations for Automatic Quality Estimation in
Multi-Speaker Text-to-Speech Synthesis [21.904558308567122]
We aim to characterize how different speakers contribute to the perceived output quality of multi-speaker Text-to-Speech synthesis.
We automatically rate the quality of TTS using a neural network (NN) trained on human mean opinion score (MOS) ratings.
arXiv Detail & Related papers (2020-02-28T10:44:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.