Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
- URL: http://arxiv.org/abs/2209.06358v1
- Date: Wed, 14 Sep 2022 00:45:49 GMT
- Title: Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
- Authors: Michael Chinen, Jan Skoglund, Chandan K A Reddy, Alessandro Ragano, Andrew Hines
- Abstract summary: The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels.
This study examines how much of the variance in subjective speech-quality ratings can be explained by metadata and by the dataset's distribution imbalances.
- Score: 71.93633698146002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Non-reference speech quality models are important for a growing number of
applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice
conversion and text-to-speech samples with subjective labels. This study examines
how much of the variance in subjective speech-quality ratings can be explained by
metadata and by the dataset's distribution imbalances. Speech quality models were
constructed using wav2vec 2.0 with additional metadata features, including rater
groups and system identifiers, and obtained competitive metrics: a Spearman rank
correlation coefficient (SRCC) of 0.934 and an MSE of 0.088 at the system level,
and 0.877 and 0.198 at the utterance level. Using data and metadata that the
challenge restricted or blinded at test time further improved the metrics. A
metadata analysis showed that the system-level metrics do not faithfully reflect
the model's system-level performance, because the number of utterances per system
varies widely in the validation and test sets. We conclude that, in general,
conditions should have enough utterances in the test set to bound the sample-mean
error and be relatively balanced in utterance count across systems; otherwise, the
utterance-level metrics may be more reliable and interpretable.
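To make the system-level versus utterance-level distinction concrete, here is a minimal sketch of how the two levels of SRCC and MSE are typically computed, along with the standard error of each system's sample mean that the balance recommendation is meant to bound. It assumes numpy/scipy and hypothetical per-utterance arrays (`y_true`, `y_pred`, `system_ids`); it illustrates the metric definitions, not the authors' evaluation code.

```python
import numpy as np
from scipy.stats import sem, spearmanr

# Hypothetical per-utterance data: subjective MOS labels, model predictions,
# and the identifier of the system that produced each utterance.
y_true = np.array([3.2, 3.8, 4.1, 2.5, 2.9, 4.6, 4.4])
y_pred = np.array([3.0, 3.9, 4.0, 2.8, 3.1, 4.5, 4.2])
system_ids = np.array(["A", "A", "A", "B", "B", "C", "C"])

# Utterance-level metrics: computed directly over all utterances.
utt_srcc = spearmanr(y_true, y_pred).correlation
utt_mse = np.mean((y_true - y_pred) ** 2)

# System-level metrics: average labels and predictions per system first,
# then compare the per-system means.
systems = np.unique(system_ids)
sys_true = np.array([y_true[system_ids == s].mean() for s in systems])
sys_pred = np.array([y_pred[system_ids == s].mean() for s in systems])
sys_srcc = spearmanr(sys_true, sys_pred).correlation
sys_mse = np.mean((sys_true - sys_pred) ** 2)

# Standard error of each system's sample-mean MOS: it shrinks as
# 1/sqrt(n_utterances), so systems with few utterances have poorly
# estimated means, which distorts the system-level metrics.
for s in systems:
    n = int((system_ids == s).sum())
    print(s, "n =", n, "SEM of mean MOS =", sem(y_true[system_ids == s]))
```

When some systems contribute only a handful of utterances, the per-system means are noisy, so `sys_srcc` and `sys_mse` can move for reasons unrelated to model quality; that is the sense in which the utterance-level metrics can be the more reliable and interpretable ones.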
Related papers
- Scaling Parameter-Constrained Language Models with Quality Data [32.35610029333478]
Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters.
We extend the conventional understanding of scaling laws by offering a microscopic view of data quality within the original formulation.
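For context, the "original formulation" referenced here is commonly a Chinchilla-style parametric fit of loss against parameter count and token count. A minimal sketch follows, using the published Chinchilla constants for illustration; the paper's quality-dependent extension is not reproduced here.

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style scaling law: expected training loss as a function of
    parameter count N and training tokens D. Constants are the published
    Chinchilla fits and serve only as an illustration; the paper above adds
    a data-quality term to this kind of formulation."""
    return E + A / N**alpha + B / D**beta

# Example: predicted loss for a 1B-parameter model trained on 20B tokens.
print(chinchilla_loss(N=1e9, D=20e9))
```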
arXiv Detail & Related papers (2024-10-04T02:07:17Z)
- Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling [21.82879779173242]
The lack of labeled data is a common challenge in speech classification tasks.
We propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method.
We evaluate our SSL framework on emotion recognition and dementia detection tasks.
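The summary does not spell out the multi-view criterion, so the following is only a generic sketch of agreement-based pseudo-labeling: two hypothetical models trained on different feature views score unlabeled speech, and an example is kept only when both are confident and agree.

```python
import numpy as np

def pseudo_label(view_a_probs, view_b_probs, threshold=0.9):
    """Keep an unlabeled example only when two 'views' (e.g., models trained
    on different feature sets) agree on the class and both are confident.
    Returns indices and labels of the retained examples. Generic sketch,
    not the paper's exact method."""
    labels_a = view_a_probs.argmax(axis=1)
    labels_b = view_b_probs.argmax(axis=1)
    keep = ((labels_a == labels_b)
            & (view_a_probs.max(axis=1) >= threshold)
            & (view_b_probs.max(axis=1) >= threshold))
    return np.flatnonzero(keep), labels_a[keep]

# Toy usage: random class-probability matrices for 5 examples, 3 classes.
rng = np.random.default_rng(0)
idx, labels = pseudo_label(rng.dirichlet([0.2] * 3, size=5),
                           rng.dirichlet([0.2] * 3, size=5))
print(idx, labels)
```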
arXiv Detail & Related papers (2024-09-25T13:51:19Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"influence scores" are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
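The summary does not say how the scores are computed, so the snippet below only illustrates the subset-selection step: rank training examples by some per-example score (here the current loss, a crude stand-in for a true influence score) and keep the highest-scoring fraction.

```python
import numpy as np

def select_by_score(scores, keep_fraction=0.5):
    """Keep the indices of the top `keep_fraction` of examples by score.
    `scores` here are per-example losses, used as a cheap stand-in; real
    influence scores are typically derived from gradients or from the
    change in held-out loss when an example is removed."""
    k = max(1, int(len(scores) * keep_fraction))
    return np.argsort(scores)[-k:]

losses = np.array([0.1, 2.3, 0.4, 1.7, 0.05, 0.9])
print(select_by_score(losses))  # -> [5 3 1], the three highest-loss examples
```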
arXiv Detail & Related papers (2023-11-27T20:19:22Z)
- Investigating model performance in language identification: beyond simple error statistics [28.128924654154087]
Language development experts need tools that can automatically identify languages from fluent, conversational speech.
We investigate how well a number of language identification systems perform on individual recordings and speech units with different linguistic properties.
arXiv Detail & Related papers (2023-05-30T10:32:53Z)
- CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment [12.497279501767606]
We propose a novel end-to-end model structure, the Convolutional Context-Aware Transformer (CCAT) network, to predict the mean opinion score (MOS) of human raters.
We evaluate our model on three MOS-annotated datasets spanning multiple languages and distortion types and submit our results to the ConferencingSpeech 2022 Challenge.
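The summary names only the overall shape of the model (a convolutional front end feeding a transformer, pooled to a single MOS value). Below is a minimal PyTorch sketch of that shape with invented layer sizes; it omits the paper's context-aware details and is not the CCAT architecture itself.

```python
import torch
import torch.nn as nn

class ConvTransformerMOS(nn.Module):
    """Generic conv-frontend + transformer MOS regressor (illustrative)."""
    def __init__(self, n_mels=80, d_model=128):
        super().__init__()
        self.conv = nn.Sequential(  # downsample mel frames in time
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, mels):                 # mels: (batch, n_mels, frames)
        x = self.conv(mels).transpose(1, 2)  # -> (batch, frames', d_model)
        x = self.encoder(x)
        return self.head(x).mean(dim=1).squeeze(-1)  # pool frames to one MOS

model = ConvTransformerMOS()
print(model(torch.randn(2, 80, 400)).shape)  # -> torch.Size([2])
```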
arXiv Detail & Related papers (2022-11-04T16:46:11Z)
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation.
Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS relative to human recordings at the sentence level.
arXiv Detail & Related papers (2022-05-09T16:57:35Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
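As a rough illustration of listener-dependent MOS modeling (invented sizes and layers, not LDNet itself), an utterance representation can be conditioned on a learned listener embedding, and inference can be stabilized by averaging predictions over all known listeners; the averaging strategy here is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class ListenerDependentMOS(nn.Module):
    """Toy listener-dependent MOS predictor (illustrative, not LDNet)."""
    def __init__(self, n_listeners, feat_dim=256, emb_dim=32):
        super().__init__()
        self.listener_emb = nn.Embedding(n_listeners, emb_dim)
        self.head = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, utt_feat, listener_id):
        z = torch.cat([utt_feat, self.listener_emb(listener_id)], dim=-1)
        return self.head(z).squeeze(-1)

    def predict_mean_over_listeners(self, utt_feat):
        # Score one utterance for every known listener and average,
        # yielding a single stable MOS estimate (sketch assumption).
        n = self.listener_emb.num_embeddings
        feats = utt_feat.unsqueeze(0).expand(n, -1)
        return self.forward(feats, torch.arange(n)).mean()

model = ListenerDependentMOS(n_listeners=10)
print(model.predict_mean_over_listeners(torch.randn(256)))
```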
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean-data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space (see the sketch after this entry).
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that the method not only outperforms baseline models on a real-world (noisy) corpus but also improves robustness, producing high-quality results in noisy environments.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
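A minimal version of the "similar vector space" idea in the entry above is a feature-alignment penalty between the clean and noisy domains, added to the task loss. The RBF-kernel MMD below is an illustrative choice of alignment term, not necessarily the paper's objective.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Maximum mean discrepancy with an RBF kernel between two batches of
    embeddings x (clean domain) and y (noisy domain). Minimizing this term
    alongside the task loss pulls the two feature distributions together.
    Illustrative alignment loss only."""
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

clean = torch.randn(16, 64)        # hypothetical clean-utterance embeddings
noisy = torch.randn(16, 64) + 0.5  # hypothetical noisy-utterance embeddings
print(rbf_mmd(clean, noisy))       # larger when the domains differ more
```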
This list is automatically generated from the titles and abstracts of the papers on this site.