Uncertainty as a Predictor: Leveraging Self-Supervised Learning for
Zero-Shot MOS Prediction
- URL: http://arxiv.org/abs/2312.15616v1
- Date: Mon, 25 Dec 2023 05:35:28 GMT
- Title: Uncertainty as a Predictor: Leveraging Self-Supervised Learning for
Zero-Shot MOS Prediction
- Authors: Aditya Ravuri, Erica Cooper, Junichi Yamagishi
- Abstract summary: This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings.
We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning models, such as wav2vec, correlate with MOS scores from the 2022 and 2023 VoiceMOS challenges.
- Score: 40.51248841706311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting audio quality in voice synthesis and conversion systems is a
critical yet challenging task, especially when traditional methods like Mean
Opinion Scores (MOS) are cumbersome to collect at scale. This paper addresses
the gap in efficient audio quality prediction, especially in low-resource
settings where extensive MOS data from large-scale listening tests may be
unavailable. We demonstrate that uncertainty measures derived from
out-of-the-box pretrained self-supervised learning (SSL) models, such as
wav2vec, correlate with MOS scores. These findings are based on data from the
2022 and 2023 VoiceMOS challenges. We explore the extent of this correlation
across different models and language contexts, revealing insights into how
inherent uncertainties in SSL models can serve as effective proxies for audio
quality assessment. In particular, we show that the contrastive wav2vec models
are the most performant in all settings.
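To make the idea concrete, below is a minimal sketch (not the authors' released code) of how an out-of-the-box wav2vec 2.0 checkpoint's own self-supervised objective can be turned into a per-utterance uncertainty score and then rank-correlated with MOS labels. The checkpoint name, the masking hyperparameters, and the use of cosine similarity between the predicted and quantized frames as the confidence signal are illustrative assumptions; the paper derives its uncertainty measures from the pretrained models' contrastive objectives.

```python
import torch
import torchaudio
from scipy.stats import spearmanr
from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base").eval()

def uncertainty_score(wav_path: str) -> float:
    """Higher score = the SSL model is worse at predicting its own masked frames."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")

    batch, raw_len = inputs.input_values.shape
    seq_len = model._get_feat_extract_output_lengths(raw_len).item()
    # Randomly mask a fraction of frames, as during wav2vec 2.0 pre-training.
    mask = torch.tensor(
        _compute_mask_indices((batch, seq_len), mask_prob=0.2, mask_length=2),
        dtype=torch.bool,
    )

    with torch.no_grad():
        out = model(inputs.input_values, mask_time_indices=mask)

    # Cosine similarity between the context network's predictions and the
    # quantized targets; low similarity at masked frames is read as high uncertainty.
    sim = torch.cosine_similarity(out.projected_states, out.projected_quantized_states, dim=-1)
    return float(1.0 - sim[mask].mean())

# wav_paths / mos_labels stand in for a VoiceMOS-style listening-test set (hypothetical names).
# scores = [uncertainty_score(p) for p in wav_paths]
# rho, _ = spearmanr(scores, mos_labels)  # expect a negative rank correlation with MOS
```

With VoiceMOS-style labels, the Spearman correlation between such uncertainty scores and MOS would be expected to be negative: the less confidently the SSL model reconstructs an utterance, the lower its perceived quality.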
Related papers
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z) - Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs [3.8300818830608345]
Multi-modal contrastive learning strategies for audio and text have rapidly gained interest.
The ability of these models to understand natural language and temporal relations is still a largely unexplored and open field for research.
We propose to equip multi-modal ALMs with temporal understanding, without losing their inherent prior capabilities on audio-language tasks, via a temporal instillation method, TeminAL.
arXiv Detail & Related papers (2024-08-17T18:53:17Z) - MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z) - Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction
in Text-to-Speech for Low-Resource Languages [1.1852406625172218]
We train a MOS prediction model based on wav2vec 2.0 using the open-access data sets BVCC and SOMOS.
Our test with neural TTS data in the low-resource language (LRL) West Frisian shows that pre-training on BVCC before fine-tuning on SOMOS leads to the best accuracy for both fine-tuned and zero-shot prediction.
arXiv Detail & Related papers (2023-05-30T20:19:56Z) - Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using
Prosodic and Linguistic Features [54.48824266041105]
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models.
We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems.
All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
arXiv Detail & Related papers (2022-11-01T09:18:50Z) - Comparison of Speech Representations for the MOS Prediction System [1.2949520455740093]
We conduct experiments on a large-scale listening test corpus collected from past Blizzard and Voice Conversion Challenges.
We find that the wav2vec feature set shows the best generalization even though the given ground truth was not always reliable.
arXiv Detail & Related papers (2022-06-28T08:18:18Z) - Improving Self-Supervised Learning-based MOS Prediction Networks [0.0]
The present work introduces data-, training-, and post-training-specific improvements to a previous self-supervised learning-based MOS prediction model.
We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers (a rough sketch of this kind of architecture follows this list).
The methods are evaluated using the shared synthetic speech dataset of the first Voice MOS challenge.
arXiv Detail & Related papers (2022-04-23T09:19:16Z) - SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural
Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z) - Membership Inference Attacks Against Self-supervised Speech Models [62.73937175625953]
Self-supervised learning (SSL) on continuous speech has started gaining attention.
We present the first privacy analysis on several SSL speech models using Membership Inference Attacks (MIA) under black-box access.
arXiv Detail & Related papers (2021-11-09T13:00:24Z)
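As referenced in the "Improving Self-Supervised Learning-based MOS Prediction Networks" entry above, the common supervised alternative to the zero-shot approach is to fine-tune an SSL encoder with a small regression head. The sketch below shows the general shape of such a predictor (a wav2vec 2.0 encoder followed by an LSTM and non-linear dense layers); the layer sizes, mean pooling, and checkpoint name are illustrative assumptions rather than that paper's exact configuration.

```python
import torch
from torch import nn
from transformers import Wav2Vec2Model

class SSLMOSPredictor(nn.Module):
    """SSL encoder + LSTM + non-linear dense head producing an utterance-level MOS estimate."""

    def __init__(self, ssl_name: str = "facebook/wav2vec2-base", hidden: int = 128):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)
        self.lstm = nn.LSTM(
            self.ssl.config.hidden_size, hidden, batch_first=True, bidirectional=True
        )
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        feats = self.ssl(input_values).last_hidden_state   # (B, T, D)
        seq, _ = self.lstm(feats)                           # (B, T, 2 * hidden)
        frame_scores = self.head(seq).squeeze(-1)           # per-frame scores (B, T)
        return frame_scores.mean(dim=-1)                    # utterance-level MOS estimate (B,)
```

Training would regress these utterance-level outputs against listening-test MOS labels (for example with an L1 or MSE loss) on a dataset such as BVCC or SOMOS.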