Comparison of Speech Representations for the MOS Prediction System
- URL: http://arxiv.org/abs/2206.13817v1
- Date: Tue, 28 Jun 2022 08:18:18 GMT
- Title: Comparison of Speech Representations for the MOS Prediction System
- Authors: Aki Kunikoshi, Jaebok Kim, Wonsuk Jun and Kåre Sjölander (ReadSpeaker)
- Abstract summary: We conduct experiments on a large-scale listening test corpus collected from past Blizzard and Voice Conversion Challenges.
We find that the wav2vec feature set shows the best generalization even though the given ground truth is not always reliable.
- Score: 1.2949520455740093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic methods for predicting the Mean Opinion Score (MOS) of listeners have been
researched to assure the quality of Text-to-Speech systems. Many previous
studies focus on architectural advances (e.g. MBNet, LDNet, etc.) to capture the
relation between spectral features and MOS more effectively, and they achieve
high accuracy. However, the optimal representation in terms of generalization
capability still remains largely unknown. To this end, we compare the
performance of Self-Supervised Learning (SSL) features obtained with the
wav2vec framework to that of spectral features such as the magnitude
spectrogram and the mel-spectrogram. Moreover, we propose combining the SSL
features with features that we believe retain information essential to
automatic MOS prediction, so that the two compensate for each other's
drawbacks. We conduct comprehensive experiments on a large-scale listening
test corpus collected from past Blizzard and Voice Conversion Challenges. We
find that the wav2vec feature set shows the best generalization even though
the given ground truth is not always reliable. Furthermore, we find that the
combinations perform best and analyze how they bridge the gap between the
spectral and wav2vec feature sets.
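To make the comparison above concrete, the following is a minimal sketch (not the authors' implementation) that extracts the two feature families compared in the paper, a log-mel-spectrogram and frame-level wav2vec 2.0 hidden states, and fuses them by simple concatenation into a small MOS regression head. The checkpoint name, feature dimensions, pooling, and fusion strategy are illustrative assumptions; the paper's actual architecture and combination scheme may differ.

```python
# Sketch: spectral vs. SSL (wav2vec 2.0) features for MOS prediction,
# plus a naive fused regressor. All model names and sizes are assumptions.
import torch
import torch.nn as nn
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

SR = 16_000  # sampling rate expected by the pretrained wav2vec 2.0 model

# Spectral front-end: 80-bin log-mel-spectrogram (a common, assumed setting).
mel_frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=1024, hop_length=256, n_mels=80)

# SSL front-end: frame-level hidden states of a pretrained wav2vec 2.0 encoder.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def spectral_features(wav: torch.Tensor) -> torch.Tensor:
    """(1, samples) waveform -> (frames, 80) log-mel features."""
    mel = mel_frontend(wav)                     # (1, 80, frames)
    return torch.log(mel + 1e-6).squeeze(0).T   # (frames, 80)

@torch.no_grad()
def ssl_features(wav: torch.Tensor) -> torch.Tensor:
    """(1, samples) waveform -> (frames, 768) wav2vec 2.0 hidden states."""
    inputs = extractor(wav.squeeze(0).numpy(), sampling_rate=SR, return_tensors="pt")
    return wav2vec(inputs.input_values).last_hidden_state.squeeze(0)

class FusedMOSPredictor(nn.Module):
    """Utterance-level MOS regressor over mean-pooled spectral + SSL features."""
    def __init__(self, spec_dim: int = 80, ssl_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(spec_dim + ssl_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, spec: torch.Tensor, ssl: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([spec.mean(dim=0), ssl.mean(dim=0)])  # utterance pooling
        return self.head(pooled)  # unbounded score; scale to 1-5 after training

if __name__ == "__main__":
    wav, sr = torchaudio.load("sample.wav")      # placeholder file name
    wav = wav.mean(dim=0, keepdim=True)          # force mono
    wav = torchaudio.functional.resample(wav, sr, SR)
    mos = FusedMOSPredictor()(spectral_features(wav), ssl_features(wav))
    print(f"predicted MOS (untrained head): {mos.item():.2f}")
```

Training the head against listening-test labels (e.g. with an L1 or L2 loss) would follow the usual supervised recipe; dropping either input of the concatenation reduces the model to a purely spectral or purely SSL predictor, which is the comparison the paper reports.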
Related papers
- The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech [28.168242593106566]
We present our system (denoted as T05) for the VoiceMOS Challenge (VMC) 2024.
Our system was designed for the VMC 2024 Track 1, which focused on the accurate prediction of naturalness mean opinion score (MOS) for high-quality synthetic speech.
In the VMC 2024 Track 1, our T05 system achieved first place in 7 out of 16 evaluation metrics and second place in the remaining 9 metrics, with a significant difference compared to those ranked third and below.
arXiv Detail & Related papers (2024-09-14T05:03:18Z)
- Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction [40.51248841706311]
This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings.
We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning models, such as wav2vec, correlate with VoiceMOS scores (see the sketch after this list).
arXiv Detail & Related papers (2023-12-25T05:35:28Z)
- Comparative Analysis of the wav2vec 2.0 Feature Extractor [42.18541127866435]
We study the capability of the wav2vec 2.0 feature extractor to replace the standard feature extraction methods in a connectionist temporal classification (CTC) ASR model.
We show that such learned front-ends are competitive with traditional feature extractors on the LibriSpeech benchmark and analyze the effect of the individual components.
arXiv Detail & Related papers (2023-08-08T14:29:35Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Deep Spectro-temporal Artifacts for Detecting Synthesized Speech [57.42110898920759]
This paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection).
In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, and deep embedding features.
We ranked 4th and 5th in track 1 and track 2, respectively.
arXiv Detail & Related papers (2022-10-11T08:31:30Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
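The "Uncertainty as a Predictor" entry above reports that uncertainty from off-the-shelf pretrained SSL models correlates with MOS, but its summary does not name the exact measure. The sketch below illustrates that zero-shot idea under an assumption: mean frame-level softmax entropy of a pretrained wav2vec 2.0 CTC model is used as the uncertainty proxy, which may differ from the paper's actual estimator.

```python
# Zero-shot quality scoring via model uncertainty: no MOS labels are used.
# The entropy-based proxy and the checkpoint name are assumptions.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

SR = 16_000
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

@torch.no_grad()
def mean_frame_entropy(path: str) -> float:
    """Average per-frame softmax entropy; higher means the model is less certain."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav.mean(dim=0), sr, SR)  # mono, 16 kHz
    inputs = processor(wav.numpy(), sampling_rate=SR, return_tensors="pt")
    logits = model(inputs.input_values).logits                     # (1, frames, vocab)
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # (1, frames)
    return entropy.mean().item()

if __name__ == "__main__":
    # Rank utterances by negated uncertainty as a zero-shot stand-in for MOS.
    for f in ["system_a.wav", "system_b.wav"]:                     # placeholder files
        print(f, -mean_frame_entropy(f))
```

Because no listening-test labels are involved, such a score can only rank utterances or systems; mapping it onto the 1-5 MOS scale would require calibration against human ratings.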