Evaluation of Speech Representations for MOS prediction
- URL: http://arxiv.org/abs/2306.09979v1
- Date: Fri, 16 Jun 2023 17:21:42 GMT
- Title: Evaluation of Speech Representations for MOS prediction
- Authors: Frederico S. Oliveira, Edresson Casanova, Arnaldo Cândido Júnior, Lucas R. S. Gris, Anderson S. Soares, and Arlindo R. Galvão Filho
- Abstract summary: In this paper, we evaluate feature extraction models for predicting speech quality.
We also propose a model architecture to compare embeddings of supervised learning and self-supervised learning models with embeddings of speaker verification models.
- Score: 0.7329200485567826
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we evaluate feature extraction models for predicting speech
quality. We also propose a model architecture to compare embeddings of
supervised learning and self-supervised learning models with embeddings of
speaker verification models to predict the MOS metric. Our experiments were
performed on the VCC2018 dataset and a Brazilian-Portuguese dataset called
BRSpeechMOS, which was created for this work. The results show that the Whisper
model performs well in all scenarios, with both the VCC2018 and BRSpeechMOS
datasets. Among the supervised and self-supervised learning models using
BRSpeechMOS, Whisper-Small achieved the best linear correlation of 0.6980, and
the speaker verification model, SpeakerNet, had a linear correlation of 0.6963.
Using VCC2018, the best supervised and self-supervised learning model,
Whisper-Large, achieved a linear correlation of 0.7274, and the best speaker
verification model, TitaNet, achieved a linear correlation of 0.6933.
Although the results of the speaker verification models are slightly lower, the
SpeakerNet model has only 5M parameters, making it suitable for real-time
applications, and the TitaNet model produces an embedding of size 192, the
smallest among all the evaluated models. The experimental results are
reproducible with publicly available source code.
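A minimal sketch of the kind of pipeline the abstract describes, assuming mean-pooled Whisper encoder states as the utterance embedding and a simple linear regression head; both choices are illustrative, not the paper's exact architecture, and the evaluation metric is the linear (Pearson) correlation reported above:

```python
# Hypothetical sketch (not the authors' released code): predict MOS from
# mean-pooled Whisper encoder embeddings with a linear head, and report
# the linear (Pearson) correlation. Assumes (waveform, mos) pairs at 16 kHz.
import torch
from scipy.stats import pearsonr
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
encoder.eval()

def embed(waveform):
    """Mean-pool the Whisper encoder states into one utterance embedding."""
    feats = extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        states = encoder(feats.input_features).last_hidden_state  # (1, T, D)
    return states.mean(dim=1).squeeze(0)                          # (D,)

def train_head(pairs, epochs=10):
    """Fit a linear regression head: embedding -> scalar MOS."""
    X = torch.stack([embed(w) for w, _ in pairs])
    y = torch.tensor([m for _, m in pairs], dtype=torch.float32)
    head = torch.nn.Linear(X.shape[1], 1)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(head(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return head

def evaluate(head, pairs):
    """Pearson correlation between predicted and rated MOS."""
    X = torch.stack([embed(w) for w, _ in pairs])
    y = [m for _, m in pairs]
    pred = head(X).squeeze(-1).tolist()
    return pearsonr(pred, y)[0]
```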
Related papers
- What Should Baby Models Read? Exploring Sample-Efficient Data Composition on Model Performance [0.0]
We evaluate several dataset sources, including child-directed speech (CHILDES), classic books (Gutenberg), synthetic data (TinyStories) and a mix of these across different model sizes.
Our experiments show that smaller models (e.g., GPT2-97M, GPT2-705M, Llama-360M) perform better when trained on more complex and rich datasets like Gutenberg.
arXiv Detail & Related papers (2024-11-11T02:37:21Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
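A minimal sketch of the projection idea, assuming the bias directions are already known and orthonormal; the paper's calibrated projection matrix is not reproduced here:

```python
# Sketch of the core projection step, assuming unit-norm bias directions
# stacked as rows of V; this omits the paper's calibration procedure.
import numpy as np

def project_out(embeddings, V):
    """Remove the span of the bias directions V from each embedding.

    embeddings: (n, d) array; V: (k, d) array of orthonormal bias directions.
    Applies P = I - V^T V, so components along biased directions vanish.
    """
    P = np.eye(V.shape[1]) - V.T @ V
    return embeddings @ P.T
```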
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
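A minimal sketch of merging in parameter space, using plain uniform averaging of state dicts as a stand-in for the paper's weighted fusion method:

```python
# Simplest parameter-space merge: uniformly average the weights of models
# fine-tuned from the same initialization. The paper's method is a more
# careful weighted variant; this only illustrates merging in weight space.
import torch

def merge_state_dicts(state_dicts):
    """Average corresponding float tensors across model state dicts;
    non-float entries (e.g. integer buffers) are taken from the first."""
    merged = {}
    for name, first in state_dicts[0].items():
        if first.is_floating_point():
            merged[name] = torch.stack([sd[name] for sd in state_dicts]).mean(dim=0)
        else:
            merged[name] = first.clone()
    return merged

# Usage: model.load_state_dict(merge_state_dicts([a.state_dict(), b.state_dict()]))
```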
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Raw waveform speaker verification for supervised and self-supervised learning [30.08242210230669]
This paper proposes a new raw waveform speaker verification model that incorporates techniques proven effective for speaker verification.
Under the best performing configuration, the model shows an equal error rate of 0.89%, competitive with state-of-the-art models.
We also explore the proposed model with a self-supervised learning framework and show the state-of-the-art performance in this line of research.
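For reference, equal error rate (EER) is the operating point where the false-accept and false-reject rates coincide; a standard way to estimate it from verification trial scores, independent of this paper's model:

```python
# Standard EER estimate from verification trial scores (not code from the
# paper): find the threshold where false-accept and false-reject rates meet.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for same-speaker trials, 0 for different; scores: similarity."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2
```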
arXiv Detail & Related papers (2022-03-16T09:28:03Z)
- Deep Learning Models for Knowledge Tracing: Review and Empirical Evaluation [2.423547527175807]
We review and evaluate a body of deep learning knowledge tracing (DLKT) models with openly available and widely used data sets.
The evaluated DLKT models have been reimplemented to assess the replicability of previously reported results.
arXiv Detail & Related papers (2021-12-30T14:19:27Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E^3), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
- Normalizing Flow based Hidden Markov Models for Classification of Speech Phones with Explainability [25.543231171094384]
In pursuit of explainability, we develop generative models for sequential data.
We combine modern neural networks (normalizing flows) and traditional generative models (hidden Markov models, HMMs).
The proposed generative models can compute the likelihood of the data and hence are directly suited to a maximum-likelihood (ML) classification approach.
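The ML classification rule reduces to an argmax over per-class log-likelihoods; schematically, with `class_models` and its `log_likelihood` method as illustrative names standing in for the per-phone flow-based HMMs:

```python
# Schematic ML classification: each class has a generative model that can
# score log p(x | class); pick the class with the highest likelihood.
def ml_classify(x, class_models):
    """class_models: dict mapping class label -> generative model (hypothetical API)."""
    scores = {c: m.log_likelihood(x) for c, m in class_models.items()}
    return max(scores, key=scores.get)
```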
arXiv Detail & Related papers (2021-07-01T20:10:55Z)
- ModelDiff: Testing-Based DNN Similarity Comparison for Model Reuse Detection [9.106864924968251]
ModelDiff is a testing-based approach to deep learning model similarity comparison.
A study on mobile deep learning apps has shown the feasibility of ModelDiff on real-world models.
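A rough sketch of the testing-based idea, not ModelDiff's actual algorithm: probe two models with the same inputs and compare how similarly their outputs behave:

```python
# Illustrative only: run two black-box models on shared probe inputs and
# measure similarity of their flattened output profiles via cosine similarity.
import numpy as np

def behavioral_similarity(model_a, model_b, probes):
    """model_a, model_b: callables mapping an input to an output vector."""
    out_a = np.concatenate([np.asarray(model_a(x)).ravel() for x in probes])
    out_b = np.concatenate([np.asarray(model_b(x)).ravel() for x in probes])
    return float(out_a @ out_b / (np.linalg.norm(out_a) * np.linalg.norm(out_b)))
```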
arXiv Detail & Related papers (2021-06-11T15:16:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.