MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module
- URL: http://arxiv.org/abs/2301.07087v2
- Date: Thu, 29 Jun 2023 06:33:58 GMT
- Title: MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module
- Authors: Ondřej Plátek, Ondřej Dušek
- Abstract summary: We present MooseNet, a trainable speech metric that predicts the listeners' Mean Opinion Score (MOS).
We propose a novel approach where the Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) model.
We show that PLDA works well with a non-finetuned SSL model when trained only on 136 utterances.
- Score: 3.42658286826597
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present MooseNet, a trainable speech metric that predicts the listeners'
Mean Opinion Score (MOS). We propose a novel approach where the Probabilistic
Linear Discriminant Analysis (PLDA) generative model is used on top of an
embedding obtained from a self-supervised learning (SSL) neural network (NN)
model. We show that PLDA works well with a non-finetuned SSL model when trained
only on 136 utterances (ca. one minute training time) and that PLDA
consistently improves various neural MOS prediction models, even
state-of-the-art models with task-specific fine-tuning. Our ablation study
shows PLDA training superiority over SSL model fine-tuning in a low-resource
scenario. We also improve SSL model fine-tuning using a convenient optimizer
choice and additional contrastive and multi-task training objectives. The
fine-tuned MooseNet NN with the PLDA module achieves the best results,
surpassing the SSL baseline on the VoiceMOS Challenge data.
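This listing contains no code, so here is a minimal sketch of the recipe the abstract describes: pool SSL embeddings per utterance and fit a generative classifier over binned MOS labels. It substitutes scikit-learn's Gaussian LDA for a full PLDA model and uses a public wav2vec2 checkpoint as the SSL encoder; the 0.5-wide MOS bins and every name below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import torch
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

SR = 16_000
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()


@torch.no_grad()
def embed(waveform: np.ndarray) -> np.ndarray:
    """Mean-pool the last hidden states into a single utterance embedding."""
    inputs = extractor(waveform, sampling_rate=SR, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state        # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()


def fit_mos_classifier(waveforms, mos_labels):
    """Bin MOS to the nearest 0.5 and fit a Gaussian LDA over the embeddings."""
    X = np.stack([embed(w) for w in waveforms])
    bins = np.round(np.asarray(mos_labels) * 2) / 2      # e.g. 3.2 -> 3.0
    return LinearDiscriminantAnalysis().fit(X, bins)


def predict_mos(clf, waveform) -> float:
    """Expected MOS = probability-weighted average of the class (bin) centres."""
    proba = clf.predict_proba(embed(waveform)[None, :])[0]
    return float(np.dot(proba, clf.classes_))


if __name__ == "__main__":
    # Toy demo on random noise; replace with real utterances and listener MOS.
    rng = np.random.default_rng(0)
    waves = [rng.standard_normal(SR).astype(np.float32) for _ in range(12)]
    mos = [2.0, 3.0, 4.0] * 4
    clf = fit_mos_classifier(waves, mos)
    print("predicted MOS:", predict_mos(clf, waves[0]))
```

Fitting such a classifier on a hundred-odd embeddings takes seconds on a CPU, which is in the same spirit as the low-resource PLDA setting the abstract reports.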
Related papers
- Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation [12.506633315768832]
HuBERT is a successful example that utilizes offline clustering to convert speech features into discrete units for a masked language modeling pretext task.
We present an unsupervised method to improve SSL targets.
Two models are proposed, MonoBERT and PolyBERT, which leverage context-independent and context-dependent phoneme-based units for pre-training.
arXiv Detail & Related papers (2023-06-15T07:45:12Z)
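As context for the offline-clustering step mentioned above, here is a minimal sketch of HuBERT-style unit discovery: cluster frame-level features with k-means and use the cluster ids as discrete pseudo-labels. The MFCC features, cluster count, and random demo waveforms are illustrative assumptions; the MonoBERT/PolyBERT phoneme-based units are not reproduced here.

```python
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

SR = 16_000
mfcc = torchaudio.transforms.MFCC(sample_rate=SR, n_mfcc=13)


def frame_features(waveform: torch.Tensor) -> torch.Tensor:
    """(time,) waveform -> (frames, 13) matrix of MFCC frames."""
    return mfcc(waveform).T


def discover_units(waveforms, n_units: int = 100):
    """Fit k-means on all frames, then label each utterance frame by frame."""
    frames = torch.cat([frame_features(w) for w in waveforms]).numpy()
    km = MiniBatchKMeans(n_clusters=n_units, random_state=0).fit(frames)
    return [km.predict(frame_features(w).numpy()) for w in waveforms]


if __name__ == "__main__":
    waves = [torch.randn(SR * 2) for _ in range(4)]      # four 2-second clips
    units = discover_units(waves, n_units=50)
    print(units[0][:20])                                 # discrete unit ids
```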
- LowDINO -- A Low Parameter Self Supervised Learning Model [0.0]
This research aims to explore the possibility of designing a neural network architecture that allows small networks to adopt the properties of huge networks.
Previous studies have shown that using convolutional neural networks (ConvNets) can provide inherent inductive bias.
To reduce the number of parameters, attention mechanisms are incorporated through the use of MobileViT blocks.
arXiv Detail & Related papers (2023-05-28T18:34:59Z)
- On Data Sampling Strategies for Training Neural Network Speech Separation Models [26.94528951545861]
Speech separation is an important area of multi-speaker signal processing.
Deep neural network (DNN) models have attained the best performance on many speech separation benchmarks.
Some of these models can take significant time to train and have high memory requirements.
Previous work has proposed shortening training examples to address these issues, but the impact of this on model performance is not yet well understood.
arXiv Detail & Related papers (2023-04-14T14:05:52Z)
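A minimal sketch of the "shortening training examples" idea from the entry above: crop each mixture and its reference sources to a fixed-length random segment before batching. The 4-second segment, 8 kHz sample rate, and (mixture, sources) pair format are illustrative assumptions, not the paper's setup.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RandomSegmentDataset(Dataset):
    """Wraps (mixture, sources) pairs and returns fixed-length random crops."""

    def __init__(self, pairs, segment_seconds=4.0, sample_rate=8_000):
        self.pairs = pairs                               # (time,), (n_spk, time)
        self.segment = int(segment_seconds * sample_rate)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        mixture, sources = self.pairs[idx]
        time = mixture.shape[-1]
        if time >= self.segment:                         # crop a random window
            start = torch.randint(0, time - self.segment + 1, (1,)).item()
            end = start + self.segment
            return mixture[start:end], sources[:, start:end]
        pad = self.segment - time                        # zero-pad short clips
        return (torch.nn.functional.pad(mixture, (0, pad)),
                torch.nn.functional.pad(sources, (0, pad)))


# Usage: two random 6-second, 2-speaker examples at 8 kHz.
pairs = [(torch.randn(48_000), torch.randn(2, 48_000)) for _ in range(2)]
mix, src = next(iter(DataLoader(RandomSegmentDataset(pairs), batch_size=2)))
print(mix.shape, src.shape)                              # (2, 32000), (2, 2, 32000)
```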
- CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method specifically designed for SSL speech models, applying CNN adapters at the feature extractor.
We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks.
arXiv Detail & Related papers (2022-12-01T08:50:12Z)
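To make the adapter idea above concrete, here is a generic sketch of a residual 1-D CNN adapter applied to the (frozen) feature-extractor output of an SSL speech model. The layer sizes are assumptions; this is not the exact CHAPTER architecture.

```python
import torch
import torch.nn as nn


class CNNAdapter(nn.Module):
    """Small residual 1-D conv block trained on top of frozen SSL features."""

    def __init__(self, channels=512, bottleneck=64, kernel_size=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size, padding=kernel_size // 2),
            nn.GELU(),
            nn.Conv1d(bottleneck, channels, kernel_size, padding=kernel_size // 2),
        )

    def forward(self, feats):                    # feats: (batch, channels, frames)
        return feats + self.block(feats)         # residual keeps the frozen features


# Usage: pretend these came from a frozen wav2vec2-style CNN feature extractor.
frozen_feats = torch.randn(2, 512, 299)
adapter = CNNAdapter()                           # the only module that is trained
print(adapter(frozen_feats).shape)               # torch.Size([2, 512, 299])
```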
- Model Extraction Attack against Self-supervised Speech Models [52.81330435990717]
Self-supervised learning (SSL) speech models generate meaningful representations of given audio clips.
Model extraction attack (MEA) refers to an adversary stealing the functionality of a victim model with only query access.
We study the MEA problem against SSL speech models with a small number of queries.
arXiv Detail & Related papers (2022-11-29T09:28:05Z)
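A minimal sketch of what query-only model extraction looks like: the adversary queries the black-box victim for representations on its own unlabeled audio and trains a smaller model to imitate them. The toy architectures, query budget, and L1 objective are illustrative assumptions, not the attack studied in the paper.

```python
import torch
import torch.nn as nn

victim = nn.Sequential(                          # stand-in for the black-box SSL model
    nn.Conv1d(1, 64, 10, stride=5), nn.GELU(),
    nn.Conv1d(64, 256, 3, stride=2), nn.GELU(),
).eval()

thief = nn.Sequential(                           # smaller student trained from queries
    nn.Conv1d(1, 32, 10, stride=5), nn.GELU(),
    nn.Conv1d(32, 256, 3, stride=2),
)

opt = torch.optim.Adam(thief.parameters(), lr=1e-3)
for step in range(50):                           # small query budget
    audio = torch.randn(4, 1, 16_000)            # attacker-side unlabeled audio
    with torch.no_grad():
        target = victim(audio)                   # only query access is needed
    loss = nn.functional.l1_loss(thief(audio), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final imitation loss: {loss.item():.4f}")
```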
- Towards Sustainable Self-supervised Learning [193.78876000005366]
We propose a Target-Enhanced Conditional (TEC) scheme which introduces two components into existing mask-reconstruction based SSL.
First, we propose patch-relation enhanced targets, which enhance the targets given by the base model and encourage the new model to learn semantic-relation knowledge from the base model.
Second, we introduce a conditional adapter that adaptively adjusts the new model's predictions to align with the targets of different base models.
arXiv Detail & Related papers (2022-10-20T04:49:56Z)
- Exploring Efficient-tuning Methods in Self-supervised Speech Models [53.633222197712875]
Self-supervised learning can learn powerful representations for different speech tasks.
In downstream tasks, the parameters of SSL models are frozen, and only the adapters are trained.
We show that performance parity can be achieved with over 90% parameter reduction.
arXiv Detail & Related papers (2022-10-10T11:08:12Z)
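The entry above reports performance parity with over 90% fewer trained parameters; the sketch below illustrates the mechanism: freeze the SSL backbone, train only small bottleneck adapters, and count the resulting parameter split. The dummy Transformer backbone and adapter width are stand-in assumptions.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual."""

    def __init__(self, dim=768, bottleneck=32):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))


backbone = nn.TransformerEncoder(                # stand-in for the SSL encoder
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
backbone.requires_grad_(False)                   # the backbone stays frozen downstream

adapters = nn.ModuleList([BottleneckAdapter() for _ in range(12)])  # one per layer

trainable = sum(p.numel() for p in adapters.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable: {trainable:,} of {total:,} "
      f"({100 * (1 - trainable / total):.1f}% of parameters frozen)")
```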
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- A Mixed Integer Programming Approach to Training Dense Neural Networks [0.0]
We propose novel mixed integer programming (MIP) formulations for training fully-connected artificial neural networks (ANNs).
Our formulations can account for both binary activation and rectified linear unit (ReLU) activation ANNs.
We also develop a layer-wise greedy approach, a technique for reducing the number of layers in the ANN, for model pretraining.
arXiv Detail & Related papers (2022-01-03T15:53:51Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
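A minimal sketch of the intermediate-layer supervision idea from the last entry: keep the usual objective on the top layer and add an auxiliary SSL-style loss on an intermediate layer. The dummy encoder, the choice of layer 4, and the masked-frame regression loss are illustrative assumptions rather than the exact ILS-SSL objective.

```python
import torch
import torch.nn as nn

layers = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    for _ in range(8)
])
head_top = nn.Linear(256, 256)        # prediction head on the top layer
head_mid = nn.Linear(256, 256)        # extra prediction head on layer 4

x = torch.randn(2, 100, 256)          # (batch, frames, dim) input features
target = x.clone()                    # stand-in for the masked-frame targets
mask = torch.rand(2, 100) < 0.15      # 15% of frames act as "masked" positions

hidden, intermediates = x, []
for layer in layers:
    hidden = layer(hidden)
    intermediates.append(hidden)

loss_top = nn.functional.mse_loss(head_top(hidden)[mask], target[mask])
loss_mid = nn.functional.mse_loss(head_mid(intermediates[3])[mask], target[mask])
loss = loss_top + loss_mid            # total objective with intermediate supervision
print(float(loss))
```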
This list is automatically generated from the titles and abstracts of the papers on this site.