Mispronunciation detection using self-supervised speech representations
- URL: http://arxiv.org/abs/2307.16324v1
- Date: Sun, 30 Jul 2023 21:20:58 GMT
- Title: Mispronunciation detection using self-supervised speech representations
- Authors: Jazmin Vidal, Pablo Riera and Luciana Ferrer
- Abstract summary: We study the use of SSL models for the task of mispronunciation detection for second language learners.
We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
- Score: 10.010024759851142
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, self-supervised learning (SSL) models have produced
promising results in a variety of speech-processing tasks, especially in
contexts of data scarcity. In this paper, we study the use of SSL models for
the task of mispronunciation detection for second language learners. We compare
two downstream approaches: 1) training the model for phone recognition (PR)
using native English data, and 2) training a model directly for the target task
using non-native English data. We compare the performance of these two
approaches for various SSL representations as well as a representation
extracted from a traditional DNN-based speech recognition model. We evaluate
the models on L2Arctic and EpaDB, two datasets of non-native speech annotated
with pronunciation labels at the phone level. Overall, we find that using a
downstream model trained for the target task gives the best performance and
that most upstream models perform similarly for the task.
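As a rough illustration of the second downstream approach (a model trained directly for the target task on top of frozen SSL features), the sketch below scores each phone segment against the canonical phone the learner attempted. The head architecture, dimensions, and random stand-in data are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the paper's second downstream approach: a classifier
# trained directly for mispronunciation detection on top of frozen SSL
# features. Names, shapes, and the random data are illustrative assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 768      # e.g., wav2vec 2.0 base hidden size
N_PHONES = 40       # size of an assumed phone inventory

class MispronunciationHead(nn.Module):
    """Scores each phone segment as correctly/incorrectly pronounced."""
    def __init__(self, feat_dim: int, n_phones: int):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, feat_dim)
        self.scorer = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, seg_feats: torch.Tensor, target_phones: torch.Tensor):
        # seg_feats: (batch, feat_dim) mean-pooled SSL features per phone segment
        # target_phones: (batch,) canonical phone the learner tried to produce
        x = torch.cat([seg_feats, self.phone_emb(target_phones)], dim=-1)
        return self.scorer(x).squeeze(-1)  # logit: >0 means mispronounced

head = MispronunciationHead(FEAT_DIM, N_PHONES)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Stand-in batch: frozen upstream features and phone-level labels.
feats = torch.randn(32, FEAT_DIM)
phones = torch.randint(0, N_PHONES, (32,))
labels = torch.randint(0, 2, (32,)).float()   # 1 = mispronounced

logits = head(feats, phones)
loss = loss_fn(logits, labels)
loss.backward()
opt.step()
print(f"toy training loss: {loss.item():.3f}")
```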
Related papers
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained with more data outperform monolingual ones, but that, when the amount of data is held fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- ProsAudit, a prosodic benchmark for self-supervised speech models [14.198508548718676]
ProsAudit is a benchmark to assess structural prosodic knowledge in self-supervised learning (SSL) speech models.
It consists of two subtasks, their corresponding metrics, and an evaluation dataset.
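The exact subtask definitions are in the paper; benchmarks in this family typically present pairs of natural and prosodically manipulated utterances and check that the model scores the natural one higher. A minimal sketch of that pairwise accuracy metric, with toy scores standing in for a real model:

```python
# Sketch of a pairwise zero-shot accuracy metric of the kind ProsAudit-style
# benchmarks use: the model should score the natural utterance higher than a
# prosodically manipulated one. The scores below are toy stand-ins.
import numpy as np

def pairwise_accuracy(scores_natural, scores_manipulated):
    """Fraction of pairs where the natural utterance wins (ties count half)."""
    nat = np.asarray(scores_natural, dtype=float)
    man = np.asarray(scores_manipulated, dtype=float)
    return float(np.mean((nat > man) + 0.5 * (nat == man)))

# Toy scores for 5 utterance pairs (e.g., model log-likelihoods).
print(pairwise_accuracy([-1.2, -0.5, -2.0, -0.1, -1.0],
                        [-1.5, -0.4, -2.4, -0.9, -1.0]))  # -> 0.7
```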
arXiv Detail & Related papers (2023-02-23T14:30:23Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Deploying self-supervised learning in the wild for hybrid automatic speech recognition [20.03807843795386]
Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR).
We show how to utilize untranscribed audio data in SSL, from data pre-processing to deploying a streaming hybrid ASR model.
arXiv Detail & Related papers (2022-05-17T19:37:40Z)
- Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning [13.391307807956673]
We propose a novel automatic pronunciation assessment method based on self-supervised learning (SSL) models.
First, the proposed method fine-tunes the pre-trained SSL models with connectionist temporal classification (CTC) to adapt them to the English pronunciation of English-as-a-second-language learners.
We show that the proposed SSL model-based methods outperform the baselines, in terms of the Pearson correlation coefficient, on datasets of Korean ESL learner children and Speechocean762.
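A minimal sketch of the CTC fine-tuning step described above, with a toy recurrent encoder standing in for the pre-trained SSL model; in practice one would start from a real checkpoint (e.g., Wav2Vec2ForCTC in HuggingFace transformers), and all shapes below are assumptions:

```python
# Toy CTC fine-tuning: an encoder is trained to emit phone sequences with CTC
# loss, adapting it to learner speech. The GRU stands in for the SSL model.
import torch
import torch.nn as nn

VOCAB = 41  # assumed phone inventory + 1 for the CTC blank (index 0)

encoder = nn.GRU(input_size=80, hidden_size=128, batch_first=True)
head = nn.Linear(128, VOCAB)
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(4, 200, 80)                 # (batch, frames, mel bins)
targets = torch.randint(1, VOCAB, (4, 30))      # phone labels, no blanks
input_lens = torch.full((4,), 200, dtype=torch.long)
target_lens = torch.full((4,), 30, dtype=torch.long)

hidden, _ = encoder(feats)
log_probs = head(hidden).log_softmax(-1).transpose(0, 1)  # (frames, batch, vocab)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()
print(f"toy CTC loss: {loss.item():.3f}")
```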
arXiv Detail & Related papers (2022-04-08T06:13:55Z)
- A Hierarchical Model for Spoken Language Recognition [29.948719321162883]
Spoken language recognition (SLR) refers to the automatic process used to determine the language present in a speech sample.
We propose a novel hierarchical approach where two PLDA models are trained: one to generate scores for clusters of highly related languages, and a second to generate scores conditional on each cluster.
We show that this hierarchical approach consistently outperforms the non-hierarchical one for detection of highly related languages.
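A toy sketch of the two-stage scoring idea, with unit-variance Gaussian scorers standing in for the paper's PLDA models; the clusters, languages, and embeddings are invented for illustration:

```python
# Two-stage scoring: a first model scores clusters of related languages, a
# second scores languages within each cluster, and the final language score
# combines the two. Gaussian log-scores stand in for PLDA.
import numpy as np

rng = np.random.default_rng(0)
CLUSTERS = {"iberian": ["spanish", "portuguese"], "slavic": ["russian", "polish"]}

# Toy per-cluster and per-language means in a 2-D embedding space.
cluster_mean = {c: rng.normal(size=2) for c in CLUSTERS}
lang_mean = {l: cluster_mean[c] + 0.3 * rng.normal(size=2)
             for c, langs in CLUSTERS.items() for l in langs}

def log_score(x, mean):
    return -0.5 * np.sum((x - mean) ** 2)  # log N(x; mean, I) up to a constant

def hierarchical_scores(x):
    scores = {}
    for cluster, langs in CLUSTERS.items():
        s_cluster = log_score(x, cluster_mean[cluster])
        for lang in langs:
            # Final score = cluster score + language score given the cluster.
            scores[lang] = s_cluster + log_score(x, lang_mean[lang])
    return scores

x = lang_mean["portuguese"] + 0.1 * rng.normal(size=2)  # toy test embedding
scores = hierarchical_scores(x)
print(max(scores, key=scores.get))
```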
arXiv Detail & Related papers (2022-01-04T22:10:36Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly.
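A minimal sketch of intermediate-layer supervision: the same toy masked-unit prediction loss is applied to both the final and an intermediate layer. The layer choice, loss, and pseudo-labels are assumptions, not the ILS-SSL recipe:

```python
# An auxiliary SSL loss on an intermediate layer pushes lower layers toward
# content information, in addition to the usual final-layer loss.
import torch
import torch.nn as nn

D, UNITS, MID_LAYER = 64, 100, 3
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
    for _ in range(6)
)
head = nn.Linear(D, UNITS)  # shared prediction head over discrete targets

x = torch.randn(8, 50, D)                   # (batch, frames, dim)
targets = torch.randint(0, UNITS, (8, 50))  # pseudo-labels (e.g., k-means units)

ce = nn.CrossEntropyLoss()
loss_mid = None
h = x
for i, layer in enumerate(layers, start=1):
    h = layer(h)
    if i == MID_LAYER:  # extra SSL loss on the intermediate layer
        loss_mid = ce(head(h).flatten(0, 1), targets.flatten())
loss_final = ce(head(h).flatten(0, 1), targets.flatten())
loss = loss_final + loss_mid
loss.backward()
print(f"final {loss_final.item():.3f} + intermediate {loss_mid.item():.3f}")
```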
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
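A minimal sketch of an LSTM language model over phoneme-like units, the kind of sub-word speech LM the abstract describes; the five-symbol inventory and repetitive toy sequence are illustrative assumptions:

```python
# LSTM next-unit prediction over a tiny phoneme-like inventory.
import torch
import torch.nn as nn

VOCAB = 5  # tiny assumed phoneme inventory

class PhonemeLM(nn.Module):
    def __init__(self, vocab: int, dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):
        h, _ = self.lstm(self.emb(tokens))
        return self.out(h)  # next-unit logits at each position

lm = PhonemeLM(VOCAB)
opt = torch.optim.Adam(lm.parameters(), lr=1e-2)
seq = torch.tensor([[0, 1, 2, 3, 0, 1, 2, 3, 0, 1]])  # toy "babbling" pattern

for _ in range(100):  # fit next-unit prediction on the toy sequence
    logits = lm(seq[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.flatten(0, 1), seq[:, 1:].flatten()
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.3f}")  # approaches 0 on this pattern
```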
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
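A minimal sketch of a representation-alignment regularizer of this kind: an L2 term pulling paired cross-lingual sentence embeddings together is added to the task loss. The pairing and the 0.5 weight are assumptions, not the paper's recipe:

```python
# Alignment regularizer: penalize distance between paired sentence
# representations across languages, on top of the main SLU task loss.
import torch

def alignment_loss(src_repr: torch.Tensor, tgt_repr: torch.Tensor) -> torch.Tensor:
    """Mean squared distance between paired sentence representations."""
    return ((src_repr - tgt_repr) ** 2).sum(dim=-1).mean()

task_loss = torch.tensor(0.85)                   # stand-in SLU classification loss
src = torch.randn(16, 128, requires_grad=True)   # English sentence embeddings
tgt = src + 0.1 * torch.randn(16, 128)           # paired target-language embeddings
total = task_loss + 0.5 * alignment_loss(src, tgt)  # 0.5: assumed weight
total.backward()
print(f"total loss: {total.item():.3f}")
```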
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with those much larger models on the downstream tasks.
In probing experiments, we find that the intermediate latent representations encode richer phoneme and speaker information than the last layer does.
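A toy sketch of the two ingredients mentioned above: ALBERT-style parameter sharing (a single transformer layer reused at every depth) and a linear probe reading phoneme information from an intermediate latent. Depth, dimensions, and the probed layer are assumptions:

```python
# One shared transformer layer applied repeatedly, plus a linear probe on an
# intermediate latent representation.
import torch
import torch.nn as nn

D, DEPTH, PROBE_LAYER, N_PHONEMES = 64, 12, 6, 40
shared_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)

def encode(x: torch.Tensor):
    """Apply the single shared layer DEPTH times, keeping every latent."""
    latents = []
    for _ in range(DEPTH):
        x = shared_layer(x)  # same parameters at every depth
        latents.append(x)
    return latents

x = torch.randn(2, 30, D)
latents = encode(x)
probe = nn.Linear(D, N_PHONEMES)  # probing classifier, trained separately
phoneme_logits = probe(latents[PROBE_LAYER - 1])  # probe an inner layer
print(phoneme_logits.shape)  # torch.Size([2, 30, 40])

n_params = sum(p.numel() for p in shared_layer.parameters())
print(f"shared-layer params reused across all {DEPTH} layers: {n_params}")
```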
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
- Towards Relevance and Sequence Modeling in Language Recognition [39.547398348702025]
We propose a neural network framework utilizing short-sequence information in language recognition.
A new model is proposed for incorporating relevance into language recognition, where parts of the speech signal are weighted more heavily based on their relevance to the language recognition task.
Experiments are performed on the language recognition task of the NIST LRE 2017 Challenge using clean, noisy, and multi-speaker speech data.
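A minimal sketch of relevance weighting: a small attention module scores each frame, and the utterance embedding is a relevance-weighted average of frame features. The architecture details are assumptions; only the language count matches LRE 2017's 14 targets:

```python
# Relevance-weighted pooling: frames deemed more relevant contribute more to
# the utterance embedding used for language classification.
import torch
import torch.nn as nn

D, N_LANGS = 64, 14  # 14: number of target languages in NIST LRE 2017

attn = nn.Sequential(nn.Linear(D, 32), nn.Tanh(), nn.Linear(32, 1))
classifier = nn.Linear(D, N_LANGS)

frames = torch.randn(4, 120, D)            # (batch, frames, feature dim)
weights = attn(frames).softmax(dim=1)      # per-frame relevance, sums to 1
utt_emb = (weights * frames).sum(dim=1)    # relevance-weighted pooling
lang_logits = classifier(utt_emb)
print(lang_logits.shape)  # torch.Size([4, 14])
```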
arXiv Detail & Related papers (2020-04-02T18:31:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.