Automatic Pronunciation Assessment using Self-Supervised Speech
Representation Learning
- URL: http://arxiv.org/abs/2204.03863v1
- Date: Fri, 8 Apr 2022 06:13:55 GMT
- Title: Automatic Pronunciation Assessment using Self-Supervised Speech
Representation Learning
- Authors: Eesung Kim, Jae-Jin Jeon, Hyeji Seo, Hoon Kim
- Abstract summary: We propose a novel automatic pronunciation assessment method based on self-supervised learning (SSL) models.
First, the proposed method fine-tunes the pre-trained SSL models with connectionist temporal classification (CTC) to adapt them to the English pronunciation of English-as-a-second-language (ESL) learners.
We show that the proposed SSL model-based methods outperform the baselines, in terms of the Pearson correlation coefficient, on datasets of Korean ESL learner children and Speechocean762.
- Score: 13.391307807956673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) approaches such as wav2vec 2.0 and HuBERT
models have shown promising results in various downstream tasks in the speech
community. In particular, speech representations learned by SSL models have
been shown to be effective for encoding various speech-related characteristics.
In this context, we propose a novel automatic pronunciation assessment method
based on SSL models. First, the proposed method fine-tunes the pre-trained SSL
models with connectionist temporal classification (CTC) to adapt them to the
English pronunciation of English-as-a-second-language (ESL) learners in a
data-driven manner. Then, layer-wise contextual representations are extracted
from across all transformer layers of the SSL models. Finally, the automatic
pronunciation score is estimated using a bidirectional long short-term memory
(BiLSTM) network with the layer-wise contextual representations and the
corresponding text. We
show that the proposed SSL model-based methods outperform the baselines, in
terms of the Pearson correlation coefficient, on datasets of Korean ESL learner
children and Speechocean762. Furthermore, we analyze how different
representations of transformer layers in the SSL model affect the performance
of the pronunciation assessment task.
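To make the two-stage recipe concrete, the following is a minimal sketch of the first stage, CTC fine-tuning of a pre-trained SSL model, using the HuggingFace transformers API. The wav2vec2-base-960h checkpoint, the AdamW settings, and the placeholder ESL batch are illustrative assumptions rather than the authors' exact setup.

```python
# Stage 1 (sketch): adapt a pre-trained SSL model to ESL speech with CTC fine-tuning.
# Checkpoint, learning rate, and the placeholder batch are assumptions for illustration.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical ESL training example: 1 s of 16 kHz audio plus its reference transcript.
esl_waveform = torch.randn(16000).numpy()
esl_transcript = "HELLO WORLD"

inputs = processor(esl_waveform, sampling_rate=16000, return_tensors="pt")
labels = processor(text=esl_transcript, return_tensors="pt").input_ids

model.train()
loss = model(inputs.input_values, labels=labels).loss  # CTC loss against the reference text
loss.backward()
optimizer.step()
optimizer.zero_grad()
```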
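The second stage can then read hidden states from every transformer layer of the (fine-tuned) SSL model and regress a score. Below is a minimal PyTorch sketch in which the layers are combined with learnable softmax weights, pooled by a BiLSTM, and mapped to a single utterance-level score; the layer-weighting scheme, pooling, and hyperparameters are assumptions for illustration, and the text-conditioning branch described in the abstract is omitted for brevity.

```python
# Stage 2 (sketch): layer-wise SSL representations -> learnable layer mix -> BiLSTM -> score.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class LayerwisePronunciationScorer(nn.Module):
    def __init__(self, ssl_name: str = "facebook/wav2vec2-base", hidden: int = 256):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)  # ideally the CTC fine-tuned backbone
        n_states = self.ssl.config.num_hidden_layers + 1    # transformer layers + CNN encoder output
        self.layer_weights = nn.Parameter(torch.zeros(n_states))  # learnable mix over layers
        self.blstm = nn.LSTM(self.ssl.config.hidden_size, hidden,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # utterance-level pronunciation score

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        out = self.ssl(input_values, output_hidden_states=True)
        stacked = torch.stack(out.hidden_states)           # (n_states, batch, time, dim)
        w = torch.softmax(self.layer_weights, dim=0)       # normalized per-layer weights
        mixed = (w[:, None, None, None] * stacked).sum(0)  # weighted sum -> (batch, time, dim)
        seq, _ = self.blstm(mixed)                         # (batch, time, 2 * hidden)
        return self.head(seq.mean(dim=1)).squeeze(-1)      # temporal mean pool -> (batch,)

scorer = LayerwisePronunciationScorer()
with torch.no_grad():
    print(scorer(torch.randn(1, 16000)))  # score for 1 s of placeholder 16 kHz audio
```

The learned softmax weights also give a direct readout of which transformer layers contribute most, mirroring the layer-wise analysis discussed at the end of the abstract.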
Related papers
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs).
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO).
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Mispronunciation detection using self-supervised speech representations [10.010024759851142]
We study the use of SSL models for the task of mispronunciation detection for second language learners.
We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
arXiv Detail & Related papers (2023-07-30T21:20:58Z)
- Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z)
- Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech [15.975756437343742]
Recent self-supervised learning (SSL) models have proven to learn rich representations of speech.
We conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA).
Our findings suggest that SSL models learn to align closely with continuous articulations, providing novel insight into speech SSL.
arXiv Detail & Related papers (2022-10-21T04:24:29Z)
- Deploying self-supervised learning in the wild for hybrid automatic speech recognition [20.03807843795386]
Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR).
We show how to utilize untranscribed audio data in SSL, from data pre-processing to deploying a streaming hybrid ASR model.
arXiv Detail & Related papers (2022-05-17T19:37:40Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Self-Supervised Learning of Graph Neural Networks: A Unified Review [50.71341657322391]
Self-supervised learning is emerging as a new paradigm for making use of large amounts of unlabeled samples.
We provide a unified review of different ways of training graph neural networks (GNNs) using SSL.
Our treatment of SSL methods for GNNs sheds light on the similarities and differences of various methods, setting the stage for developing new methods and algorithms.
arXiv Detail & Related papers (2021-02-22T03:43:45Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.