Zero-Shot Automatic Pronunciation Assessment
- URL: http://arxiv.org/abs/2305.19563v1
- Date: Wed, 31 May 2023 05:17:17 GMT
- Title: Zero-Shot Automatic Pronunciation Assessment
- Authors: Hongfu Liu, Mingqian Shi, Ye Wang
- Abstract summary: We propose a novel zero-shot APA method based on the pre-trained acoustic model, HuBERT.
Experimental results on speechocean762 demonstrate that the proposed method achieves comparable performance to supervised regression baselines.
- Score: 19.971348810774046
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automatic Pronunciation Assessment (APA) is vital for computer-assisted
language learning. Prior methods rely on annotated speech-text data to train
Automatic Speech Recognition (ASR) models or speech-score data to train
regression models. In this work, we propose a novel zero-shot APA method based
on the pre-trained acoustic model, HuBERT. Our method involves encoding speech
input and corrupting it via a masking module. We then employ the Transformer
encoder and apply k-means clustering to obtain token sequences. Finally, a
scoring module is designed to measure the number of wrongly recovered tokens.
Experimental results on speechocean762 demonstrate that the proposed method
achieves comparable performance to supervised regression baselines and
outperforms non-regression baselines in terms of Pearson Correlation
Coefficient (PCC). Additionally, we analyze how masking strategies affect the
performance of APA.
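The pipeline described in the abstract (encode, mask, re-encode, cluster into tokens, count wrongly recovered tokens) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_encoder` is a hypothetical stand-in for HuBERT's Transformer encoder, masking is done by zeroing frames, the centroids are assumed to come from a previously fitted k-means model, and the score is normalized to a mismatch rate over masked positions. The `pearson` helper shows how such scores might be compared against human ratings with PCC.

```python
import numpy as np

def assign_tokens(features, centroids):
    """Map each frame (row) to the index of its nearest centroid."""
    # Squared Euclidean distance from every frame to every centroid: (T, K).
    diffs = features[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=-1).argmin(axis=1)

def mask_frames(features, mask_positions):
    """Corrupt the input by zeroing the selected frames (assumed masking)."""
    corrupted = features.copy()
    corrupted[mask_positions] = 0.0
    return corrupted

def toy_encoder(features):
    """Hypothetical stand-in for a Transformer encoder.

    Each frame is averaged with its two neighbours, so a masked frame is
    partly reconstructed from its context.
    """
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def mismatch_rate(reference_tokens, recovered_tokens, mask_positions):
    """Fraction of masked positions whose recovered token is wrong."""
    ref = reference_tokens[mask_positions]
    rec = recovered_tokens[mask_positions]
    return float(np.mean(ref != rec))

def score_utterance(features, centroids, mask_positions):
    """Compare tokens from the clean input with tokens from the masked input."""
    reference = assign_tokens(toy_encoder(features), centroids)
    corrupted = mask_frames(features, mask_positions)
    recovered = assign_tokens(toy_encoder(corrupted), centroids)
    return mismatch_rate(reference, recovered, mask_positions)

def pearson(x, y):
    """Pearson Correlation Coefficient between two score lists."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

if __name__ == "__main__":
    frames = np.array([[0.0, 0.0]] * 3 + [[1.0, 1.0]] * 3)
    centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
    print(score_utterance(frames, centroids, np.array([1, 4])))
```

In this sketch a lower mismatch rate means the masked frames were easier to recover from context. How the count of wrongly recovered tokens maps to a final pronunciation score is defined by the paper's scoring module and is not reproduced here.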
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones.
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
arXiv Detail & Related papers (2023-03-28T01:26:00Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- WER-BERT: Automatic WER Estimation with BERT in a Balanced Ordinal Classification Paradigm [0.0]
We propose a new balanced paradigm for e-WER in a classification setting.
Within this paradigm, we also propose WER-BERT, a BERT-based architecture with speech features for e-WER.
The results and experiments demonstrate that WER-BERT establishes a new state-of-the-art in automatic WER estimation.
arXiv Detail & Related papers (2021-01-14T07:26:28Z)
- Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application [63.10266319378212]
We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT).
We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
arXiv Detail & Related papers (2020-09-22T02:15:05Z)