Transformer-Based Multi-Aspect Multi-Granularity Non-Native English
Speaker Pronunciation Assessment
- URL: http://arxiv.org/abs/2205.03432v1
- Date: Fri, 6 May 2022 18:07:44 GMT
- Title: Transformer-Based Multi-Aspect Multi-Granularity Non-Native English
Speaker Pronunciation Assessment
- Authors: Yuan Gong, Ziyi Chen, Iek-Heng Chu, Peng Chang, James Glass
- Abstract summary: We train a Goodness Of Pronunciation feature-based Transformer (GOPT) with multi-task learning.
Experiments show that GOPT achieves the best results on speechocean762 with a public automatic speech recognition (ASR) acoustic model trained on Librispeech.
- Score: 10.809349710149533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic pronunciation assessment is an important technology to help
self-directed language learners. While pronunciation quality has multiple
aspects including accuracy, fluency, completeness, and prosody, previous
efforts typically only model one aspect (e.g., accuracy) at one granularity
(e.g., at the phoneme-level). In this work, we explore modeling multi-aspect
pronunciation assessment at multiple granularities. Specifically, we train a
Goodness Of Pronunciation feature-based Transformer (GOPT) with multi-task
learning. Experiments show that GOPT achieves the best results on
speechocean762 with a public automatic speech recognition (ASR) acoustic model
trained on Librispeech.
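The Goodness Of Pronunciation (GOP) features that GOPT consumes can be sketched as a frame-averaged log-posterior ratio. The minimal formulation below assumes phone posteriors from an ASR acoustic model; the function name and smoothing constant are illustrative, not taken from the paper:

```python
import numpy as np

def gop_score(posteriors: np.ndarray, canonical: int) -> float:
    """Frame-averaged Goodness Of Pronunciation for one phone segment.

    posteriors: (T, P) array of frame-level phone posterior probabilities
                from an ASR acoustic model.
    canonical:  index of the phone the speaker intended to produce.

    Returns the mean over frames of
        log P(canonical | O_t) - max_q log P(q | O_t),
    which is 0 when the canonical phone dominates every frame and
    increasingly negative for likely mispronunciations.
    """
    log_post = np.log(posteriors + 1e-10)  # smooth to avoid log(0)
    frame_gop = log_post[:, canonical] - log_post.max(axis=1)
    return float(frame_gop.mean())
```

In a GOPT-style pipeline, one such feature vector per phone would be fed to a Transformer whose output heads regress scores at the phoneme, word, and utterance levels via multi-task learning.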
Related papers
- Multi-task Pretraining for Enhancing Interpretable L2 Pronunciation Assessment [21.12585023191302]
Automatic pronunciation assessment (APA) analyzes second-language (L2) learners' speech by providing fine-grained pronunciation feedback.
Most existing efforts on APA typically adopt segmental-level features as inputs and predict pronunciation scores at different granularities.
We introduce multi-task pretraining (MTP) for APA, a simple yet effective strategy that attempts to capture long-term temporal pronunciation cues.
arXiv Detail & Related papers (2025-09-21T02:04:52Z) - CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing [5.466034990848432]
CUPE is a lightweight model that captures key phoneme features in just 120 milliseconds.
CUPE achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages.
arXiv Detail & Related papers (2025-08-21T07:27:10Z) - Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce the Generative Pre-trained Speech Transformer (GPST).
GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z) - Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
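The VQ-VAE quantization mentioned above reduces, at inference time, to a nearest-neighbour codebook lookup over gesture frames. A minimal sketch, with an illustrative function name and a toy codebook (not the paper's learned one):

```python
import numpy as np

def quantize(frames: np.ndarray, codebook: np.ndarray):
    """Map each frame to its nearest codebook entry (VQ-VAE discretization).

    frames:   (T, D) continuous feature vectors.
    codebook: (K, D) learned code vectors.
    Returns (indices, quantized) where quantized[t] = codebook[indices[t]].
    """
    # Pairwise squared Euclidean distances, shape (T, K).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]
```

The discrete indices form the token sequence that an autoregressive model can predict, one listener gesture at a time.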
arXiv Detail & Related papers (2023-08-21T17:59:02Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - Hierarchical Pronunciation Assessment with Multi-Aspect Attention [3.6825890616838066]
We propose a Hierarchical Pronunciation Assessment with Multi-aspect Attention (HiPAMA) model.
HiPAMA hierarchically represents the granularity levels to directly capture their linguistic structures and introduces multi-aspect attention.
Remarkable improvements in the experimental results on the speechocean762 dataset demonstrate the robustness of HiPAMA.
arXiv Detail & Related papers (2022-11-15T12:49:35Z) - ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
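The random masking described above can be sketched as span masking over a frame-by-feature spectrogram. Span length, mask ratio, and mask value here are illustrative hyperparameters, not the paper's settings:

```python
import numpy as np

def mask_spans(x: np.ndarray, span: int = 10, ratio: float = 0.15,
               rng=None, mask_value: float = 0.0):
    """Randomly mask contiguous time spans of a (T, F) spectrogram.

    Returns (masked copy, boolean mask over time frames). The model is
    then trained to reconstruct the frames where mask is True.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x = x.copy()
    T = x.shape[0]
    n_spans = max(1, int(T * ratio / span))
    mask = np.zeros(T, dtype=bool)
    for _ in range(n_spans):
        start = int(rng.integers(0, max(1, T - span)))
        mask[start:start + span] = True
    x[mask] = mask_value
    return x, mask
```

The same span-masking idea can be applied to the phoneme sequence, giving the joint speech-text pretraining objective.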
arXiv Detail & Related papers (2022-11-07T13:35:16Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z) - Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158]
We propose two neural networks for personalized speech enhancement (PSE) models that achieve superior performance to the previously proposed VoiceFilter.
We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
arXiv Detail & Related papers (2021-10-18T21:21:23Z) - Many-to-Many Voice Conversion based Feature Disentanglement using
Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many to many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
arXiv Detail & Related papers (2021-07-11T13:31:16Z) - Phoneme Boundary Detection using Learnable Segmental Features [31.203969460341817]
Phoneme boundary detection plays an essential first step for a variety of speech processing applications.
We propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection.
arXiv Detail & Related papers (2020-02-11T14:03:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.