Improving Mispronunciation Detection with Wav2vec2-based Momentum
Pseudo-Labeling for Accentedness and Intelligibility Assessment
- URL: http://arxiv.org/abs/2203.15937v1
- Date: Tue, 29 Mar 2022 22:40:31 GMT
- Title: Improving Mispronunciation Detection with Wav2vec2-based Momentum
Pseudo-Labeling for Accentedness and Intelligibility Assessment
- Authors: Mu Yang, Kevin Hirschi, Stephen D. Looney, Okim Kang, John H. L.
Hansen
- Abstract summary: Current mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition.
One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech.
We leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models.
- Score: 28.76055994423364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current leading mispronunciation detection and diagnosis (MDD) systems
achieve promising performance via end-to-end phoneme recognition. One challenge
of such end-to-end solutions is the scarcity of human-annotated phonemes on
natural L2 speech. In this work, we leverage unlabeled L2 speech via a
pseudo-labeling (PL) procedure and extend the fine-tuning approach based on
pre-trained self-supervised learning (SSL) models. Specifically, we use Wav2vec
2.0 as our SSL model, and fine-tune it using original labeled L2 speech samples
plus the created pseudo-labeled L2 speech samples. Our pseudo labels are
dynamic and are produced by an ensemble of the online model on-the-fly, which
ensures that our model is robust to pseudo label noise. We show that
fine-tuning with pseudo labels gains a 5.35% phoneme error rate reduction and
2.48% MDD F1 score improvement over a labeled-samples-only fine-tuning
baseline. The proposed PL method is also shown to outperform conventional
offline PL methods. Compared to the state-of-the-art MDD systems, our MDD
solution achieves a more accurate and consistent phonetic error diagnosis. In
addition, we conduct an open test on a separate UTD-4Accents dataset, where our
system recognition outputs show a strong correlation with human perception,
based on accentedness and intelligibility.
Related papers
- Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling [21.82879779173242]
The lack of labeled data is a common challenge in speech classification tasks.
We propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method.
We evaluate our SSL framework on emotion recognition and dementia detection tasks.
arXiv Detail & Related papers (2024-09-25T13:51:19Z) - Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source
Localization [9.791311361007397]
We propose a novel method named Cross Pseudo-Labeling (XPL), wherein two models learn from each other with the cross-refine mechanism to avoid bias accumulation.
XPL significantly outperforms existing methods, achieving state-of-the-art performance while effectively mitigating confirmation bias.
arXiv Detail & Related papers (2024-03-05T16:28:48Z) - Phonological Level wav2vec2-based Mispronunciation Detection and
Diagnosis Method [11.069975459609829]
We propose a low-level Mispronunciation Detection and Diagnosis (MDD) approach based on the detection of speech attribute features.
The proposed method was applied to L2 speech corpora collected from English learners from different native languages.
arXiv Detail & Related papers (2023-11-13T02:41:41Z) - TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization [54.41494515178297]
We reformulate speaker diarization as a single-label classification problem.
We propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependency can be modeled explicitly.
Compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in terms of diarization error rates.
arXiv Detail & Related papers (2023-03-08T05:05:26Z) - SpeechBlender: Speech Augmentation Framework for Mispronunciation Data
Generation [11.91301106502376]
SpeechBlender is a fine-grained data augmentation pipeline for generating mispronunciation errors.
Our proposed technique achieves state-of-the-art results, with Speechocean762, on ASR dependent mispronunciation detection models.
arXiv Detail & Related papers (2022-11-02T07:13:30Z) - Neighborhood Collective Estimation for Noisy Label Identification and
Correction [92.20697827784426]
Learning with noisy labels (LNL) aims at designing strategies to improve model performance and generalization by mitigating the effects of model overfitting to noisy labels.
Recent advances employ the predicted label distributions of individual samples to perform noise verification and noisy label correction, easily giving rise to confirmation bias.
We propose Neighborhood Collective Estimation, in which the predictive reliability of a candidate sample is re-estimated by contrasting it against its feature-space nearest neighbors.
arXiv Detail & Related papers (2022-08-05T14:47:22Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - L2B: Learning to Bootstrap Robust Models for Combating Label Noise [52.02335367411447]
This paper introduces a simple and effective method, named Learning to Bootstrap (L2B)
It enables models to bootstrap themselves using their own predictions without being adversely affected by erroneous pseudo-labels.
It achieves this by dynamically adjusting the importance weight between real observed and generated labels, as well as between different samples through meta-learning.
arXiv Detail & Related papers (2022-02-09T05:57:08Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - A Second-Order Approach to Learning with Instance-Dependent Label Noise [58.555527517928596]
The presence of label noise often misleads the training of deep neural networks.
We show that the errors in human-annotated labels are more likely to be dependent on the difficulty levels of tasks.
arXiv Detail & Related papers (2020-12-22T06:36:58Z) - An End-to-End Mispronunciation Detection System for L2 English Speech
Leveraging Novel Anti-Phone Modeling [11.894724235336872]
Mispronunciation detection and diagnosis (MDD) is a core component of computer-assisted pronunciation training (CAPT)
We propose to conduct MDD with a novel end- to-end automatic speech recognition (E2E-based ASR) approach.
In particular, we expand the original L2 phone set with their corresponding anti-phone set, aiming to provide better mispronunciation detection and diagnosis feedback.
arXiv Detail & Related papers (2020-05-25T07:27:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.