Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition
- URL: http://arxiv.org/abs/2306.02534v1
- Date: Mon, 5 Jun 2023 01:55:33 GMT
- Title: Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition
- Authors: Jisung Wang, Haram Lee, Myungwoo Oh
- Abstract summary: This study is on the efficient incorporation of the L2 phonemes, which in this work refer to Korean phonemes, through articulatory feature analysis.
We employ the lattice-free maximum mutual information (LF-MMI) objective in an end-to-end manner, to train the acoustic model to align and predict one of multiple pronunciation candidates.
Experimental results show that the proposed method improves ASR accuracy for Korean L2 speech by training solely on L1 speech data.
- Score: 2.8360662552057323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The limited availability of non-native speech datasets presents a major
challenge in automatic speech recognition (ASR) to narrow the performance gap
between native and non-native speakers. To address this, the focus of this
study is on the efficient incorporation of the L2 phonemes, which in this work
refer to Korean phonemes, through articulatory feature analysis. This not only
enables accurate modeling of pronunciation variants but also allows for the
utilization of both native Korean and English speech datasets. We employ the
lattice-free maximum mutual information (LF-MMI) objective in an end-to-end
manner, to train the acoustic model to align and predict one of multiple
pronunciation candidates. Experimental results show that the proposed method
improves ASR accuracy for Korean L2 speech by training solely on L1 speech
data. Furthermore, fine-tuning on L2 speech improves recognition accuracy for
both L1 and L2 speech without performance trade-offs.
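As a rough illustration of the idea (not the authors' implementation), the sketch below maps a few Korean (L2) phonemes to toy articulatory feature sets and selects the closest English (L1) phonemes as pronunciation candidates. All phoneme labels, feature inventories, and the distance measure are illustrative assumptions, not the paper's actual feature analysis.

```python
# Illustrative sketch only -- not code from the paper. Toy articulatory
# feature sets are used to map L2 (Korean) phonemes to L1 (English)
# pronunciation candidates.

L1_PHONES = {  # English phonemes with toy articulatory feature sets
    "p": {"bilabial", "plosive"},
    "b": {"bilabial", "plosive", "voiced"},
    "t": {"alveolar", "plosive"},
    "d": {"alveolar", "plosive", "voiced"},
    "k": {"velar", "plosive"},
    "g": {"velar", "plosive", "voiced"},
    "m": {"bilabial", "nasal", "voiced"},
    "n": {"alveolar", "nasal", "voiced"},
}

L2_PHONES = {  # a few Korean phonemes with toy articulatory feature sets
    "ㅂ": {"bilabial", "plosive"},
    "ㅍ": {"bilabial", "plosive", "aspirated"},
    "ㄷ": {"alveolar", "plosive"},
    "ㄱ": {"velar", "plosive"},
}

def pronunciation_candidates(l2_phone: str, k: int = 2) -> list[str]:
    """Return the k nearest L1 phonemes by articulatory-feature distance."""
    feats = L2_PHONES[l2_phone]
    ranked = sorted(L1_PHONES, key=lambda p: len(feats ^ L1_PHONES[p]))
    return ranked[:k]

if __name__ == "__main__":
    for phone in L2_PHONES:
        # e.g. "ㅂ" yields ["p", "b"]: both remain as candidates, and the
        # acoustic model decides which one aligns best with the audio.
        print(phone, "->", pronunciation_candidates(phone))
```

The LF-MMI objective described in the abstract would then train the acoustic model to align and predict one of these candidates; the sketch only covers the candidate-generation step.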
Related papers
- A Pilot Study of Applying Sequence-to-Sequence Voice Conversion to Evaluate the Intelligibility of L2 Speech Using a Native Speaker's Shadowings [12.29892010056753]
An ideal form of feedback for L2 speakers should be so fine-grained that it enables them to detect and diagnose unintelligible parts of utterances.
This pilot study utilizes a unique semi-parallel dataset composed of non-native speakers' (L2) reading aloud, shadowing of native speakers (L1) and their script-shadowing utterances.
We explore the technical possibility of replicating the process of an L1 speaker's shadowing L2 speech using Voice Conversion techniques, to create a virtual shadower system.
arXiv Detail & Related papers (2024-10-03T06:24:56Z)
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment [5.624555343386606]
Second language (L2) learners can improve their pronunciation by imitating golden speech.
This study explores the hypothesis that learner-specific golden speech generated with zero-shot text-to-speech (ZS-TTS) techniques can be harnessed as an effective metric for measuring the pronunciation proficiency of L2 learners.
arXiv Detail & Related papers (2024-09-11T09:55:57Z)
- Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation [9.118302330129284]
This research optimizes two-pass cross-lingual transfer learning in low-resource languages.
We optimize phoneme vocabulary coverage by merging phonemes based on shared articulatory characteristics.
We introduce a global phoneme noise generator for realistic ASR noise during phoneme-to-grapheme training to reduce error propagation.
arXiv Detail & Related papers (2023-12-06T06:37:24Z)
- L1-aware Multilingual Mispronunciation Detection Framework [10.15106073866792]
This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representation.
An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence.
Experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen (L2-ARTIC, LATIC, and AraVoiceL2v2) and unseen (EpaDB and Speechocean762) datasets.
arXiv Detail & Related papers (2023-09-14T13:53:17Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity because corpora pairing source-language speech with target-language speech are very rare.
We propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition [31.808145263757105]
We use CycleGAN-based non-parallel voice conversion technology to forge labeled training data that is close to the test speaker's speech.
We evaluate this speaker adaptation approach on two low-resource corpora, namely, Ainu and Mboshi.
arXiv Detail & Related papers (2020-05-19T07:35:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.