Improving Automatic Speech Recognition for Non-Native English with
Transfer Learning and Language Model Decoding
- URL: http://arxiv.org/abs/2202.05209v1
- Date: Thu, 10 Feb 2022 18:13:32 GMT
- Title: Improving Automatic Speech Recognition for Non-Native English with
Transfer Learning and Language Model Decoding
- Authors: Peter Sullivan, Toshiko Shibano, Muhammad Abdul-Mageed
- Abstract summary: We investigate fine-tuning of a pre-trained wav2vec 2.0 model \cite{baevski2020wav2vec,xu2021self} under a rich set of L1 and L2 training conditions.
We find that while the large self-trained wav2vec 2.0 may be internalizing sufficient decoding knowledge for clean L1 speech, this does not hold for L2 speech.
- Score: 6.68194398006805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: ASR systems designed for native English (L1) usually underperform on
non-native English (L2). To address this performance gap, \textbf{(i)} we
extend our previous work to investigate fine-tuning of a pre-trained wav2vec
2.0 model \cite{baevski2020wav2vec,xu2021self} under a rich set of L1 and L2
training conditions. We further \textbf{(ii)} incorporate language model
decoding in the ASR system, along with the fine-tuning method. Quantifying the
gains from each of these two approaches separately, together with an error
analysis, allows us to identify different sources of improvement within our
models. We find that while the large self-trained wav2vec 2.0 may be
internalizing sufficient decoding knowledge for clean L1 speech
\cite{xu2021self}, this does not hold for L2 speech, which accounts for the
utility of employing language model decoding on L2 data.
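To make the two ingredients above concrete, the sketch below is a minimal illustration (not the authors' code) of decoding with a pre-trained wav2vec 2.0 CTC model, comparing plain greedy decoding against beam search with a KenLM n-gram language model via pyctcdecode. The checkpoint name, the audio path, the `lm.arpa` file, and the alpha/beta weights are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): greedy CTC decoding vs. n-gram LM
# beam-search decoding on top of a wav2vec 2.0 model. Checkpoint, audio path,
# KenLM ARPA file, and alpha/beta values are illustrative assumptions.
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from pyctcdecode import build_ctcdecoder

MODEL_ID = "facebook/wav2vec2-large-960h-lv60-self"  # large self-trained wav2vec 2.0
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

# Vocabulary in id order, as pyctcdecode expects; "|" is the word delimiter
# in the wav2vec 2.0 character vocabulary, so map it to a space.
vocab = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
labels = [tok.replace("|", " ") for tok, _ in vocab]

# Beam-search decoder with a KenLM n-gram LM. alpha (LM weight) and beta
# (word-insertion bonus) are hypothetical values, normally tuned on dev data;
# the ARPA file must use the same (uppercase) character casing as the model.
decoder = build_ctcdecoder(labels, kenlm_model_path="lm.arpa", alpha=0.5, beta=1.5)

speech, sr = sf.read("l2_utterance.wav")  # 16 kHz mono audio (placeholder path)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits[0]        # (time, vocab)
log_probs = torch.log_softmax(logits, dim=-1).numpy()    # pyctcdecode expects log-probs

# Baseline: greedy (argmax) CTC decoding without any language model.
greedy_text = processor.decode(torch.argmax(logits, dim=-1))

# LM decoding: beam search over the same frames with the n-gram LM.
lm_text = decoder.decode(log_probs)

print("greedy: ", greedy_text)
print("with LM:", lm_text)
```

The abstract's claim is that this LM decoding step adds little for clean L1 speech with the large self-trained model, but yields gains when the same comparison is run on L2 speech.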
Related papers
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Learning Language-Specific Layers for Multilingual Machine Translation [1.997704019887898]
We introduce Language-Specific Transformer Layers (LSLs).
LSLs allow us to increase model capacity, while keeping the amount of computation and the number of parameters used in the forward pass constant.
We study the best way to place these layers using a neural architecture search inspired approach, and achieve an improvement of 1.3 chrF (1.5 spBLEU) points over not using LSLs on a separate decoder architecture, and 1.9 chrF (2.2 spBLEU) on a shared decoder one.
arXiv Detail & Related papers (2023-05-04T09:18:05Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Language-specific Characteristic Assistance for Code-switching Speech Recognition [42.32330582682405]
The dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition.
However, existing methods impose no language constraints on the LSEs and underutilize the language-specific knowledge of LSMs.
We propose a language-specific characteristic assistance (LSCA) method to mitigate the above problems.
arXiv Detail & Related papers (2022-06-29T13:39:51Z)
- E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation [95.49128988683191]
Sequence-to-sequence (seq2seq) learning is a popular approach for large-scale pretraining of language models.
We propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2.
E2S2 improves seq2seq models by integrating more efficient self-supervised information into the encoders.
arXiv Detail & Related papers (2022-05-30T08:25:36Z)
- Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0 [7.378368959253632]
We show that a monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages.
A key finding of this work is that the adapted monolingual wav2vec-2.0 achieves similar performance as the topline multilingual XLSR model.
arXiv Detail & Related papers (2021-10-07T15:29:22Z)
- Speech Technology for Everyone: Automatic Speech Recognition for Non-Native English with Transfer Learning [0.0]
We evaluate fine-tuning of pretrained wav2vec 2.0 models on L2-ARCTIC, a non-native English speech corpus.
Our experiments demonstrate the promise of developing ASR models for non-native English speakers.
arXiv Detail & Related papers (2021-10-01T23:11:00Z)
- Regularized Training of Nearest Neighbor Language Models [10.994336081018043]
We build upon $k$NN-LM \citep{khandelwal20generalization}, which uses a pre-trained language model together with an exhaustive $k$NN search through the training data (memory bank) to achieve state-of-the-art results (a toy sketch of this interpolation appears after this list).
We find that the added L2 regularization seems to improve performance for high-frequency words without deteriorating performance for low-frequency ones.
arXiv Detail & Related papers (2021-09-16T23:20:24Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has attributed weak unsupervised translation quality to representations that are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [58.105818353866354]
We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the code-switching (CS) problem.
We use language identities to bias the model to predict the CS points.
This encourages the model to learn the language identity information directly from the transcriptions, so no additional language identification (LID) model is needed.
arXiv Detail & Related papers (2020-02-19T12:01:33Z)
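As referenced in the $k$NN-LM entry above, the toy sketch below illustrates the generic $k$NN-LM interpolation: a memory bank of (context vector, next token) pairs is searched exhaustively, and the resulting nearest-neighbor distribution is mixed with the base language model's distribution. It is only an illustration under assumed toy sizes and an assumed interpolation weight; the L2 regularization studied in the cited paper is not shown.

```python
# Toy sketch of kNN-LM interpolation (illustrative only; not the cited paper's
# implementation). A memory bank stores (context vector, next-token id) pairs
# from training data; at test time the k nearest contexts vote on the next
# token, and that vote is interpolated with the base LM's distribution.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, BANK, K, LAMBDA = 50, 16, 1000, 8, 0.25  # assumed toy hyperparameters

# Memory bank built offline: hidden states of training contexts plus the token
# that actually followed each context.
bank_keys = rng.normal(size=(BANK, DIM))
bank_next_tokens = rng.integers(0, VOCAB, size=BANK)

def knn_lm_probs(context_vec: np.ndarray, p_lm: np.ndarray) -> np.ndarray:
    """Mix the base LM distribution with a k-nearest-neighbor distribution."""
    # Exhaustive search: L2 distance from the query context to every key.
    dists = np.linalg.norm(bank_keys - context_vec, axis=1)
    nn = np.argsort(dists)[:K]
    # Softmax over negative distances gives a weight per neighbor.
    w = np.exp(-dists[nn])
    w /= w.sum()
    # Scatter neighbor weights onto the vocabulary.
    p_knn = np.zeros(VOCAB)
    np.add.at(p_knn, bank_next_tokens[nn], w)
    # Final distribution: lambda * kNN + (1 - lambda) * base LM.
    return LAMBDA * p_knn + (1.0 - LAMBDA) * p_lm

# Toy usage: a random context vector and a uniform base-LM distribution.
query = rng.normal(size=DIM)
p_lm = np.full(VOCAB, 1.0 / VOCAB)
print(knn_lm_probs(query, p_lm).sum())  # sums to ~1.0
```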