Speech Technology for Everyone: Automatic Speech Recognition for
Non-Native English with Transfer Learning
- URL: http://arxiv.org/abs/2110.00678v1
- Date: Fri, 1 Oct 2021 23:11:00 GMT
- Title: Speech Technology for Everyone: Automatic Speech Recognition for
Non-Native English with Transfer Learning
- Authors: Toshiko Shibano (1), Xinyi Zhang (1), Mia Taige Li (1), Haejin Cho
(1), Peter Sullivan (1), Muhammad Abdul-Mageed (1) ((1) University of British
Columbia)
- Abstract summary: We evaluate fine-tuning of pretrained wav2vec 2.0 models on L2-ARCTIC, a non-native English speech corpus.
Our experiments demonstrate the promise of developing ASR models for non-native English speakers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To address the performance gap of English ASR models on L2 English speakers,
we evaluate fine-tuning of pretrained wav2vec 2.0 models (Baevski et al., 2020;
Xu et al., 2021) on L2-ARCTIC, a non-native English speech corpus (Zhao et al.,
2018) under different training settings. We compare (a) models trained with a
combination of diverse accents to ones trained with only specific accents and
(b) results from different single-accent models. Our
experiments demonstrate the promise of developing ASR models for non-native
English speakers, even with small amounts of L2 training data and even without
a language model. Our models also excel in the zero-shot setting where we train
on multiple L2 datasets and test on a blind L2 test set.
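As a concrete illustration of this setup, the sketch below fine-tunes a pretrained wav2vec 2.0 checkpoint with a CTC objective via the HuggingFace transformers API. The checkpoint name, learning rate, and single-pair training step are illustrative assumptions, not the authors' exact recipe.

```python
# Minimal sketch (not the paper's exact recipe): CTC fine-tuning of a
# pretrained wav2vec 2.0 checkpoint on accented English speech.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CKPT = "facebook/wav2vec2-large-960h-lv60-self"  # self-trained model of Xu et al. (2021)
processor = Wav2Vec2Processor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT)
model.freeze_feature_encoder()  # common when fine-tuning on small amounts of speech

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(waveform, transcript):
    """One CTC training step on a single (audio, text) pair.

    transcript should use this checkpoint's character set (upper-case English).
    """
    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```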
Related papers
- Automatic Speech Recognition for the Ika Language [0.0]
We fine-tune the pretrained wav2vec 2.0 Massively Multilingual Speech (MMS) models on a high-quality speech dataset compiled from a New Testament Bible translation in Ika.
Our results show that fine-tuning multilingual pretrained models achieves a Word Error Rate (WER) of 0.5377 and a Character Error Rate (CER) of 0.2651 with just over 1 hour of training data.
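For reference, WER and CER are edit-distance rates at the word and character level. A minimal sketch of computing them with the jiwer library (an assumed tool choice; the paper does not state its tooling) on made-up transcripts:

```python
# Sketch: computing WER and CER with jiwer; the transcripts below are
# made up for illustration.
import jiwer

references = ["in the beginning was the word"]   # ground-truth transcripts
hypotheses = ["in the begining was the ward"]    # hypothetical ASR output

wer = jiwer.wer(references, hypotheses)  # word edits / reference words
cer = jiwer.cer(references, hypotheses)  # character edits / reference characters
print(f"WER = {wer:.4f}, CER = {cer:.4f}")
```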
arXiv Detail & Related papers (2024-10-01T11:56:42Z)
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; for Arabic, however, progress still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained with more data outperform monolingual ones, but that, with the amount of data held fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
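As a generic illustration of the reprogramming idea (a small trainable module rewrites the inputs of a frozen pretrained model), consider the PyTorch sketch below; the module and the stand-in encoder are invented for illustration and differ from the paper's actual auxiliary architectures.

```python
# Generic illustration of input reprogramming: only a small input-side
# module trains; the pretrained model stays frozen.
import torch
import torch.nn as nn

class Reprogrammer(nn.Module):
    """Learnable additive perturbation of the input features."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.delta = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, x):          # x: (batch, feat_dim, time)
        return x + self.delta(x)   # reprogrammed features

frozen_encoder = nn.GRU(80, 256, batch_first=True)  # stand-in for a pretrained ASR encoder
for p in frozen_encoder.parameters():
    p.requires_grad = False        # pretrained weights stay fixed

reprogrammer = Reprogrammer(feat_dim=80)
feats = torch.randn(4, 80, 200)    # fake batch of log-mel features
hidden, _ = frozen_encoder(reprogrammer(feats).transpose(1, 2))
```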
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
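A retrieval setup of this kind ultimately scores speech against images by embedding similarity. A toy sketch, with random vectors standing in for CLIP image embeddings and pooled HuBERT speech embeddings:

```python
# Toy sketch of image-speech retrieval scoring; random vectors stand in
# for real CLIP/HuBERT embeddings.
import torch
import torch.nn.functional as F

image_emb = F.normalize(torch.randn(100, 512), dim=-1)
speech_emb = F.normalize(torch.randn(100, 512), dim=-1)

scores = speech_emb @ image_emb.T                 # cosine-similarity matrix
hits = scores.argmax(dim=1) == torch.arange(100)  # correct image ranked first?
print(f"R@1 = {hits.float().mean():.2%}")
```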
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Pretraining Approaches for Spoken Language Recognition: TalTech Submission to the OLR 2021 Challenge [0.0]
The paper is based on our submission to the Oriental Language Recognition 2021 Challenge.
For the constrained track, we first trained a Conformer-based encoder-decoder model for multilingual automatic speech recognition.
For the unconstrained task, we relied on both externally available pretrained models and external data.
arXiv Detail & Related papers (2022-05-14T15:17:08Z)
- Improving Automatic Speech Recognition for Non-Native English with Transfer Learning and Language Model Decoding [6.68194398006805]
We investigate fine-tuning of a pre-trained wav2vec 2.0 model (Baevski et al., 2020; Xu et al., 2021) under a rich set of L1 and L2 training conditions.
We find that while the large self-trained wav2vec 2.0 may be internalizing sufficient decoding knowledge for clean L1 speech, this does not hold for L2 speech.
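One way to supply that missing decoding knowledge externally is to pair the CTC acoustic model with an n-gram language model at beam-search time. A minimal sketch with pyctcdecode; the vocabulary, logits, and LM file path are placeholders:

```python
# Sketch: CTC beam search with an external n-gram LM via pyctcdecode.
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = ["", " ", "a", "b", "c"]    # toy CTC vocabulary; "" is the blank
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.arpa",      # hypothetical KenLM file; omit for LM-free decoding
    alpha=0.5,                       # LM weight
    beta=1.0,                        # word-insertion bonus
)

logits = np.log(np.random.rand(50, len(labels)))  # fake (time, vocab) log-scores
print(decoder.decode(logits))
```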
arXiv Detail & Related papers (2022-02-10T18:13:32Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Scalable Multilingual Frontend for TTS [4.1203601403593275]
This paper describes progress towards making a Neural Text-to-Speech (TTS) Frontend that works for many languages and can be easily extended to new languages.
We take a Machine Translation-inspired approach to constructing the frontend, modeling both text normalization and pronunciation at the sentence level with sequence-to-sequence (S2S) models.
For our language-independent approach to pronunciation we do not use a lexicon. Instead, all pronunciations, including context-based pronunciations, are captured in the S2S model.
arXiv Detail & Related papers (2020-04-10T08:00:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.