ASR2K: Speech Recognition for Around 2000 Languages without Audio
- URL: http://arxiv.org/abs/2209.02842v1
- Date: Tue, 6 Sep 2022 22:48:29 GMT
- Title: ASR2K: Speech Recognition for Around 2000 Languages without Audio
- Authors: Xinjian Li, Florian Metze, David R Mortensen, Alan W Black, Shinji
Watanabe
- Abstract summary: We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining this pipeline with Crubadan: a large endangered languages n-gram database.
- Score: 100.41158814934802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most recent speech recognition models rely on large supervised datasets,
which are unavailable for many low-resource languages. In this work, we present
a speech recognition pipeline that does not require any audio for the target
language. The only assumption is that we have access to raw text datasets or a
set of n-gram statistics. Our speech pipeline consists of three components:
acoustic, pronunciation, and language models. Unlike the standard pipeline, our
acoustic and pronunciation models use multilingual models without any
supervision. The language model is built using n-gram statistics or the raw
text dataset. We build speech recognition for 1909 languages by combining it
with Crubadan: a large endangered languages n-gram database. Furthermore, we
test our approach on 129 languages across two datasets: Common Voice and CMU
Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset
with Crubadan statistics only and improve them to 45% CER and 69% WER when
using 10000 raw text utterances.
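To make the structure of the pipeline concrete, the sketch below shows the general idea in miniature: candidate transcriptions (which a multilingual acoustic model plus a phone-to-grapheme pronunciation model would supply) are rescored with a character n-gram language model estimated from raw text or pre-computed n-gram counts. All names and interfaces in the sketch are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of the three-stage idea (acoustic -> pronunciation -> language model).
# Every component interface here is a hypothetical placeholder, not the ASR2K code.
import math
from collections import Counter
from typing import Dict, List, Tuple


def train_char_ngram(texts: List[str], n: int = 2) -> Dict[Tuple[str, ...], Counter]:
    """Estimate simple character n-gram counts from raw text (or plug in Crubadan counts)."""
    counts: Dict[Tuple[str, ...], Counter] = {}
    for line in texts:
        chars = ["<s>"] * (n - 1) + list(line) + ["</s>"]
        for i in range(n - 1, len(chars)):
            ctx, ch = tuple(chars[i - n + 1 : i]), chars[i]
            counts.setdefault(ctx, Counter())[ch] += 1
    return counts


def lm_logprob(text: str, counts, n: int = 2, vocab_size: int = 100) -> float:
    """Add-one smoothed log-probability of a candidate transcription."""
    chars = ["<s>"] * (n - 1) + list(text) + ["</s>"]
    logp = 0.0
    for i in range(n - 1, len(chars)):
        ctx, ch = tuple(chars[i - n + 1 : i]), chars[i]
        ctx_counts = counts.get(ctx, Counter())
        logp += math.log((ctx_counts[ch] + 1) / (sum(ctx_counts.values()) + vocab_size))
    return logp


def rescore(candidates: List[Tuple[str, float]], counts, lm_weight: float = 0.5) -> str:
    """Combine an acoustic+pronunciation score with the LM score and pick the best string.

    `candidates` are (text, score) pairs that a multilingual phone recognizer plus a
    phone-to-grapheme pronunciation model would produce; both are stand-ins here.
    """
    return max(candidates, key=lambda c: c[1] + lm_weight * lm_logprob(c[0], counts))[0]


if __name__ == "__main__":
    lm_counts = train_char_ngram(["hello world", "hello there"])  # stand-in for raw text utterances
    hyps = [("hello world", -4.2), ("hallo warld", -4.0)]          # stand-in lattice hypotheses
    print(rescore(hyps, lm_counts))
```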
Related papers
- IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS [0.9092013845117769]
IndicVoices-R (IV-R) is the largest multilingual Indian TTS dataset derived from an ASR dataset.
IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS.
We release the first TTS model for all 22 official Indian languages.
arXiv Detail & Related papers (2024-09-09T06:28:47Z)
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; Arabic, however, still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages [20.25236081418051]
Zambezi Voice is an open-source multilingual speech resource for Zambian languages.
To our knowledge, this is the first multilingual speech dataset created for Zambian languages.
arXiv Detail & Related papers (2023-06-07T13:36:37Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
Its main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
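As a rough illustration of how such a released multilingual CTC model is typically used for transcription, the hedged sketch below loads an MMS checkpoint through Hugging Face transformers; the checkpoint name and processing calls are assumptions about the public release rather than details taken from the abstract.

```python
# Hedged sketch: transcribing 16 kHz audio with a multilingual wav2vec 2.0 CTC model.
# Assumes the "facebook/mms-1b-all" checkpoint and the transformers MMS integration;
# language-specific adapters can also be selected, which is omitted here for brevity.
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

processor = AutoProcessor.from_pretrained("facebook/mms-1b-all")
model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")


def transcribe(audio):
    """audio: a 1-D float array sampled at 16 kHz (e.g. loaded with torchaudio or librosa)."""
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits        # (1, frames, vocab)
    ids = torch.argmax(logits, dim=-1)         # greedy CTC decoding
    return processor.batch_decode(ids)[0]
```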
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained on more data outperform monolingual ones; however, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
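Random-projection quantization can be summarized compactly: a frozen random matrix projects each feature frame, the index of the nearest vector in a frozen random codebook becomes that frame's discrete target, and the encoder is trained to predict those targets for masked frames. The sketch below is an illustrative reconstruction under those assumptions, not USM's implementation.

```python
# Illustrative sketch of a random-projection quantizer (BEST-RQ style), not USM's code.
# A frozen random matrix projects each frame; the index of the nearest frozen codebook
# vector becomes the discrete target that the encoder must predict for masked frames.
import torch


class RandomProjectionQuantizer:
    def __init__(self, feat_dim=80, proj_dim=16, codebook_size=8192, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.proj = torch.randn(feat_dim, proj_dim, generator=g)        # frozen projection
        self.codebook = torch.nn.functional.normalize(
            torch.randn(codebook_size, proj_dim, generator=g), dim=-1)  # frozen codebook

    def __call__(self, feats):
        """feats: (time, feat_dim) log-mel frames -> (time,) integer target ids."""
        z = torch.nn.functional.normalize(feats @ self.proj, dim=-1)
        # index of the nearest codebook vector per frame (cosine similarity on unit vectors)
        return torch.argmax(z @ self.codebook.T, dim=-1)


quantizer = RandomProjectionQuantizer()
targets = quantizer(torch.randn(200, 80))  # pseudo-labels for masked-prediction pre-training
```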
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
- Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
- Bengali Common Voice Speech Dataset for Automatic Speech Recognition [0.9218853132156671]
Bengali is one of the most spoken languages in the world with over 300 million speakers globally.
Despite its popularity, research into the development of Bengali speech recognition systems is hindered by the lack of diverse open-source datasets.
We present insights obtained from the dataset and discuss key linguistic challenges that need to be addressed in future versions.
arXiv Detail & Related papers (2022-06-28T14:52:08Z)
- SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network [45.59907668722702]
We present SpeechStew, a speech recognition model that is trained on a combination of publicly available speech recognition datasets.
Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ.
We also demonstrate that SpeechStew learns powerful transfer learning representations.
arXiv Detail & Related papers (2021-04-05T20:13:36Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
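The contrastive task mentioned in the XLSR summary above can be sketched in a few lines: for each masked time step, the context vector must identify the true quantized latent among sampled distractors. The snippet below is a simplified illustration under assumed tensor shapes, not the exact wav2vec 2.0 / XLSR training objective.

```python
# Simplified sketch of the wav2vec 2.0-style contrastive objective used by XLSR.
# For each masked frame, the context vector c_t must pick out the true quantized
# latent q_t among K distractors drawn from other masked frames of the same utterance.
import torch
import torch.nn.functional as F


def contrastive_loss(context, quantized, distractor_idx, temperature=0.1):
    """context, quantized: (T, D) vectors at masked positions; distractor_idx: (T, K) indices."""
    positives = quantized                                                 # (T, D)
    negatives = quantized[distractor_idx]                                 # (T, K, D)
    candidates = torch.cat([positives.unsqueeze(1), negatives], dim=1)    # (T, K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    # the true latent is always candidate 0, so the target label is 0 for every frame
    targets = torch.zeros(context.size(0), dtype=torch.long)
    return F.cross_entropy(sims, targets)


T, D, K = 50, 256, 10
ctx, q = torch.randn(T, D), torch.randn(T, D)
neg_idx = torch.randint(0, T, (T, K))
loss = contrastive_loss(ctx, q, neg_idx)
```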