Effectiveness of Mining Audio and Text Pairs from Public Data for
Improving ASR Systems for Low-Resource Languages
- URL: http://arxiv.org/abs/2208.12666v1
- Date: Fri, 26 Aug 2022 13:37:45 GMT
- Title: Effectiveness of Mining Audio and Text Pairs from Public Data for
Improving ASR Systems for Low-Resource Languages
- Authors: Kaushal Santosh Bhogale, Abhigyan Raman, Tahir Javed, Sumanth
Doddapaneni, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
- Abstract summary: We create Shrutilipi, a dataset which contains over 6,400 hours of labelled audio across 12 Indian languages.
On average, Shrutilipi results in a 2.3x increase over publicly available labelled data.
We show that adding Shrutilipi to the training set of Wav2Vec models leads to an average decrease in WER of 5.8% for 7 languages.
- Score: 15.214673043019395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) models have become the default choice for state-of-the-art
speech recognition systems. Such models are trained on large amounts of
labelled data, which are often not available for low-resource languages.
Techniques such as self-supervised learning and transfer learning hold promise,
but have not yet been effective in training accurate models. On the other hand,
collecting labelled datasets on a diverse set of domains and speakers is very
expensive. In this work, we demonstrate an inexpensive and effective
alternative to these approaches by "mining" text and audio pairs for Indian
languages from public sources, specifically from the public archives of All
India Radio. As a key component, we adapt the Needleman-Wunsch algorithm to
align sentences with corresponding audio segments given a long audio and a PDF
of its transcript, while being robust to errors due to OCR, extraneous text,
and non-transcribed speech. We thus create Shrutilipi, a dataset which contains
over 6,400 hours of labelled audio across 12 Indian languages, totalling
4.95M sentences. On average, Shrutilipi results in a 2.3x increase over
publicly available labelled data. We establish the quality of Shrutilipi with
21 human evaluators across the 12 languages. We also establish the diversity of
Shrutilipi in terms of represented regions, speakers, and mentioned named
entities. Significantly, we show that adding Shrutilipi to the training set of
Wav2Vec models leads to an average decrease in WER of 5.8% for 7 languages on
the IndicSUPERB benchmark. For Hindi, which has the most benchmarks (7), the
average WER falls from 18.8% to 13.5%. This improvement extends to efficient
models: We show a 2.3% drop in WER for a Conformer model (10x smaller than
Wav2Vec). Finally, we demonstrate the diversity of Shrutilipi by showing that
the model trained with it is more robust to noisy input.
Related papers
- GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement [36.29371629234269]
GigaSpeech 2 is a large-scale, multi-domain, multilingual speech recognition corpus.
It comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese.
arXiv Detail & Related papers (2024-06-17T13:44:20Z)
- Multilingual self-supervised speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is the better-performing and more viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition systems for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- CLSRIL-23: Cross Lingual Speech Representations for Indic Languages [0.0]
CLSRIL-23 is a self supervised learning based model which learns cross lingual speech representations from raw audio across 23 Indic languages.
It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations.
We compare the language-wise loss during pretraining to study the effects of monolingual versus multilingual pretraining.
arXiv Detail & Related papers (2021-07-15T15:42:43Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages [16.001329145018687]
In the speech domain, wav2vec2.0 has started to show its powerful representation ability and the feasibility of ultra-low-resource speech recognition on the Librispeech corpus.
However, wav2vec2.0 has not been examined in real spoken scenarios or on languages other than English.
We apply pre-trained models to solve low-resource speech recognition tasks in various spoken languages.
arXiv Detail & Related papers (2020-12-22T15:59:44Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.