Towards Building ASR Systems for the Next Billion Users
- URL: http://arxiv.org/abs/2111.03945v1
- Date: Sat, 6 Nov 2021 19:34:33 GMT
- Title: Towards Building ASR Systems for the Next Billion Users
- Authors: Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh
Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
- Abstract summary: We make contributions towards building ASR systems for low resource languages from the Indian subcontinent.
First, we curate 17,000 hours of raw speech data for 40 Indian languages.
Using this raw speech data we pretrain several variants of wav2vec style models for 40 Indian languages.
- Score: 15.867823754118422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent methods in speech and language technology pretrain very LARGE models
which are fine-tuned for specific tasks. However, the benefits of such LARGE
models are often limited to a few resource rich languages of the world. In this
work, we make multiple contributions towards building ASR systems for low
resource languages from the Indian subcontinent. First, we curate 17,000 hours
of raw speech data for 40 Indian languages from a wide variety of domains
including education, news, technology, and finance. Second, using this raw
speech data we pretrain several variants of wav2vec style models for 40 Indian
languages. Third, we analyze the pretrained models to find key features:
codebook vectors of similar sounding phonemes are shared across languages,
representations across layers are discriminative of the language family, and
attention heads often pay attention within small local windows. Fourth, we
fine-tune this model for downstream ASR for 9 languages and obtain
state-of-the-art results on 3 public datasets, including on very low-resource
languages such as Sinhala and Nepali. Our work establishes that multilingual
pretraining is an effective strategy for building ASR systems for the
linguistically diverse speakers of the Indian subcontinent.
Related papers
- LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems [16.143694951047024]
We create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases.
We evaluate existing open-source and commercial models on LAHAJA and find their performance to be poor.
We train models using different datasets and find that our model trained on multilingual data with good speaker diversity outperforms existing models by a significant margin.
arXiv Detail & Related papers (2024-08-21T08:51:00Z) - Zero-shot Sentiment Analysis in Low-Resource Languages Using a
Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z) - Model Adaptation for ASR in low-resource Indian Languages [28.02064068964355]
Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly enabled by self-supervised learning (SSL) based acoustic models like wav2vec2 and large-scale multi-lingual training like Whisper.
A huge challenge still exists for low-resource languages where the availability of both audio and text is limited.
This is where a lot of adaptation and fine-tuning techniques can be applied to overcome the low-resource nature of the data by utilising well-resourced similar languages.
It could be the case that an abundance of acoustic data in a language reduces the need for large text-only corpora
arXiv Detail & Related papers (2023-07-16T05:25:51Z) - Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
Main ingredients are a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z) - Towards Building Text-To-Speech Systems for the Next Billion Users [18.290165216270452]
We evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages.
We train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores.
arXiv Detail & Related papers (2022-11-17T13:59:34Z) - Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded inside a well-trained teacher text model to the student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
arXiv Detail & Related papers (2022-06-25T12:36:11Z) - A Survey of Multilingual Models for Automatic Speech Recognition [6.657361001202456]
Cross-lingual transfer is an attractive solution to the problem of multilingual Automatic Speech Recognition.
Recent advances in Self Supervised Learning are opening up avenues for unlabeled speech data to be used in multilingual ASR models.
We present best practices for building multilingual models from research across diverse languages and techniques.
arXiv Detail & Related papers (2022-02-25T09:31:40Z) - Discovering Phonetic Inventories with Crosslingual Automatic Speech
Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z) - Exploring Teacher-Student Learning Approach for Multi-lingual
Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z) - Multilingual and code-switching ASR challenges for low resource Indian
languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z) - That Sounds Familiar: an Analysis of Phonetic Representations Transfer
Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.