MuRIL: Multilingual Representations for Indian Languages
- URL: http://arxiv.org/abs/2103.10730v1
- Date: Fri, 19 Mar 2021 11:06:37 GMT
- Title: MuRIL: Multilingual Representations for Indian Languages
- Authors: Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee
Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu,
Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian,
Partha Talukdar
- Abstract summary: India is a multilingual society with 1369 rationalized languages and dialects being spoken across the country.
Despite this, today's state-of-the-art multilingual systems perform suboptimally on Indian (IN) languages.
We propose MuRIL, a multilingual language model specifically built for IN languages.
- Score: 3.529875637780551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: India is a multilingual society with 1369 rationalized languages and dialects
being spoken across the country (INDIA, 2011). Of these, the 22 scheduled
languages have a staggering total of 1.17 billion speakers and 121 languages
have more than 10,000 speakers (INDIA, 2011). India also has the second-largest
(and ever-growing) digital footprint (Statista, 2020). Despite this, today's
state-of-the-art multilingual systems perform suboptimally on Indian (IN)
languages. This can be explained by the fact that multilingual language models
(LMs) are often trained on 100+ languages together, leading to a small
representation of IN languages in their vocabulary and training data.
Multilingual LMs are substantially less effective in resource-lean scenarios
(Wu and Dredze, 2020; Lauscher et al., 2020), as limited data doesn't help
capture the various nuances of a language. One also commonly observes IN
language text transliterated to Latin or code-mixed with English, especially in
informal settings (for example, on social media platforms) (Rijhwani et al.,
2017). This phenomenon is not adequately handled by current state-of-the-art
multilingual LMs. To address the aforementioned gaps, we propose MuRIL, a
multilingual LM specifically built for IN languages. MuRIL is trained
exclusively on large amounts of IN text corpora. We explicitly augment
monolingual text corpora with both translated and transliterated document
pairs, which serve as supervised cross-lingual signals in training. MuRIL
significantly outperforms multilingual BERT (mBERT) on all tasks in the
challenging cross-lingual XTREME benchmark (Hu et al., 2020). We also present
results on transliterated (native to Latin script) test sets of the chosen
datasets and demonstrate the efficacy of MuRIL in handling transliterated data.
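As a usage-oriented illustration of the abstract's claim that MuRIL handles both native-script and transliterated text, the sketch below encodes a Hindi sentence in Devanagari and its Latin transliteration with a publicly released MuRIL checkpoint and compares the resulting sentence embeddings. It assumes the Hugging Face transformers library and the "google/muril-base-cased" checkpoint published on the Hugging Face Hub; neither is named in the abstract itself, and the mean-pooling step is a common convention rather than the paper's evaluation protocol.

# Minimal sketch, assuming the transformers library and the public
# "google/muril-base-cased" checkpoint; not the paper's own evaluation code.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "google/muril-base-cased"  # assumed public checkpoint on the HF Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

sentences = [
    "भारत एक बहुभाषी देश है।",          # Hindi, native Devanagari script
    "bharat ek bahubhashi desh hai.",   # the same sentence transliterated to Latin
]

batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (2, seq_len, hidden_dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# If transliterated text is handled well, the two sentence vectors should be close.
cos = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity (native vs. transliterated): {cos.item():.3f}")

A high cosine similarity between the two vectors would reflect the kind of transliteration robustness the abstract reports on the transliterated XTREME test sets.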
Related papers
- Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language [7.289015788793582]
This work focuses on increasing technological participation for the Sámi language.
We draw the attention of the ML community towards the language modeling problem of Ultra Low Resource (ULR) languages.
We have compiled the available Sámi language resources from the web to create a clean dataset for training language models.
arXiv Detail & Related papers (2024-05-09T13:54:22Z)
- Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages [3.9018931027384056]
We present "Paramanu", a family of novel language models (LM) for Indian languages.
It covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts.
The models are pretrained on a single GPU with a context size of 1024 and range in size from 13.29 million (M) to 367.5 M parameters.
arXiv Detail & Related papers (2024-01-31T17:58:10Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Multilingual Language Model Adaptive Fine-Tuning: A Study on African Languages [19.067718464786463]
We perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent.
To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT.
Our approach is competitive with applying LAFT on individual languages while requiring significantly less disk space.
arXiv Detail & Related papers (2022-04-13T16:13:49Z)
- Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure [54.01613740115601]
We study three language properties: constituent order, composition, and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while composition is more crucial to the success of cross-lingual transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both tasks, with a WER of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection [81.85463892070085]
The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages.
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
arXiv Detail & Related papers (2020-06-20T13:24:14Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)