Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages
- URL: http://arxiv.org/abs/2305.15814v3
- Date: Thu, 26 Oct 2023 05:57:27 GMT
- Authors: Yash Madhani, Mitesh M. Khapra, Anoop Kunchukuttan
- Abstract summary: We create language identification datasets and models in all 22 Indian languages listed in the Indian constitution in both native-script and romanized text.
First, we create Bhasha-Abhijnaanam, a language identification test set for native-script as well as romanized text.
We also train IndicLID, a language identifier for all the above-mentioned languages in both native and romanized script.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We create publicly available language identification (LID) datasets and
models in all 22 Indian languages listed in the Indian constitution in both
native-script and romanized text. First, we create Bhasha-Abhijnaanam, a
language identification test set for native-script as well as romanized text
which spans all 22 Indic languages. We also train IndicLID, a language
identifier for all the above-mentioned languages in both native and romanized
script. For native-script text, it offers broader language coverage than
existing LIDs and performs competitively with or better than them. IndicLID is
the first LID for romanized text in Indian languages. Two major challenges for
romanized-text LID are the lack of training data and poor accuracy on closely
related languages; we provide simple and effective solutions to both. In
general, there has been limited work on romanized text in any language, and our
findings are relevant to other languages that need romanized language
identification. Our models are publicly available at
https://ai4bharat.iitm.ac.in/indiclid under open-source licenses. Our training
and test sets are also publicly available at
https://ai4bharat.iitm.ac.in/bhasha-abhijnaanam under open-source licenses.
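As background on the native-script side of the task: each Indic script occupies its own contiguous Unicode block, so the script of a text can be read directly off its code points, and LID pipelines commonly use such a check as a cheap first stage before a learned classifier resolves the language (a script alone is ambiguous; Devanagari, for instance, serves Hindi, Marathi, Sanskrit, and others). A minimal sketch, with block ranges taken from the Unicode standard and function names of our own, not IndicLID's API:

```python
# Map each major Indic script to its Unicode block (start, end), per the
# Unicode standard. A script narrows the language candidates; a learned
# classifier (as in IndicLID) is still needed to pick the language.
SCRIPT_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali":    (0x0980, 0x09FF),
    "Gurmukhi":   (0x0A00, 0x0A7F),
    "Gujarati":   (0x0A80, 0x0AFF),
    "Oriya":      (0x0B00, 0x0B7F),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
}

def detect_script(text: str) -> str:
    """Return the script whose block covers the most characters in `text`."""
    counts = {name: 0 for name in SCRIPT_BLOCKS}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "Latin/other"

print(detect_script("नमस्ते"))    # Devanagari
print(detect_script("வணக்கம்"))  # Tamil
print(detect_script("namaste"))  # Latin/other
```

Romanized text falls through to "Latin/other", which is exactly why a separate romanized-text LID such as IndicLID's is needed.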
Related papers
- Script-Agnostic Language Identification [21.19710835737713]
Many modern languages, such as Konkani, Kashmiri, Punjabi etc., are synchronically written in several scripts.
We propose learning script-agnostic representations using several different experimental strategies.
We find that word-level script randomization and exposure to a language written in multiple scripts is extremely valuable for downstream script-agnostic language identification.
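The word-level script randomization described above can be sketched as a training-time augmentation: each word is independently rewritten into another script with some probability. The transliteration table below is a toy stand-in (real systems use full transliteration tools), and all names are ours, not the paper's:

```python
import random

# Toy character map from a few Devanagari consonants to Latin syllables;
# purely illustrative, not a real transliteration scheme.
DEV_TO_LATIN = {"क": "ka", "म": "ma", "न": "na", "र": "ra", "ल": "la"}

def toy_romanize(word: str) -> str:
    return "".join(DEV_TO_LATIN.get(ch, ch) for ch in word)

def script_randomize(sentence: str, p: float = 0.5, seed: int = 0) -> str:
    """Independently romanize each word with probability p, so the model
    sees the same language rendered in mixed scripts during training."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        out.append(toy_romanize(word) if rng.random() < p else word)
    return " ".join(out)

print(script_randomize("नमन कमल", p=1.0))  # namana kamala
```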
arXiv Detail & Related papers (2024-06-25T19:23:42Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Machine Translation by Projecting Text into the Same Phonetic-Orthographic Space Using a Common Encoding [3.0422770070015295]
We propose an approach based on common multilingual Latin-based encodings (WX notation) that take advantage of language similarity.
We verify the proposed approach by demonstrating experiments on similar language pairs.
We also obtain improvements of up to 1 BLEU point on distant and zero-shot language pairs.
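The idea of projecting related languages into a common Latin-based space can be illustrated with toy character tables; the entries below are our own illustration and are not the actual WX notation used in the paper:

```python
# Toy mappings from Devanagari and Telugu characters to one shared Latin
# code. These few entries are illustrative; the paper uses the full WX
# notation, whose real tables are far more complete.
DEVANAGARI = {"क": "k", "म": "m", "ल": "l", "न": "n"}
TELUGU     = {"క": "k", "మ": "m", "ల": "l", "న": "n"}

def encode(word: str, table: dict) -> str:
    """Project a word into the shared Latin-based space."""
    return "".join(table.get(ch, ch) for ch in word)

# The Hindi and Telugu forms of the cognate "kamal(a)" (lotus) collapse
# onto the same string in the shared space, letting an MT model exploit
# surface similarity between related languages.
print(encode("कमल", DEVANAGARI))  # kml
print(encode("కమల", TELUGU))      # kml
```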
arXiv Detail & Related papers (2023-05-21T06:46:33Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Aksharantar: Open Indic-language Transliteration datasets and models for the Next Billion Users [32.23606056944172]
We introduce Aksharantar, the largest publicly available transliteration dataset for Indian languages created by mining from monolingual and parallel corpora.
The dataset contains 26 million transliteration pairs for 21 Indic languages from 3 language families using 12 scripts.
Aksharantar is 21 times larger than existing datasets and is the first publicly available dataset for 7 languages and 1 language family.
arXiv Detail & Related papers (2022-05-06T05:13:12Z)
- "A Passage to India": Pre-trained Word Embeddings for Indian Languages [30.607474624873014]
We use various existing approaches to create multiple word embeddings for 14 Indian languages.
We place these embeddings for all these languages in a single repository.
We release a total of 436 models using 8 different approaches.
arXiv Detail & Related papers (2021-12-27T17:31:04Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian Languages [54.6340870873525]
Cognates are words of common origin that occur across related languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18% points, in terms of F-score, for cognate detection.
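The underlying mechanism is simple: if two words from related languages have nearby vectors in a shared cross-lingual embedding space, they are flagged as cognates. A sketch with made-up vectors (real ones would come from aligned cross-lingual embeddings, and the names and threshold here are our own):

```python
import math

# Hypothetical cross-lingual embeddings; the values are invented for
# illustration. Hindi "पानी" and Marathi "पाणी" (both "water") are cognates,
# Hindi "किताब" ("book") is not.
EMB = {
    ("hi", "पानी"):  [0.90, 0.10, 0.20],
    ("mr", "पाणी"):  [0.88, 0.12, 0.19],
    ("hi", "किताब"): [0.10, 0.90, 0.30],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_cognate(w1, w2, threshold=0.9):
    """Flag a pair as cognate when its cross-lingual vectors are close."""
    return cosine(EMB[w1], EMB[w2]) >= threshold

print(is_cognate(("hi", "पानी"), ("mr", "पाणी")))   # True
print(is_cognate(("hi", "किताब"), ("mr", "पाणी")))  # False
```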
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
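For reference, the reported baselines are word error rates: WER is the word-level Levenshtein distance between hypothesis and reference, divided by the number of reference words. A self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference
    words, computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution -> 0.333...
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is common for low-resource code-switched speech.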
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset [9.478817207385472]
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages.
The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet.
arXiv Detail & Related papers (2020-07-02T14:57:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.