Processing South Asian Languages Written in the Latin Script: the
Dakshina Dataset
- URL: http://arxiv.org/abs/2007.01176v1
- Date: Thu, 2 Jul 2020 14:57:28 GMT
- Title: Processing South Asian Languages Written in the Latin Script: the
Dakshina Dataset
- Authors: Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke,
Cibu Johny, Isin Demirsahin, Keith Hall
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the Dakshina dataset, a new resource consisting of text
in both the Latin and native scripts for 12 South Asian languages. The dataset
includes, for each language: 1) native script Wikipedia text; 2) a romanization
lexicon; and 3) full sentence parallel data in both a native script of the
language and the basic Latin alphabet. We document the methods used for
preparation and selection of the Wikipedia text in each language; collection of
attested romanizations for sampled lexicons; and manual romanization of
held-out sentences from the native script collections. We additionally provide
baseline results on several tasks made possible by the dataset, including
single word transliteration, full sentence transliteration, and language
modeling of native script and romanized text.
- Keywords: romanization, transliteration, South Asian languages
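The romanization lexicon component lends itself to a simple single-word transliteration baseline: for each native-script word, return its most frequently attested romanization. A minimal sketch, assuming a tab-separated layout of (native-script word, attested romanization, attestation count); the toy data and column order here are illustrative, not the dataset's documented format:

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for a Dakshina-style romanization lexicon: tab-separated
# lines of (native-script word, attested romanization, attestation count).
lexicon_tsv = (
    "\u0939\u093f\u0902\u0926\u0940\thindi\t4\n"
    "\u0939\u093f\u0902\u0926\u0940\thindee\t1\n"
    "\u0926\u0947\u0936\tdesh\t3\n"
)

def load_lexicon(text):
    """Group attested romanizations (with counts) by native-script word."""
    lex = defaultdict(dict)
    for native, roman, count in csv.reader(io.StringIO(text), delimiter="\t"):
        lex[native][roman] = int(count)
    return lex

def best_romanization(lex, word):
    """Most-frequent-attestation baseline for single-word transliteration."""
    return max(lex[word], key=lex[word].get)

lex = load_lexicon(lexicon_tsv)
print(best_romanization(lex, "\u0939\u093f\u0902\u0926\u0940"))  # prints "hindi"
```

A real baseline would back off to a trained character-level transducer for words outside the lexicon; this sketch only covers the in-lexicon case.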
Related papers
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Script-Agnostic Language Identification [21.19710835737713]
Many modern languages, such as Konkani, Kashmiri, and Punjabi, are synchronically written in several scripts.
We propose learning script-agnostic representations using several different experimental strategies.
We find that word-level script randomization and exposure to a language written in multiple scripts is extremely valuable for downstream script-agnostic language identification.
arXiv Detail & Related papers (2024-06-25T19:23:42Z)
- Wav2Gloss: Generating Interlinear Glossed Text from Speech [78.64412090339044]
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech.
We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z)
- RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization [17.46921734622369]
Romanized text reduces token fertility by 2x-4x.
Romanized text matches or outperforms native script representation across various NLU, NLG, and MT tasks.
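Token fertility here means the average number of subword tokens a tokenizer emits per whitespace-delimited word; halving fertility roughly halves sequence length and inference cost. A minimal sketch of the metric, with a hypothetical fixed-width tokenizer standing in for a real subword model:

```python
def toy_subword_tokenize(word):
    # Hypothetical stand-in for a trained subword tokenizer:
    # split the word into fixed two-character pieces.
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def token_fertility(sentence):
    """Average number of subword tokens produced per whitespace word."""
    words = sentence.split()
    tokens = [t for w in words for t in toy_subword_tokenize(w)]
    return len(tokens) / len(words)

# "namaste" -> 4 pieces, "duniya" -> 3 pieces: 7 tokens / 2 words = 3.5
print(token_fertility("namaste duniya"))
```

With a real tokenizer, the 2x-4x reduction claimed above would show up as native-script text scoring several times higher on this metric than its romanized counterpart, since native-script characters are often poorly covered by the subword vocabulary.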
arXiv Detail & Related papers (2024-01-25T16:11:41Z)
- Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer.
arXiv Detail & Related papers (2023-07-10T06:17:33Z)
- Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages [32.5582250356516]
We create language identification datasets and models in all 22 Indian languages listed in the Indian constitution in both native-script and romanized text.
First, we create Bhasha-Abhijnaanam, a language identification test set for native-script as well as romanized text.
We also train IndicLID, a language identifier for all the above-mentioned languages in both native and romanized script.
arXiv Detail & Related papers (2023-05-25T07:53:23Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Aksharantar: Open Indic-language Transliteration datasets and models for the Next Billion Users [32.23606056944172]
We introduce Aksharantar, the largest publicly available transliteration dataset for Indian languages created by mining from monolingual and parallel corpora.
The dataset contains 26 million transliteration pairs for 21 Indic languages from 3 language families using 12 scripts.
Aksharantar is 21 times larger than existing datasets and is the first publicly available dataset for 7 languages and 1 language family.
arXiv Detail & Related papers (2022-05-06T05:13:12Z)
- Language Lexicons for Hindi-English Multilingual Text Processing [0.0]
Existing language identification techniques presume that a document contains text in one of a fixed set of languages.
Because large standard corpora for Hindi-English mixed-language processing tasks are unavailable, we propose language lexicons.
These lexicons are built by learning classifiers over transliterated Hindi and English vocabulary.
arXiv Detail & Related papers (2021-06-29T05:42:54Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
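The 66 cross-lingual datasets in the Multi-SimLex entry above follow directly from pairing its 12 languages: there are C(12, 2) = 66 unordered language pairs. A one-line check:

```python
from math import comb

# 12 monolingual datasets yield one cross-lingual dataset per language pair.
print(comb(12, 2))  # prints 66
```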
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.