First Attempt at Building Parallel Corpora for Machine Translation of
Northeast India's Very Low-Resource Languages
- URL: http://arxiv.org/abs/2312.04764v1
- Date: Fri, 8 Dec 2023 00:28:41 GMT
- Title: First Attempt at Building Parallel Corpora for Machine Translation of
Northeast India's Very Low-Resource Languages
- Authors: Atnafu Lambebo Tonja, Melkamu Mersha, Ananya Kalita, Olga Kolesnikova,
Jugal Kalita
- Abstract summary: This paper presents the creation of initial bilingual corpora for thirteen low-resource languages of India, all from Northeast India.
It provides initial benchmark neural machine translation results for these languages.
We intend to extend these corpora to include a large number of low-resource Indian languages.
- Score: 7.124736158080938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the creation of initial bilingual corpora for thirteen
very low-resource languages of India, all from Northeast India. It also
presents the results of initial translation efforts in these languages. It
creates the first-ever parallel corpora for these languages and provides
initial benchmark neural machine translation results for these languages. We
intend to extend these corpora to include a large number of low-resource Indian
languages and integrate the effort with our prior work with African and
American-Indian languages to create corpora covering a large number of
languages from across the world.
Related papers
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present a human-annotated named entity corpus of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, with a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions arising from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages [37.758476568195256]
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people.
22 of these languages are listed in the Constitution of India (referred to as scheduled languages).
arXiv Detail & Related papers (2023-05-25T17:57:43Z)
- PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India [33.31556860332746]
PMIndiaSum is a multilingual and massively parallel summarization corpus focused on languages in India.
Our corpus provides a training and testing ground covering four language families and 14 languages and, with 196 language pairs, is the largest to date.
arXiv Detail & Related papers (2023-05-15T17:41:15Z)
- NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Extremely low-resource machine translation for closely related languages [0.0]
This work focuses on closely related languages from the Uralic language family, from the Estonian and Finnish geographical regions.
We find that multilingual learning and synthetic corpora increase the translation quality in every language pair.
We show that transfer learning and fine-tuning are very effective for doing low-resource machine translation and achieve the best results.
arXiv Detail & Related papers (2021-05-27T11:27:06Z)
- A Multilingual Parallel Corpora Collection Effort for Indian Languages [43.62422999765863]
We present sentence-aligned parallel corpora across 10 Indian languages (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, and Punjabi) and English.
The corpora are compiled from online sources which have content shared across languages.
arXiv Detail & Related papers (2020-07-15T14:00:18Z)