A Multilingual Parallel Corpora Collection Effort for Indian Languages
- URL: http://arxiv.org/abs/2007.07691v1
- Date: Wed, 15 Jul 2020 14:00:18 GMT
- Title: A Multilingual Parallel Corpora Collection Effort for Indian Languages
- Authors: Shashank Siripragada, Jerin Philip, Vinay P. Namboodiri, C V Jawahar
- Abstract summary: We present sentence aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English.
The corpora are compiled from online sources which have content shared across languages.
- Score: 43.62422999765863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present sentence aligned parallel corpora across 10 Indian Languages -
Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi,
Punjabi, and English - many of which are categorized as low resource. The
corpora are compiled from online sources which have content shared across
languages. The corpora presented significantly extend present resources that
are either not large enough or are restricted to a specific domain (such as
health). We also provide a separate test corpus compiled from an independent
online source that can be independently used for validating the performance in
10 Indian languages. Alongside, we report on the methods of constructing such
corpora using tools enabled by recent advances in machine translation and
cross-lingual retrieval using deep neural network based methods.
Related papers
- First Attempt at Building Parallel Corpora for Machine Translation of Northeast India's Very Low-Resource Languages [7.124736158080938]
This paper presents the creation of initial bilingual corpora for thirteen low-resource languages of India, all from Northeast India.
It provides initial benchmark neural machine translation results for these languages.
We intend to extend these corpora to include a large number of low-resource Indian languages.
arXiv Detail & Related papers (2023-12-08T00:28:41Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- SAHAAYAK 2023 -- the Multi Domain Bilingual Parallel Corpus of Sanskrit to Hindi for Machine Translation [0.0]
The corpus contains a total of 1.5M sentence pairs between Sanskrit and Hindi.
Data from multiple domains has been incorporated into the corpus, including News, Daily conversations, Politics, History, Sport, and Ancient Indian Literature.
arXiv Detail & Related papers (2023-06-27T11:06:44Z)
- PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India [33.31556860332746]
PMIndiaSum is a multilingual and massively parallel summarization corpus focused on languages in India.
Our corpus provides a training and testing ground for four language families and 14 languages, and is the largest to date, with 196 language pairs.
arXiv Detail & Related papers (2023-05-15T17:41:15Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages [4.3857077920223295]
Samanantar is the largest publicly available parallel corpora collection for Indic languages.
The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages.
arXiv Detail & Related papers (2021-04-12T16:18:20Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both tasks, with WERs of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- PMIndia -- A Collection of Parallel Corpora of Languages of India [10.434922903332415]
We describe a new publicly available corpus (PMIndia) consisting of parallel sentences which pair 13 major languages of India with English.
The corpus includes up to 56000 sentences for each language pair.
We explain how the corpus was constructed, including an assessment of two different automatic sentence alignment methods, and present some initial NMT results on the corpus.
arXiv Detail & Related papers (2020-01-27T16:51:39Z)