Monolingual and Parallel Corpora for Kangri Low Resource Language
- URL: http://arxiv.org/abs/2103.11596v1
- Date: Mon, 22 Mar 2021 05:52:51 GMT
- Title: Monolingual and Parallel Corpora for Kangri Low Resource Language
- Authors: Shweta Chauhan, Shefali Saxena, Philemon Daniel
- Abstract summary: This paper presents a dataset for Kangri (ISO 639-3: xnr), a low-resource, endangered Himachali language listed by the United Nations Educational, Scientific and Cultural Organization (UNESCO).
The corpus contains 1,81,552 monolingual and 27,362 Hindi-Kangri parallel corpora.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In this paper we present a dataset for Kangri (ISO 639-3: xnr), a
low-resource, endangered Himachali language listed by the United Nations
Educational, Scientific and Cultural Organization (UNESCO). Compiling the
Kangri corpus was challenging because digitized resources were not available.
The corpus contains 1,81,552 monolingual and 27,362 Hindi-Kangri parallel
corpora. We share pre-trained Kangri word embeddings. We also report Bilingual
Evaluation Understudy (BLEU) and Metric for Evaluation of Translation with
Explicit ORdering (METEOR) scores for Statistical Machine Translation (SMT)
and Neural Machine Translation (NMT) on the corpus. The corpus is freely
available for non-commercial use and research. To the best of our knowledge,
this is the first corpus for a low-resource, endangered Himachali language.
The resources are available at
(https://github.com/chauhanshweta/Kangri_corpus).
Related papers
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- SAHAAYAK 2023 -- the Multi Domain Bilingual Parallel Corpus of Sanskrit to Hindi for Machine Translation [0.0]
The corpus contains a total of 1.5M sentence pairs between Sanskrit and Hindi.
Data from multiple domains has been incorporated into the corpus, including news, daily conversations, politics, history, sport, and ancient Indian literature.
arXiv Detail & Related papers (2023-06-27T11:06:44Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - Finetuning a Kalaallisut-English machine translation system using
web-crawled data [6.85316573653194]
West Greenlandic, known by native speakers as Kalaallisut, is an extremely low-resource polysynthetic language spoken by around 56,000 people in Greenland.
Here, we attempt to finetune a pretrained Kalaallisut-to-English neural machine translation (NMT) system using web-crawled pseudoparallel sentences from around 30 multilingual websites.
arXiv Detail & Related papers (2022-06-05T17:56:55Z) - How Robust is Neural Machine Translation to Language Imbalance in
Multilingual Tokenizer Training? [86.48323488619629]
We analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus.
We find that while relatively better performance is often observed when languages are more equally sampled, the downstream performance is more robust to language imbalance than we usually expected.
arXiv Detail & Related papers (2022-04-29T17:50:36Z) - ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality
Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z) - Crowdsourcing Parallel Corpus for English-Oromo Neural Machine
Translation using Community Engagement Platform [0.0]
The paper implements translation between English and Afaan Oromo, in both directions, using Neural Machine Translation.
Using a bilingual corpus of just over 40k sentence pairs that we collected, this study showed promising results.
arXiv Detail & Related papers (2021-02-15T13:22:30Z) - AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for
Indic Languages [15.425783311152117]
We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages.
We share pre-trained word embeddings trained on these corpora.
We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks.
arXiv Detail & Related papers (2020-04-30T20:21:02Z) - Linguistic Resources for Bhojpuri, Magahi and Maithili: Statistics about
them, their Similarity Estimates, and Baselines for Three Applications [0.6649753747542209]
Bhojpuri, Magahi, and Maithili are low-resource languages of the Purvanchal region of India.
We calculated some basic statistical measures for these corpora at character, word, syllable, and morpheme levels.
The results were compared with a standard Hindi corpus.
arXiv Detail & Related papers (2020-04-29T03:58:55Z) - Practical Comparable Data Collection for Low-Resource Languages via
Images [126.64069379167975]
We propose a method of curating high-quality comparable training data for low-resource languages with monolingual annotators.
Our method involves using a carefully selected set of images as a pivot between the source and target languages by getting captions for such images in both languages independently.
Human evaluations on the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all.
arXiv Detail & Related papers (2020-04-24T19:30:38Z) - Pre-training via Leveraging Assisting Languages and Data Selection for
Neural Machine Translation [49.51278300110449]
We propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the languages of interest.
A case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora.
arXiv Detail & Related papers (2020-01-23T02:47:39Z)