SAHAAYAK 2023 -- the Multi Domain Bilingual Parallel Corpus of Sanskrit
to Hindi for Machine Translation
- URL: http://arxiv.org/abs/2307.00021v1
- Date: Tue, 27 Jun 2023 11:06:44 GMT
- Title: SAHAAYAK 2023 -- the Multi Domain Bilingual Parallel Corpus of Sanskrit
to Hindi for Machine Translation
- Authors: Vishvajitsinh Bakrola and Jitendra Nasariwala
- Abstract summary: The corpus contains a total of 1.5M sentence pairs between Sanskrit and Hindi.
Data from multiple domains has been incorporated into the corpus, including News, Daily conversations, Politics, History, Sport, and Ancient Indian Literature.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This data article presents SAHAAYAK 2023, a large bilingual parallel
corpus for the low-resource language pair Sanskrit-Hindi. The corpus contains
a total of 1.5M sentence pairs between Sanskrit and Hindi. To make the corpus
broadly usable and balanced, data from multiple domains has been incorporated,
including News, Daily conversations, Politics, History, Sport, and Ancient
Indian Literature. A multifaceted approach was adopted to build a sizable
multi-domain corpus for a low-resource language like Sanskrit. Our development
approach spans from creating a small hand-crafted dataset to applying a wide
range of mining, cleaning, and verification steps. We used a three-fold mining
process: mining from machine-readable sources, mining from non-machine-readable
sources, and collation from existing corpora. After mining, a dedicated
pipeline for normalization, alignment, and corpus cleaning was developed and
applied to the corpus to make it ready for use with machine translation
algorithms.
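The abstract does not specify how the normalization and cleaning steps work, so the following is only a minimal sketch of what such a post-mining step could look like, not the authors' pipeline. The function names (`normalize`, `clean_pairs`) and the `max_ratio` threshold are hypothetical, and the specific choices (Unicode NFC normalization, whitespace collapsing, a length-ratio filter, exact-duplicate removal) are common parallel-corpus heuristics assumed here for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs, max_ratio=3.0):
    """Filter (source, target) sentence pairs.

    Drops pairs with an empty side, pairs whose character-length ratio
    exceeds max_ratio (an assumed threshold), and exact duplicates.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue  # one side is empty after normalization
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # lengths too different; likely misaligned
        if (src, tgt) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("  अहं   गच्छामि ", "मैं जाता हूँ"),
        ("अहं गच्छामि", "मैं जाता हूँ"),  # duplicate once normalized
        ("", "खाली"),                      # empty source side
    ]
    for src, tgt in clean_pairs(sample):
        print(src, "|||", tgt)
```

A character-length ratio is a crude alignment check; a real pipeline would more likely combine it with a sentence aligner or a similarity score, which the abstract mentions only at the level of "alignment".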
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation [63.88541605363555]
"Extract and Generate" (EAG) is a two-step approach to construct large-scale and high-quality multi-way aligned corpus from bilingual data.
We first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences.
We then generate the final aligned examples from the candidates with a well-trained generation model.
arXiv Detail & Related papers (2022-03-04T08:21:27Z)
- Monolingual and Parallel Corpora for Kangri Low Resource Language [0.0]
This paper presents a dataset for Kangri (ISO 639-3: xnr), a low-resource endangered Himachali language listed by the United Nations Educational, Scientific and Cultural Organization (UNESCO).
The corpus contains 1,81,552 Monolingual and 27,362 Hindi-Kangri Parallel corpora.
arXiv Detail & Related papers (2021-03-22T05:52:51Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-source languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus [0.6445605125467573]
We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani.
arXiv Detail & Related papers (2020-10-04T11:52:50Z)
- A Multilingual Parallel Corpora Collection Effort for Indian Languages [43.62422999765863]
We present sentence aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English.
The corpora are compiled from online sources which have content shared across languages.
arXiv Detail & Related papers (2020-07-15T14:00:18Z)
- Linguistic Resources for Bhojpuri, Magahi and Maithili: Statistics about them, their Similarity Estimates, and Baselines for Three Applications [0.6649753747542209]
Bhojpuri, Magahi, and Maithili are low-resource languages of the Purvanchal region of India.
We calculated some basic statistical measures for these corpora at character, word, syllable, and morpheme levels.
The results were compared with a standard Hindi corpus.
arXiv Detail & Related papers (2020-04-29T03:58:55Z)
- PMIndia -- A Collection of Parallel Corpora of Languages of India [10.434922903332415]
We describe a new publicly available corpus (PMIndia) consisting of parallel sentences which pair 13 major languages of India with English.
The corpus includes up to 56000 sentences for each language pair.
We explain how the corpus was constructed, including an assessment of two different automatic sentence alignment methods, and present some initial NMT results on the corpus.
arXiv Detail & Related papers (2020-01-27T16:51:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.