Itihasa: A large-scale corpus for Sanskrit to English translation
- URL: http://arxiv.org/abs/2106.03269v2
- Date: Tue, 8 Jun 2021 16:48:17 GMT
- Title: Itihasa: A large-scale corpus for Sanskrit to English translation
- Authors: Rahul Aralikatte, Miryam de Lhoneux, Anoop Kunchukuttan, Anders
S{\o}gaard
- Abstract summary: Itihasa is a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations.
We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances.
- Score: 9.566221218224637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work introduces Itihasa, a large-scale translation dataset containing
93,000 pairs of Sanskrit shlokas and their English translations. The shlokas
are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We
first describe the motivation behind the curation of such a dataset and follow
up with empirical analysis to bring out its nuances. We then benchmark the
performance of standard translation models on this corpus and show that even
state-of-the-art transformer architectures perform poorly, emphasizing the
complexity of the dataset.
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z) - ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation [1.8109081066789847]
Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature.
We have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics.
We present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples.
arXiv Detail & Related papers (2024-07-29T09:45:34Z) - Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z) - What's In My Big Data? [67.04525616289949]
We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora.
WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node.
Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
arXiv Detail & Related papers (2023-10-31T17:59:38Z) - SÄmayik: A Benchmark and Dataset for English-Sanskrit Translation [30.315293326789828]
S=amayik is a dataset of around 53,000 parallel English-Sanskrit sentences, written in contemporary prose.
S=amayik is curated from a diverse range of domains, including language instruction material, textual teaching pedagogy, and online tutorials.
arXiv Detail & Related papers (2023-05-23T12:32:24Z) - The Effect of Normalization for Bi-directional Amharic-English Neural
Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - Semantic and sentiment analysis of selected Bhagavad Gita translations
using BERT-based language framework [0.4125187280299248]
The Bhagavad Gita is an ancient Hindu philosophical text originally written in Sanskrit that features a conversation between Lord Krishna and Arjuna prior to the Mahabharata war.
In this paper, we compare selected translations (mostly from Sanskrit to English) of the Bhagavad Gita using semantic and sentiment analyses.
arXiv Detail & Related papers (2022-01-09T23:59:11Z) - Anubhuti -- An annotated dataset for emotional analysis of Bengali short
stories [2.3424047967193826]
Anubhuti is the first and largest text corpus for analyzing emotions expressed by writers of Bengali short stories.
We explain the data collection methods, the manual annotation process and the resulting high inter-annotator agreement.
We have verified the performance of our dataset with baseline Machine Learning and a Deep Learning model for emotion classification.
arXiv Detail & Related papers (2020-10-06T22:33:58Z) - An Augmented Translation Technique for low Resource language pair:
Sanskrit to Hindi translation [0.0]
In this work, Zero Shot Translation (ZST) is inspected for a low resource language pair.
The same architecture is tested for Sanskrit to Hindi translation for which data is sparse.
Dimensionality reduction of word embedding is performed to reduce the memory usage for data storage.
arXiv Detail & Related papers (2020-06-09T17:01:55Z) - Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact in existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.