PMIndia -- A Collection of Parallel Corpora of Languages of India
- URL: http://arxiv.org/abs/2001.09907v1
- Date: Mon, 27 Jan 2020 16:51:39 GMT
- Title: PMIndia -- A Collection of Parallel Corpora of Languages of India
- Authors: Barry Haddow and Faheem Kirefu
- Abstract summary: We describe a new publicly available corpus (PMIndia) consisting of parallel sentences which pair 13 major languages of India with English.
The corpus includes up to 56000 sentences for each language pair.
We explain how the corpus was constructed, including an assessment of two different automatic sentence alignment methods, and present some initial NMT results on the corpus.
- Score: 10.434922903332415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parallel text is required for building high-quality machine translation (MT)
systems, as well as for other multilingual NLP applications. For many South
Asian languages, such data is in short supply. In this paper, we describe a
new publicly available corpus (PMIndia) consisting of parallel sentences which
pair 13 major languages of India with English. The corpus includes up to 56000
sentences for each language pair. We explain how the corpus was constructed,
including an assessment of two different automatic sentence alignment methods,
and present some initial NMT results on the corpus.
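As a rough illustration of automatic sentence alignment (the paper assesses two specific alignment methods, which this sketch does not claim to reproduce), the snippet below pairs English and Indian-language sentences by cosine similarity of multilingual sentence embeddings. The model name, the similarity threshold, and the greedy one-to-one matching are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of embedding-based sentence alignment. The multilingual model,
# the threshold, and the greedy 1-1 matching are assumptions for illustration,
# not the alignment methods assessed in the PMIndia paper.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

def align_sentences(en_sents, indic_sents, threshold=0.7):
    """Pair each English sentence with its most similar Indic sentence,
    keeping a pair only if the cosine similarity clears the threshold."""
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    en_emb = model.encode(en_sents, convert_to_tensor=True, normalize_embeddings=True)
    in_emb = model.encode(indic_sents, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(en_emb, in_emb)  # [len(en_sents), len(indic_sents)]
    pairs, used = [], set()
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if j not in used and float(row[j]) >= threshold:
            pairs.append((en_sents[i], indic_sents[j]))
            used.add(j)
    return pairs

if __name__ == "__main__":
    en = ["The Prime Minister addressed the nation today."]
    hi = ["प्रधानमंत्री ने आज राष्ट्र को संबोधित किया।"]
    print(align_sentences(en, hi, threshold=0.5))
```

Real aligners typically add document-level constraints (monotonicity, one-to-many links) on top of such pairwise scores.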
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture CoSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits (a generic sentence-mining sketch appears after this list).
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- SAHAAYAK 2023 -- the Multi Domain Bilingual Parallel Corpus of Sanskrit to Hindi for Machine Translation [0.0]
The corpus contains a total of 1.5M sentence pairs between Sanskrit and Hindi.
Data from multiple domains has been incorporated into the corpus, including News, Daily conversations, Politics, History, Sport, and Ancient Indian Literature.
arXiv Detail & Related papers (2023-06-27T11:06:44Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India [33.31556860332746]
PMIndiaSum is a multilingual and massively parallel summarization corpus focused on languages in India.
Our corpus provides a training and testing ground for four language families and 14 languages, and is the largest to date with 196 language pairs.
arXiv Detail & Related papers (2023-05-15T17:41:15Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- Language Agnostic Multilingual Information Retrieval with Contrastive Learning [59.26316111760971]
We present an effective method to train multilingual information retrieval systems.
We leverage parallel and non-parallel corpora to improve pretrained multilingual language models.
Our model can work well even with a small number of parallel sentences (a contrastive-training sketch appears after this list).
arXiv Detail & Related papers (2022-10-12T23:53:50Z)
- JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus [31.203776611871863]
This paper creates a large parallel corpus for English-Japanese, a language pair for which only limited resources are available.
It introduces a new web-based English-Japanese parallel corpus named JParaCrawl v3.0.
Our new corpus contains more than 21 million unique parallel sentence pairs, which is more than twice as many as the previous JParaCrawl v2.0 corpus.
arXiv Detail & Related papers (2022-02-25T10:52:00Z)
- Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages [4.3857077920223295]
Samanantar is the largest publicly available parallel corpora collection for Indic languages.
The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages.
arXiv Detail & Related papers (2021-04-12T16:18:20Z)
- A Multilingual Parallel Corpora Collection Effort for Indian Languages [43.62422999765863]
We present sentence-aligned parallel corpora across 10 Indian languages (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, and Punjabi) and English.
The corpora are compiled from online sources which have content shared across languages.
arXiv Detail & Related papers (2020-07-15T14:00:18Z)
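The sentence-mining sketch referenced in the lecture-transcript entry above: margin-based scoring over multilingual sentence embeddings is a common technique for mining parallel sentences from comparable documents. This is a generic illustration on pre-computed, L2-normalised embeddings; the neighbourhood size and threshold are assumptions, and it is not the framework proposed in that paper.

```python
# Generic margin-based (ratio) scoring for parallel sentence mining;
# an illustration, not the lecture-transcript paper's exact framework.
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Return an [n_src, n_tgt] matrix of ratio-margin scores.

    Embeddings are assumed L2-normalised, so dot products are cosine
    similarities. Each similarity is divided by the mean similarity of the
    k nearest neighbours in both directions, which penalises 'hub' sentences
    that look similar to everything."""
    sim = src_emb @ tgt_emb.T                             # cosine similarities
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # [n_src]
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # [n_tgt]
    denom = (src_knn[:, None] + tgt_knn[None, :]) / 2.0
    return sim / denom

def mine_pairs(src_emb, tgt_emb, threshold=1.05, k=4):
    """Keep (src_index, tgt_index, score) triples above an assumed threshold."""
    scores = margin_scores(src_emb, tgt_emb, k)
    best_j = scores.argmax(axis=1)
    return [(i, int(j), float(scores[i, j]))
            for i, j in enumerate(best_j) if scores[i, j] >= threshold]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    def norm(x): return x / np.linalg.norm(x, axis=1, keepdims=True)
    src = norm(rng.normal(size=(8, 16)))   # stand-ins for source embeddings
    tgt = norm(rng.normal(size=(10, 16)))  # stand-ins for target embeddings
    print(mine_pairs(src, tgt, threshold=1.0))
```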
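The contrastive-training sketch referenced in the multilingual-retrieval entry: an in-batch, InfoNCE-style objective over parallel sentence pairs, which is the general idea behind training retrieval encoders on parallel corpora. The temperature and the symmetric formulation are assumptions; that paper's exact objective may differ.

```python
# Minimal in-batch contrastive (InfoNCE-style) loss over parallel sentence
# embeddings; an illustration of the general technique, not the paper's method.
import torch
import torch.nn.functional as F

def parallel_contrastive_loss(src_emb: torch.Tensor,
                              tgt_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """The i-th source sentence should score highest against its own
    translation (the i-th target) and lower against every other target in
    the batch, which acts as in-batch negatives."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature                  # [batch, batch] similarities
    labels = torch.arange(src.size(0), device=src.device)
    # symmetric: source-to-target and target-to-source directions
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2.0

if __name__ == "__main__":
    torch.manual_seed(0)
    src, tgt = torch.randn(4, 32), torch.randn(4, 32)   # stand-ins for encoder outputs
    print(float(parallel_contrastive_loss(src, tgt)))
```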
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.