PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine
Translation
- URL: http://arxiv.org/abs/2004.09447v1
- Date: Mon, 20 Apr 2020 17:04:22 GMT
- Title: PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine
Translation
- Authors: Vivek Srivastava and Mayank Singh
- Abstract summary: This paper presents a parallel corpus of 13,738 code-mixed English-Hindi sentences and their corresponding English translations.
The sentences were translated manually by annotators.
We are releasing the parallel corpus to facilitate future research opportunities in code-mixed machine translation.
- Score: 1.2301855531996841
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code-mixing is the phenomenon of using more than one language in a sentence.
It is a frequently observed pattern of communication on social media
platforms. The flexibility to use multiple languages in one text message might help
to communicate efficiently with the target audience, but it makes processing
and understanding natural language considerably more challenging.
This paper presents a parallel corpus of 13,738 code-mixed
English-Hindi sentences and their corresponding English translations. The
sentences were translated manually by annotators. We are releasing
the parallel corpus to facilitate future research in code-mixed
machine translation. The annotated corpus is available at
https://doi.org/10.5281/zenodo.3605597.
Related papers
- Question Translation Training for Better Multilingual Reasoning [108.10066378240879]
Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English.
A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training.
In this paper we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data.
arXiv Detail & Related papers (2024-01-15T16:39:10Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English-Japanese and English-Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus [31.203776611871863]
This paper creates a large parallel corpus for English-Japanese, a language pair for which only limited resources are available.
It introduces a new web-based English-Japanese parallel corpus named JParaCrawl v3.0.
Our new corpus contains more than 21 million unique parallel sentence pairs, which is more than twice as many as the previous JParaCrawl v2.0 corpus.
arXiv Detail & Related papers (2022-02-25T10:52:00Z)
- Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese using a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z)
- Crowdsourcing Parallel Corpus for English-Oromo Neural Machine Translation using Community Engagement Platform [0.0]
The paper implements translation between English and Afaan Oromo, in both directions, using Neural Machine Translation.
Using a bilingual corpus of just over 40k sentence pairs that we collected, this study showed a promising result.
arXiv Detail & Related papers (2021-02-15T13:22:30Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus [0.6445605125467573]
We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani.
arXiv Detail & Related papers (2020-10-04T11:52:50Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of available parallel data.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
- PMIndia -- A Collection of Parallel Corpora of Languages of India [10.434922903332415]
We describe a new publicly available corpus (PMIndia) consisting of parallel sentences which pair 13 major languages of India with English.
The corpus includes up to 56,000 sentences for each language pair.
We explain how the corpus was constructed, including an assessment of two different automatic sentence alignment methods, and present some initial NMT results on the corpus.
arXiv Detail & Related papers (2020-01-27T16:51:39Z)
- Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation [37.04364877980479]
We show how to mine a parallel corpus from publicly available lectures at Coursera.
Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations.
For Japanese-English lecture translation, we extracted parallel data of approximately 40,000 lines and created development and test sets.
arXiv Detail & Related papers (2019-12-26T01:12:31Z)