ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most
Diverse Translation Sample Pair
- URL: http://arxiv.org/abs/2205.04651v1
- Date: Tue, 10 May 2022 03:40:14 GMT
- Title: ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most
Diverse Translation Sample Pair
- Authors: Alham Fikri Aji, Tirana Noor Fatyanosa, Radityo Eko Prasojo, Philip
Arthur, Suci Fitriany, Salma Qonitah, Nadhifa Zulfa, Tomi Santoso, Mahendra
Data
- Abstract summary: We release our synthetic parallel paraphrase corpus across 17 languages.
Our method relies only on monolingual data and a neural machine translation system to generate paraphrases.
- Score: 8.26923056580688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We release our synthetic parallel paraphrase corpus across 17 languages:
Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi,
Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and
Chinese. Our method relies only on monolingual data and a neural machine
translation system to generate paraphrases, and is therefore simple to apply. We generate
multiple translation samples using beam search and choose the most lexically
diverse pair according to their sentence BLEU. We compare our generated corpus
with ParaBank2. According to our evaluation, our synthetic
paraphrase pairs are semantically similar and lexically diverse.
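The selection step lends itself to a short sketch: draw several beam-search translations of one source sentence and keep the pair with the lowest mutual sentence BLEU, i.e. the most lexically diverse one. The Marian checkpoint, the sample count, and the use of sacrebleu below are illustrative assumptions, not the authors' exact setup.

```python
# A minimal sketch of the diverse-pair selection, not the authors' exact
# pipeline: model choice, sample count, and BLEU implementation are assumed.
import itertools

import sacrebleu
from transformers import MarianMTModel, MarianTokenizer

tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-de-en")

def most_diverse_pair(src_sentence: str, n_samples: int = 5):
    """Translate once with beam search, return the two samples whose
    mutual sentence BLEU is lowest (the most lexically diverse pair)."""
    inputs = tok(src_sentence, return_tensors="pt")
    outputs = model.generate(
        **inputs, num_beams=n_samples, num_return_sequences=n_samples
    )
    samples = [tok.decode(o, skip_special_tokens=True) for o in outputs]
    return min(
        itertools.combinations(samples, 2),
        key=lambda pair: sacrebleu.sentence_bleu(pair[0], [pair[1]]).score,
    )
```

Because both sides of the returned pair are translations of the same source, they are presumed semantically equivalent, while the BLEU criterion pushes them apart lexically.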
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems are hypothesized to learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by attempting zero-shot translation from languages that were never seen during training.
We demonstrate that decoupled vocabulary learning enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- Improving Multi-lingual Alignment Through Soft Contrastive Learning [9.454626745893798]
We propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model.
Given translation sentence pairs, we train a multi-lingual model so that the similarity between cross-lingual embeddings follows the sentence similarity measured by the mono-lingual teacher model.
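A hedged sketch of what such a soft objective could look like in PyTorch follows; the KL formulation and the temperature are my assumptions, standing in for whatever loss the paper actually uses.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(student_src, student_tgt, teacher_src, temp=0.05):
    """student_src/student_tgt: (B, D) embeddings of the two sides of B
    translation pairs; teacher_src: (B, D) mono-lingual teacher embeddings
    of the source side. Soft targets replace the usual one-hot identity."""
    s = F.normalize(student_src, dim=-1) @ F.normalize(student_tgt, dim=-1).T
    t = F.normalize(teacher_src, dim=-1) @ F.normalize(teacher_src, dim=-1).T
    return F.kl_div(
        F.log_softmax(s / temp, dim=-1),
        F.softmax(t / temp, dim=-1),
        reduction="batchmean",
    )
```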
arXiv Detail & Related papers (2024-05-25T09:46:07Z)
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel few-shot prompting approach that decomposes the translation process into a sequence of word-chunk translations.
We show that DecoMT outperforms strong few-shot prompting with the BLOOM model by an average of 8 chrF++ points across the examined languages.
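The decomposition can be illustrated with a toy prompt builder; the chunk size, prompt template, and the llm_generate callable are hypothetical stand-ins, and this toy version translates chunks independently, whereas DecoMT additionally refines them with surrounding context.

```python
def chunks(words, size=3):
    """Split a token list into fixed-size word chunks."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def chunk_prompt(examples, src_chunk):
    """examples: few-shot (source_chunk, target_chunk) pairs."""
    shots = "\n".join(f"Source: {s}\nTarget: {t}" for s, t in examples)
    return f"{shots}\nSource: {src_chunk}\nTarget:"

def decomposed_translate(llm_generate, examples, sentence):
    # Translate chunk by chunk, then stitch the pieces back together.
    pieces = [llm_generate(chunk_prompt(examples, c))
              for c in chunks(sentence.split())]
    return " ".join(p.strip() for p in pieces)
```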
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
- Advancing Multilingual Pre-training: TRIP Triangular Document-level Pre-training for Multilingual Language Models [107.83158521848372]
We present Triangular Document-level Pre-training (TRIP), which is the first in the field to accelerate the conventional monolingual and bilingual objectives into a trilingual objective with a novel method called Grafting.
TRIP achieves several strong state-of-the-art (SOTA) scores on three multilingual document-level machine translation benchmarks and one cross-lingual abstractive summarization benchmark, including consistent improvements by up to 3.11 d-BLEU points and 8.9 ROUGE-L points.
arXiv Detail & Related papers (2022-12-15T12:14:25Z)
- Multilingual Representation Distillation with Contrastive Learning [20.715534360712425]
We integrate contrastive learning into multilingual representation distillation and use it for quality estimation of parallel sentences.
We validate our approach with multilingual similarity search and corpus filtering tasks.
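One plausible reading, sketched below under my own assumptions rather than the paper's exact formulation: train the distilled student with an in-batch InfoNCE loss over translation pairs, then score a candidate pair by cosine similarity.

```python
import torch
import torch.nn.functional as F

def infonce_loss(src_emb, tgt_emb, temp=0.05):
    """src_emb, tgt_emb: (B, D) student embeddings of B translation pairs;
    the other sentences in the batch serve as negatives."""
    logits = F.normalize(src_emb, dim=-1) @ F.normalize(tgt_emb, dim=-1).T
    labels = torch.arange(logits.size(0))
    return F.cross_entropy(logits / temp, labels)

def pair_quality(src_emb, tgt_emb):
    # Higher similarity -> more likely a genuine parallel pair.
    return F.cosine_similarity(src_emb, tgt_emb, dim=-1)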
arXiv Detail & Related papers (2022-10-10T22:27:04Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters.
We also devise a novel method to expose this knowledge by additionally fine-tuning the multilingual models, and compare the fine-tuned models against the original multilingual LMs.
We report substantial gains on standard benchmarks.
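A simple lexical probe in this spirit retrieves word translations by nearest-neighbor search over encoder representations; the checkpoint and the mean-pooling choice are illustrative, not the paper's protocol.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
enc = AutoModel.from_pretrained("xlm-roberta-base")

@torch.no_grad()
def word_vec(word: str) -> torch.Tensor:
    # Mean-pool the subword hidden states into one word vector.
    out = enc(**tok(word, return_tensors="pt")).last_hidden_state
    return out.mean(dim=1).squeeze(0)

def nearest_translation(query: str, candidates: list) -> str:
    q = word_vec(query)
    sims = torch.stack(
        [torch.cosine_similarity(q, word_vec(c), dim=0) for c in candidates]
    )
    return candidates[int(sims.argmax())]
```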
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining [38.10950540247151]
We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data.
We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM).
The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM.
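The two-stage recipe can be outlined as below; the cosine alignment loss is my assumption for illustration, with umt_translate standing in for an unsupervised MT system and xlm/tok for the pretrained model and its tokenizer.

```python
import torch.nn.functional as F

def make_synthetic_corpus(umt_translate, mono_sentences):
    # Stage 1: unsupervised MT turns monolingual text into pseudo-parallel pairs.
    return [(s, umt_translate(s)) for s in mono_sentences]

def alignment_loss(xlm, tok, src_text, tgt_text):
    # Stage 2: fine-tune XLM so the two sides of each synthetic pair align.
    # The cosine objective here is an assumed stand-in for the paper's loss.
    def embed(text):
        out = xlm(**tok(text, return_tensors="pt")).last_hidden_state
        return out.mean(dim=1)
    return 1 - F.cosine_similarity(embed(src_text), embed(tgt_text), dim=-1).mean()
```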
arXiv Detail & Related papers (2021-05-21T15:39:16Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
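A deliberately simplified PyTorch stand-in is shown below: a shared LSTM encoder feeds two output heads, one reconstructing the source and one translating it, so the shared states serve as contextualised cross-lingual word embeddings. The real model uses full decoders, and the sizes here are arbitrary.

```python
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    """Shared encoder, two objectives; the linear heads are a simplified
    stand-in for the paper's decoder."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.reconstruct = nn.Linear(dim, src_vocab)  # rebuild the input
        self.translate = nn.Linear(dim, tgt_vocab)    # emit the target

    def forward(self, src_ids):
        states, _ = self.encoder(self.embed(src_ids))
        # `states` doubles as the contextualised cross-lingual embeddings.
        return self.reconstruct(states), self.translate(states)
```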
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Paraphrase Generation as Zero-Shot Multilingual Translation: Disentangling Semantic Similarity from Lexical and Syntactic Diversity [11.564158965143418]
We introduce a simple paraphrase generation algorithm which discourages the production of n-grams that are present in the input.
Our approach enables paraphrase generation in many languages from a single multilingual NMT model.
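Off-the-shelf decoding constraints approximate this idea: Hugging Face's encoder_no_repeat_ngram_size hard-blocks any input n-gram from reappearing, whereas the paper down-weights such n-grams softly. The model below is an ordinary bilingual one chosen for illustration, while the paper uses a single multilingual NMT model in zero-shot same-language mode.

```python
from transformers import MarianMTModel, MarianTokenizer

tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")

def generate_without_input_ngrams(text: str) -> str:
    inputs = tok(text, return_tensors="pt")
    out = model.generate(
        **inputs,
        num_beams=5,
        encoder_no_repeat_ngram_size=3,  # forbid copying any input trigram
    )
    return tok.decode(out[0], skip_special_tokens=True)
```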
arXiv Detail & Related papers (2020-08-11T18:05:34Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, a UNMT system can only translate between a single language pair and cannot produce translations for multiple language pairs at the same time.
In this paper, we introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
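The usual trick behind a single shared encoder and decoder is a target-language tag prepended to the source, so one model serves all directions; the tag format below is illustrative, not necessarily the paper's.

```python
def add_language_tag(sentence: str, target_lang: str) -> str:
    # e.g. "<2fr> How are you?" asks the shared decoder to emit French.
    return f"<2{target_lang}> {sentence}"

for lang in ["fr", "de", "zh"]:
    print(add_language_tag("How are you?", lang))
```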
arXiv Detail & Related papers (2020-04-21T17:26:16Z)