Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining
- URL: http://arxiv.org/abs/2105.10419v1
- Date: Fri, 21 May 2021 15:39:16 GMT
- Title: Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining
- Authors: Ivana Kvapil{\i}kova, Mikel Artetxe, Gorka Labaka, Eneko Agirre,
Ond\v{r}ej Bojar
- Abstract summary: We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data.
We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM).
The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM.
- Score: 38.10950540247151
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing models of multilingual sentence embeddings require large parallel
data resources which are not available for low-resource languages. We propose a
novel unsupervised method to derive multilingual sentence embeddings relying
only on monolingual data. We first produce a synthetic parallel corpus using
unsupervised machine translation, and use it to fine-tune a pretrained
cross-lingual masked language model (XLM) to derive the multilingual sentence
representations. The quality of the representations is evaluated on two
parallel corpus mining tasks with improvements of up to 22 F1 points over
vanilla XLM. In addition, we observe that a single synthetic bilingual corpus
is able to improve results for other language pairs.
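The abstract names the mining tasks but not the mining procedure, so the following is a minimal sketch of how sentence embeddings from such a model are commonly used for parallel corpus mining: margin-based scoring over cosine similarities with mutual-best-match filtering. The embedding source (e.g. mean-pooled hidden states of the fine-tuned XLM), the neighbourhood size `k`, and the threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def margin_scores(src_emb, tgt_emb, k=4):
    """Ratio-margin scoring: the cosine similarity of a candidate pair divided by
    the average similarity of each sentence to its k nearest neighbours."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                                     # (n_src, n_tgt) cosine matrix
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # neighbourhood of each source sentence
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # neighbourhood of each target sentence
    return sim / (0.5 * (knn_src[:, None] + knn_tgt[None, :]))

def mine_pairs(src_emb, tgt_emb, threshold=1.05):
    """Keep mutual best matches whose margin score clears the threshold."""
    scores = margin_scores(src_emb, tgt_emb)
    best_tgt = scores.argmax(axis=1)
    best_src = scores.argmax(axis=0)
    return [(i, int(j), float(scores[i, j]))
            for i, j in enumerate(best_tgt)
            if best_src[j] == i and scores[i, j] >= threshold]
```

Margin scoring is preferred over a plain cosine threshold because it penalises "hub" sentences that are close to everything; here `src_emb` and `tgt_emb` would be the multilingual sentence representations derived from the fine-tuned model.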
Related papers
- Parallel Corpus Augmentation using Masked Language Models [2.3020018305241337]
We use a multilingual masked language model to mask and predict alternative words in context.
We use Sentence Embeddings to check and select sentence pairs which are likely to be translations of each other.
arXiv Detail & Related papers (2024-10-04T07:15:07Z)
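As a rough illustration of the two steps named in the entry above, the sketch below uses a generic multilingual masked LM to propose in-context substitutes and a multilingual sentence encoder to keep only candidates that still look like translations of the source. The checkpoints (`xlm-roberta-base`, LaBSE) and the similarity threshold are assumptions for the sketch, not the paper's exact setup.

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# Assumed model choices; the paper's exact checkpoints may differ.
filler = pipeline("fill-mask", model="xlm-roberta-base")
embedder = SentenceTransformer("sentence-transformers/LaBSE")

def augment(sentence, word, top_k=5):
    """Mask one word and let the multilingual MLM propose in-context substitutes."""
    masked = sentence.replace(word, filler.tokenizer.mask_token, 1)
    return [p["sequence"] for p in filler(masked, top_k=top_k)]

def keep_if_parallel(src, tgt, threshold=0.8):
    """Keep an augmented pair only if the embeddings still look like translations."""
    emb = embedder.encode([src, tgt], convert_to_tensor=True, normalize_embeddings=True)
    return float(util.cos_sim(emb[0], emb[1])) >= threshold

src = "Die Katze schläft auf dem Sofa."
for candidate in augment("The cat sleeps on the sofa.", "sofa"):
    if keep_if_parallel(src, candidate):
        print(candidate)
```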
- VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize the non-parallel pairs.
In addition, token-to-token alignment is integrated to bridge the gap between synonymous tokens, excavated via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
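The sequence-to-sequence alignment described in the entry above is, at its core, an in-batch contrastive objective: parallel pairs are pulled together and all other combinations in the batch are pushed apart. The sketch below shows that generic InfoNCE-style formulation; it is not VECO 2.0's exact objective, and the token-to-token term is omitted.

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(src_emb, tgt_emb, temperature=0.05):
    """In-batch contrastive loss over sentence embeddings: row i of src_emb and
    row i of tgt_emb form a parallel (positive) pair, every other row in the
    batch serves as a negative."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature                 # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0), device=src.device)
    # symmetric: align src -> tgt and tgt -> src
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```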
- Multilingual Sentence Transformer as A Multilingual Word Aligner [15.689680887384847]
We investigate whether multilingual sentence Transformer LaBSE is a strong multilingual word aligner.
Experiment results on seven language pairs show that our best aligner outperforms previous state-of-the-art models of all varieties.
Our aligner supports different language pairs in a single model, and even achieves new state-of-the-art on zero-shot language pairs that do not appear in the fine-tuning process.
arXiv Detail & Related papers (2023-01-28T09:28:55Z)
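The entry above does not say how alignments are read off the model, so the following is a generic sketch of similarity-based word alignment with a multilingual encoder: embed the tokens of both sentences and keep mutual best matches in the similarity matrix. Loading LaBSE through `transformers` and using mutual argmax are assumptions for illustration; the paper's fine-tuning and extraction procedure may differ, and subword pieces and special tokens are left unhandled here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

NAME = "sentence-transformers/LaBSE"   # assumed checkpoint, used off the shelf
tok = AutoTokenizer.from_pretrained(NAME)
enc = AutoModel.from_pretrained(NAME)

def token_states(sentence):
    """Contextual embeddings for every (sub)word piece of one sentence."""
    batch = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state[0]         # (seq_len, dim)
    return hidden, tok.convert_ids_to_tokens(batch["input_ids"][0].tolist())

def align(src_sent, tgt_sent):
    """Mutual-argmax alignment over the token similarity matrix."""
    src_h, src_toks = token_states(src_sent)
    tgt_h, tgt_toks = token_states(tgt_sent)
    sim = src_h @ tgt_h.T                                  # (src_len, tgt_len)
    fwd = sim.argmax(dim=1)                                # best target token per source token
    bwd = sim.argmax(dim=0)                                # best source token per target token
    return [(src_toks[i], tgt_toks[j])
            for i, j in enumerate(fwd.tolist()) if int(bwd[j]) == i]

print(align("The cat sleeps.", "Die Katze schläft."))
```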
- Language Agnostic Multilingual Information Retrieval with Contrastive Learning [59.26316111760971]
We present an effective method to train multilingual information retrieval systems.
We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models.
Our model can work well even with a small number of parallel sentences.
arXiv Detail & Related papers (2022-10-12T23:53:50Z)
- Exploiting Parallel Corpora to Improve Multilingual Embedding based Document and Sentence Alignment [1.5293427903448025]
This paper presents a weighting mechanism that makes use of available small-scale parallel corpora to improve the performance of multilingual sentence representations on document and sentence alignment.
Results on a newly created dataset of Sinhala-English, Tamil-English, and Sinhala-Tamil show that this new weighting mechanism significantly improves both document and sentence alignment.
arXiv Detail & Related papers (2021-06-12T13:00:10Z)
- Bilingual alignment transfers to multilingual alignment for unsupervised parallel text mining [3.4519649635864584]
This work presents methods for learning cross-lingual sentence representations using paired or unpaired bilingual texts.
We hypothesize that the cross-lingual alignment strategy is transferable, and therefore a model trained to align only two languages can encode representations that are more aligned across many languages.
arXiv Detail & Related papers (2021-04-15T17:51:22Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that cross-lingual pretraining performs poorly for distant and low-resource language pairs because their representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
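One natural reading of "using type-level cross-lingual subword embeddings" is to overwrite (or initialise) the masked LM's embedding matrix with static subword vectors that were already mapped into a shared space offline (e.g. with a VecMap-style method). The sketch below illustrates only that interpretation; the vocabulary, vectors, and helper name are hypothetical, and the paper's actual integration may differ.

```python
import torch
import torch.nn as nn

def inject_crosslingual_embeddings(embedding: nn.Embedding,
                                   vocab: dict,
                                   aligned_vectors: dict) -> None:
    """Overwrite rows of a (sub)word embedding matrix with type-level vectors
    that already live in a shared cross-lingual space (aligned offline)."""
    with torch.no_grad():
        for token, row in vocab.items():
            if token in aligned_vectors:
                embedding.weight[row] = torch.tensor(aligned_vectors[token],
                                                     dtype=embedding.weight.dtype)

# Toy usage with an 8-row, 4-dimensional embedding table.
vocab = {"_hel": 0, "lo": 1, "_kat": 2, "ze": 3}
aligned = {"_hel": [0.1, 0.2, 0.3, 0.4], "_kat": [0.5, 0.1, 0.0, 0.2]}
table = nn.Embedding(num_embeddings=8, embedding_dim=4)
inject_crosslingual_embeddings(table, vocab, aligned)
```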
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
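The distillation step described in the entry above boils down to training one student against the soft predictions of several per-language teachers. The sketch below shows a common multi-teacher formulation (average the teachers' temperature-softened distributions and minimise the student's KL divergence to that mixture); it is a generic recipe, not necessarily LBMRC's exact loss.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits, teacher_logits_list,
                                    temperature=2.0):
    """Distil several language-branch teachers into one student: average the
    teachers' softened distributions (e.g. over answer-span positions) and
    minimise the student's KL divergence to that mixture."""
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 2 examples, 5 candidate positions, 3 teachers.
student = torch.randn(2, 5, requires_grad=True)
teachers = [torch.randn(2, 5) for _ in range(3)]
loss = multi_teacher_distillation_loss(student, teachers)
loss.backward()
```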
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, UNMT can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)