Bilingual alignment transfers to multilingual alignment for unsupervised
parallel text mining
- URL: http://arxiv.org/abs/2104.07642v1
- Date: Thu, 15 Apr 2021 17:51:22 GMT
- Title: Bilingual alignment transfers to multilingual alignment for unsupervised
parallel text mining
- Authors: Chih-chan Tien, Shane Steinert-Threlkeld
- Abstract summary: This work presents methods for learning cross-lingual sentence representations using paired or unpaired bilingual texts.
We hypothesize that the cross-lingual alignment strategy is transferable, and that a model trained to align only two languages can therefore encode representations that are more aligned across many languages.
- Score: 3.4519649635864584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents methods for learning cross-lingual sentence
representations using paired or unpaired bilingual texts. We hypothesize that
the cross-lingual alignment strategy is transferable, and that a model trained
to align only two languages can therefore encode representations that are more
aligned across many languages. Such a transfer from bilingual alignment to
multilingual alignment is a dual-pivot transfer from two pivot languages to
other language pairs. To study this hypothesis, we train an unsupervised model
with unpaired sentences and another single-pair supervised model with bitexts,
both based on the unsupervised language model XLM-R. The experiments evaluate
the models as universal sentence encoders on the task of unsupervised bitext
mining on two datasets, where the unsupervised model achieves state-of-the-art
unsupervised retrieval, and the alternative single-pair supervised model
approaches the performance of multilingually supervised models. The results
suggest that the proposed bilingual training techniques can be applied to
obtain sentence representations with higher multilingual alignment.
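The evaluation treats the trained encoders as universal sentence encoders and mines bitext by nearest-neighbour retrieval: each sentence in one monolingual corpus is matched to the most similar sentence in the other by embedding similarity. Below is a minimal sketch of that retrieval step, assuming the Hugging Face transformers library, a vanilla xlm-roberta-base encoder, mean pooling, and plain cosine scoring; the model choice, pooling, scoring criterion (actual bitext-mining setups often use margin-based scoring), and example sentences are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of unsupervised bitext mining with a multilingual encoder.
# Model name, mean pooling, and cosine scoring are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(sentences):
    """Mean-pool the last hidden states into one L2-normalized vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return torch.nn.functional.normalize(emb, dim=-1)

# Hypothetical inputs: two monolingual sentence lists to be mined for pairs.
src_sents = ["The cat sits on the mat.", "Economic growth slowed last year."]
tgt_sents = ["La croissance économique a ralenti l'an dernier.",
             "Le chat est assis sur le tapis."]

src_emb, tgt_emb = embed(src_sents), embed(tgt_sents)
scores = src_emb @ tgt_emb.T                             # cosine similarity matrix
best = scores.argmax(dim=1).tolist()                     # greedy retrieval per source sentence
for i, j in enumerate(best):
    print(f"{src_sents[i]}  ->  {tgt_sents[j]}  (score={scores[i, j].item():.3f})")
```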
Related papers
- Mitigating Data Imbalance and Representation Degeneration in
Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of a multilingual neural machine translation (MNMT) model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z) - VECO 2.0: Cross-lingual Language Model Pre-training with
Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model, VECO 2.0, based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs (a generic sketch of such a contrastive alignment objective appears after this list).
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the remaining unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z) - Modeling Sequential Sentence Relation to Improve Cross-lingual Dense
Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
arXiv Detail & Related papers (2023-02-03T09:54:27Z) - Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining [38.10950540247151]
We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data.
We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM).
The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM.
arXiv Detail & Related papers (2021-05-21T15:39:16Z) - Improving the Lexical Ability of Pretrained Language Models for
Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that performance suffers when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z) - Globetrotter: Unsupervised Multilingual Translation from Visual
Alignment [24.44204156935044]
We introduce a framework that uses the visual modality to align multiple languages.
We estimate the cross-modal alignment between language and images, and use this estimate to guide the learning of cross-lingual representations.
Our language representations are trained jointly in one model with a single stage.
arXiv Detail & Related papers (2020-12-08T18:50:40Z) - Cross-lingual Spoken Language Understanding with Regularized
Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z) - InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language
Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train on these pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
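Several of the papers above (VECO 2.0's sequence-to-sequence alignment and InfoXLM's contrastive pre-training task), like the bilingual alignment training in the main paper, rely on contrastive objectives that pull parallel sentence pairs together and push non-parallel pairs apart. The sketch below shows one generic InfoNCE-style formulation with in-batch negatives; the temperature, symmetric loss, and variable names are common choices assumed for illustration rather than any one paper's exact recipe.

```python
# Generic InfoNCE-style sentence-alignment loss over a batch of parallel pairs.
# Each (src_emb[i], tgt_emb[i]) is a translation pair; all other in-batch rows
# serve as negatives. A common formulation, not a specific paper's recipe.
import torch
import torch.nn.functional as F

def alignment_loss(src_emb: torch.Tensor, tgt_emb: torch.Tensor,
                   temperature: float = 0.05) -> torch.Tensor:
    """src_emb, tgt_emb: (B, H) sentence embeddings for B parallel pairs."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature                     # (B, B) src-vs-tgt similarities
    labels = torch.arange(src.size(0), device=src.device)  # matching pairs lie on the diagonal
    # Symmetric cross-entropy: retrieve tgt given src and src given tgt.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    src = torch.randn(8, 768)
    tgt = torch.randn(8, 768)
    print(alignment_loss(src, tgt).item())
```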