Data Augmentation with Unsupervised Machine Translation Improves the
Structural Similarity of Cross-lingual Word Embeddings
- URL: http://arxiv.org/abs/2006.00262v3
- Date: Thu, 3 Jun 2021 07:00:44 GMT
- Title: Data Augmentation with Unsupervised Machine Translation Improves the
Structural Similarity of Cross-lingual Word Embeddings
- Authors: Sosuke Nishikawa, Ryokan Ri and Yoshimasa Tsuruoka
- Abstract summary: Cross-lingual word embedding methods learn a linear transformation matrix that maps one monolingual embedding space onto another.
We argue that using a pseudo-parallel corpus generated by an unsupervised machine translation model facilitates the structural similarity of the two embedding spaces.
- Score: 29.467158098595924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised cross-lingual word embedding (CLWE) methods learn a linear
transformation matrix that maps one monolingual embedding space onto another,
where each space is trained separately on its own monolingual corpus. This
method relies on the
assumption that the two embedding spaces are structurally similar, which does
not necessarily hold true in general. In this paper, we argue that using a
pseudo-parallel corpus generated by an unsupervised machine translation model
facilitates the structural similarity of the two embedding spaces and improves
the quality of CLWEs in the unsupervised mapping method. We show that our
approach outperforms alternative approaches given the same amount of
data, and, through detailed analysis, we show that data augmentation with the
pseudo data from unsupervised machine translation is especially effective for
mapping-based CLWEs because (1) the pseudo data makes the source and target
corpora (partially) parallel; (2) the pseudo data contains information on the
original language that helps to learn similar embedding spaces between the
source and target languages.
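For reference, the linear map in such mapping-based methods is commonly constrained to be orthogonal and solved as an orthogonal Procrustes problem: given source embeddings X and target embeddings Y whose rows are paired through a (seed or induced) dictionary, the W minimizing ||XW - Y||_F comes from the SVD of X^T Y. A minimal sketch with toy data (the names and dimensions are illustrative, not the paper's code):

```python
import numpy as np

def procrustes_map(X, Y):
    """Solve min_W ||X W - Y||_F over orthogonal W via SVD.

    X: (n, d) source embeddings, Y: (n, d) target embeddings,
    where row i of X and row i of Y form a translation pair.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt                        # (d, d) orthogonal mapping

# Toy check: recover a hidden rotation from five paired 4-d embeddings.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
W_true, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Y = X @ W_true
W = procrustes_map(X, Y)
print(np.allclose(X @ W, Y))             # True: the mapping is recovered
```

In the unsupervised setting, the dictionary itself is induced (for example adversarially, or from pseudo-parallel text such as the unsupervised MT output discussed above), and the Procrustes step is typically iterated as the dictionary improves.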
Related papers
- How Transliterations Improve Crosslingual Alignment [48.929677368744606]
Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives can improve crosslingual alignment.
This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance.
arXiv Detail & Related papers (2024-09-25T20:05:45Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
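As background to the entry above: optimal-transport divergences between sets of latent vectors are often computed with the Sinkhorn algorithm, where entropy-regularized OT yields a soft coupling whose expected transport cost serves as the alignment loss. A rough, generic sketch (uniform marginals; `eps` and `iters` are placeholder settings, not the paper's implementation):

```python
import numpy as np

def sinkhorn_cost(X, Y, eps=1.0, iters=200):
    """Entropy-regularized OT cost between point clouds X (n, d) and Y (m, d)."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared Euclidean costs
    K = np.exp(-C / eps)                                # Gibbs kernel
    a = np.full(len(X), 1.0 / len(X))                   # uniform source marginal
    b = np.full(len(Y), 1.0 / len(Y))                   # uniform target marginal
    v = np.ones_like(b)
    for _ in range(iters):                              # Sinkhorn fixed point
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                     # transport plan
    return (P * C).sum()                                # divergence to minimize

rng = np.random.default_rng(0)
Zx = rng.standard_normal((8, 3))    # latent vectors from one language
Zy = rng.standard_normal((10, 3))   # latent vectors from the other
print(sinkhorn_cost(Zx, Zy))        # scalar alignment cost
```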
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of multilingual neural machine translation (MNMT) models.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
- VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
Token-to-token alignment is integrated to align synonymous tokens, mined with a thesaurus dictionary, while separating them from the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
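For intuition on the entry above: sequence-level contrastive alignment of this kind is typically an InfoNCE-style objective that pulls each sentence embedding toward its parallel translation and pushes it away from the other sentences in the batch. A schematic sketch (generic contrastive alignment, not VECO 2.0's code; the temperature `tau` is a placeholder):

```python
import numpy as np

def contrastive_alignment_loss(S, T, tau=0.05):
    """InfoNCE loss over a batch of paired sentence embeddings.

    S, T: (batch, dim) source embeddings and their translations;
    row i of S and row i of T are assumed to be a parallel pair.
    """
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    logits = (S @ T.T) / tau                       # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # positives on the diagonal

rng = np.random.default_rng(0)
S = rng.standard_normal((4, 8))
T = S + 0.1 * rng.standard_normal((4, 8))   # near-parallel pairs
print(contrastive_alignment_loss(S, T))     # low loss when pairs align
```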
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo-parallel data whose source side is translated, yet at inference it is fed natural source sentences.
This source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach that simultaneously uses pseudo-parallel data (natural source, translated target) to mimic the inference scenario.
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Filtered Inner Product Projection for Crosslingual Embedding Alignment [28.72288652451881]
Filtered Inner Product Projection (FIPP) is a method for mapping embeddings to a common representation space.
FIPP is applicable even when the source and target embeddings are of differing dimensionalities.
We show that our approach outperforms existing methods on the MUSE dataset for various language pairs.
arXiv Detail & Related papers (2020-06-05T19:53:30Z)
- LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space [17.49073364781107]
We propose a novel semi-supervised method to learn cross-lingual word embeddings for bilingual lexicon induction.
Our model is independent of the isomorphic assumption and uses nonlinear mapping in the latent space of two independently trained auto-encoders.
arXiv Detail & Related papers (2020-04-28T23:28:26Z)
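As a loose illustration of the entry above (a generic sketch under strong simplifications, not LNMap's actual architecture or training objectives): each language gets its own autoencoder, and a small nonlinear network learned from seed translation pairs connects the two latent spaces, so no global isomorphism between the embedding spaces is required.

```python
import torch
import torch.nn as nn

def make_autoencoder(dim=300, latent=64):
    """A tiny encoder/decoder pair; stands in for a pretrained autoencoder."""
    enc = nn.Sequential(nn.Linear(dim, latent), nn.ReLU(), nn.Linear(latent, latent))
    dec = nn.Sequential(nn.Linear(latent, latent), nn.ReLU(), nn.Linear(latent, dim))
    return enc, dec

src_enc, _ = make_autoencoder()   # assume each autoencoder was pretrained
tgt_enc, _ = make_autoencoder()   # on its own monolingual embeddings
mapper = nn.Sequential(nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 64))

# Toy seed dictionary: rows of x and y are paired translation embeddings.
x, y = torch.randn(32, 300), torch.randn(32, 300)
with torch.no_grad():             # encoders stay fixed while the map is learned
    zx, zy = src_enc(x), tgt_enc(y)

opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)
for _ in range(200):              # fit the nonlinear map on the seed pairs
    loss = nn.functional.mse_loss(mapper(zx), zy)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```

Because the map is nonlinear and operates in latent space, structurally dissimilar embedding spaces can still be connected, which is the departure from the isomorphic assumption that the title refers to.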
arXiv Detail & Related papers (2020-04-28T23:28:26Z)
- Refinement of Unsupervised Cross-Lingual Word Embeddings [2.4366811507669124]
Cross-lingual word embeddings aim to bridge the gap between high-resource and low-resource languages.
We propose a self-supervised method to refine the alignment of unsupervised bilingual word embeddings.
arXiv Detail & Related papers (2020-02-21T10:39:53Z)