Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through
Context Anchoring
- URL: http://arxiv.org/abs/2012.15715v1
- Date: Thu, 31 Dec 2020 17:10:14 GMT
- Title: Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through
Context Anchoring
- Authors: Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko
Agirre
- Abstract summary: We propose an alternative mapping approach for word embeddings in languages other than English.
Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them.
Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.
- Score: 41.77270308094212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research on cross-lingual word embeddings has been dominated by
unsupervised mapping approaches that align monolingual embeddings. Such methods
critically rely on those embeddings having a similar structure, but it was
recently shown that the separate training in different languages causes
departures from this assumption. In this paper, we propose an alternative
approach that does not have this limitation, while requiring a weak seed
dictionary (e.g., a list of identical words) as the only form of supervision.
Rather than aligning two fixed embedding spaces, our method works by fixing the
target language embeddings, and learning a new set of embeddings for the source
language that are aligned with them. To that end, we use an extension of
skip-gram that leverages translated context words as anchor points, and
incorporates self-learning and iterative restarts to reduce the dependency on
the initial dictionary. Our approach outperforms conventional mapping methods
on bilingual lexicon induction, and obtains competitive results in the
downstream XNLI task.
Related papers
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z) - Unsupervised Alignment of Distributional Word Embeddings [0.0]
Cross-domain alignment play a key role in tasks ranging from machine translation to transfer learning.
We show that the proposed approach achieves good performance on the bilingual lexicon induction task across several language pairs.
arXiv Detail & Related papers (2022-03-09T16:39:06Z) - Combining Static Word Embeddings and Contextual Representations for
Bilingual Lexicon Induction [19.375597786174197]
We propose a simple yet effective mechanism to combine the static word embeddings and the contextual representations.
We test the combination mechanism on various language pairs under the supervised and unsupervised BLI benchmark settings.
arXiv Detail & Related papers (2021-06-06T10:31:02Z) - Bilingual Lexicon Induction via Unsupervised Bitext Construction and
Word Alignment [49.3253280592705]
We show it is possible to produce much higher quality lexicons with methods that combine bitext mining and unsupervised word alignment.
Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 $F_1$ points averaged over 12 language pairs.
arXiv Detail & Related papers (2021-01-01T03:12:42Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Multi-Adversarial Learning for Cross-Lingual Word Embeddings [19.407717032782863]
We propose a novel method for inducing cross-lingual word embeddings.
It induces the seed cross-lingual dictionary through multiple mappings, each induced to fit the mapping for one subspace.
Our experiments on unsupervised bilingual lexicon induction show that this method improves performance over previous single-mapping methods.
arXiv Detail & Related papers (2020-10-16T14:54:28Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z) - Refinement of Unsupervised Cross-Lingual Word Embeddings [2.4366811507669124]
Cross-lingual word embeddings aim to bridge the gap between high-resource and low-resource languages.
We propose a self-supervised method to refine the alignment of unsupervised bilingual word embeddings.
arXiv Detail & Related papers (2020-02-21T10:39:53Z) - Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves crosslingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.