Word Embedding Transformation for Robust Unsupervised Bilingual Lexicon Induction
- URL: http://arxiv.org/abs/2105.12297v1
- Date: Wed, 26 May 2021 02:09:58 GMT
- Title: Word Embedding Transformation for Robust Unsupervised Bilingual Lexicon Induction
- Authors: Hailong Cao and Tiejun Zhao
- Abstract summary: We propose a transformation-based method to increase the isomorphism of embeddings of two languages.
Our approach can achieve competitive or superior performance compared to state-of-the-art methods.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Great progress has been made in unsupervised bilingual lexicon induction
(UBLI) by aligning the source and target word embeddings independently trained
on monolingual corpora. The common assumption of most UBLI models is that the
embedding spaces of two languages are approximately isomorphic. Performance is
therefore bounded by the degree of isomorphism, especially for etymologically
and typologically distant languages. To address this problem, we propose a
transformation-based method to increase the isomorphism. Embeddings of two
languages are made to match with each other by rotating and scaling. The method
does not require any form of supervision and can be applied to any language
pair. On a benchmark data set of bilingual lexicon induction, our approach can
achieve competitive or superior performance compared to state-of-the-art
methods, with particularly strong results on distant languages.
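To make the rotate-and-scale idea concrete, here is a minimal sketch of one plausible reading of the abstract: rotate each monolingual space onto its own principal axes and rescale each axis to unit variance, independently per language and with no supervision. The function name and details are illustrative, not the authors' released code.

```python
import numpy as np

def rotate_and_scale(X):
    """Rotate an embedding matrix onto its principal axes and rescale
    each axis to unit variance. Applied independently per language,
    this needs no supervision and tends to make the two spaces more
    nearly isomorphic before a standard UBLI aligner is run."""
    X = X - X.mean(axis=0)                       # center the space
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    X_rot = X @ Vt.T                             # rotation
    return X_rot / (S / np.sqrt(len(X)) + 1e-9)  # per-axis scaling

# transform both monolingual spaces independently
rng = np.random.default_rng(0)
src = rotate_and_scale(rng.normal(size=(5000, 300)))
tgt = rotate_and_scale(rng.normal(size=(4000, 300)))
```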
Related papers
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
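A rough sketch of the prompt-retrieval idea described in the XLM-P summary: score a pool of prompts against an instance representation and return a soft mixture as guidance. The names (prompt_keys, prompt_values) and the attention-style scoring are assumptions for illustration, not XLM-P's actual design.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def retrieve_prompt(instance_vec, prompt_keys, prompt_values):
    """Score a pool of prompts against one instance representation and
    return a softly mixed prompt to prepend to the encoder input.
    prompt_keys: (P, d); prompt_values: (P, L, d) -- P prompts of
    length L in a d-dimensional model space."""
    scores = prompt_keys @ instance_vec                  # (P,)
    w = softmax(scores / np.sqrt(instance_vec.shape[0]))
    return np.einsum('p,pld->ld', w, prompt_values)      # (L, d)

rng = np.random.default_rng(0)
mixed = retrieve_prompt(rng.normal(size=64),
                        rng.normal(size=(8, 64)),
                        rng.normal(size=(8, 4, 64)))
```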
- Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment [63.0407314271459]
Experiments show that the proposed Cross-Align achieves state-of-the-art (SOTA) performance on four out of five language pairs.
arXiv Detail & Related papers (2022-10-09T02:24:35Z)
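Cross-Align's architecture is not shown here; the sketch below illustrates only the generic final step such word aligners share, assuming contextual embeddings for both sentences are already available: bidirectional argmax over a cosine similarity matrix.

```python
import numpy as np

def extract_alignments(src_vecs, tgt_vecs):
    """Bidirectional-argmax alignment over a cosine similarity matrix
    of contextual word embeddings (m source words vs. n target words);
    a link (i, j) is kept only when the choice is mutual."""
    S = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    T = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = S @ T.T                      # (m, n) cosine similarities
    fwd = sim.argmax(axis=1)           # best target per source word
    bwd = sim.argmax(axis=0)           # best source per target word
    return [(i, int(fwd[i])) for i in range(len(fwd)) if bwd[fwd[i]] == i]
```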
- Robust Unsupervised Cross-Lingual Word Embedding using Domain Flow Interpolation [48.32604585839687]
Previous adversarial approaches have shown promising results in inducing cross-lingual word embeddings without parallel data.
We propose to make use of a sequence of intermediate spaces for smooth bridging.
arXiv Detail & Related papers (2022-10-07T04:37:47Z)
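One simple way to realize a sequence of intermediate spaces, assuming plain convex mixing of mapped-source and target samples (the paper's construction may well differ):

```python
import numpy as np

def intermediate_batch(src_mapped, tgt, lam, rng, size=256):
    """Sample a batch from an interpolated domain lying a fraction
    `lam` of the way from the mapped source space to the target space
    (lam=0: pure source, lam=1: pure target). Training the adversarial
    mapper against a schedule of such domains bridges the two spaces
    gradually instead of in one large jump."""
    i = rng.integers(0, len(src_mapped), size=size)
    j = rng.integers(0, len(tgt), size=size)
    return (1.0 - lam) * src_mapped[i] + lam * tgt[j]

rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(1000, 50)), rng.normal(size=(1200, 50))
for lam in np.linspace(0.0, 1.0, 10):
    batch = intermediate_batch(src, tgt, lam, rng)  # feed the discriminator
```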
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that unsupervised NMT underperforms when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
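A hedged sketch of one way to inject type-level cross-lingual subword embeddings into pretraining: initialize the shared embedding matrix of a bilingual masked LM from a pretrained cross-lingual table so that translation pairs start out close together. The fallback scale and names are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def init_embeddings(vocab, crosslingual_vecs, dim, rng):
    """Initialize a bilingual masked LM's embedding matrix from
    type-level cross-lingual subword embeddings; subwords missing
    from the pretrained table fall back to small random vectors."""
    E = rng.normal(scale=0.02, size=(len(vocab), dim))
    for idx, subword in enumerate(vocab):
        if subword in crosslingual_vecs:
            E[idx] = crosslingual_vecs[subword]
    return E

rng = np.random.default_rng(0)
table = {"_the": rng.normal(size=32), "_le": rng.normal(size=32)}
E = init_embeddings(["_the", "_le", "_zz"], table, 32, rng)
```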
- Multi-Adversarial Learning for Cross-Lingual Word Embeddings [19.407717032782863]
We propose a novel method for inducing cross-lingual word embeddings.
It induces the seed cross-lingual dictionary through multiple mappings, each fitted to one subspace of the embedding space.
Our experiments on unsupervised bilingual lexicon induction show that this method improves performance over previous single-mapping methods.
arXiv Detail & Related papers (2020-10-16T14:54:28Z)
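A sketch of the multiple-mappings idea under one concrete assumption: subspaces come from k-means clusters of the source space, and each subspace gets its own linear map (trained adversarially in the paper; identity placeholders stand in here).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_src = rng.normal(size=(1000, 50))   # toy source embeddings

# carve the source space into K subspaces
K = 4
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X_src)

# one linear map per subspace; in the paper each map is trained with
# its own adversarial objective -- identity placeholders stand in here
mappings = [np.eye(X_src.shape[1]) for _ in range(K)]

X_mapped = np.empty_like(X_src)
for c, W in enumerate(mappings):
    X_mapped[labels == c] = X_src[labels == c] @ W
# a seed dictionary is then induced from nearest neighbors of X_mapped
```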
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
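Approach (ii), removing language-specific means and variances, is straightforward to state in code; this sketch standardizes each language's embeddings with that language's own statistics.

```python
import numpy as np

def standardize_per_language(embs_by_lang):
    """Remove language-specific means and variances: each language's
    embeddings are standardized with that language's own statistics,
    stripping per-language offsets that hurt cross-lingual
    comparability."""
    return {lang: (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
            for lang, X in embs_by_lang.items()}

rng = np.random.default_rng(0)
out = standardize_per_language({"en": rng.normal(2.0, 1.5, (100, 16)),
                                "ta": rng.normal(-1.0, 0.5, (80, 16))})
```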
- LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space [17.49073364781107]
We propose a novel semi-supervised method to learn cross-lingual word embeddings for bilingual lexicon induction.
Our model is independent of the isomorphic assumption and uses nonlinear mapping in the latent space of two independently trained auto-encoders.
arXiv Detail & Related papers (2020-04-28T23:28:26Z)
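An architecture-level sketch of the LNMap idea in PyTorch, assuming simple one-layer autoencoders and a two-layer latent mapper; the dimensions and the training procedure (reconstruction plus mapping losses) are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

class LatentMapper(nn.Module):
    """Two per-language autoencoders plus a small non-linear map
    between their latent spaces, so no isomorphism of the original
    embedding spaces is assumed."""
    def __init__(self, dim=300, latent=200):
        super().__init__()
        self.enc_src = nn.Sequential(nn.Linear(dim, latent), nn.Tanh())
        self.dec_src = nn.Linear(latent, dim)
        self.enc_tgt = nn.Sequential(nn.Linear(dim, latent), nn.Tanh())
        self.dec_tgt = nn.Linear(latent, dim)
        self.map_latent = nn.Sequential(            # non-linear mapping
            nn.Linear(latent, latent), nn.ReLU(), nn.Linear(latent, latent))

    def translate(self, x_src):
        # source embedding -> source latent -> target latent -> target space
        return self.dec_tgt(self.map_latent(self.enc_src(x_src)))
```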
- Refinement of Unsupervised Cross-Lingual Word Embeddings [2.4366811507669124]
Cross-lingual word embeddings aim to bridge the gap between high-resource and low-resource languages.
We propose a self-supervised method to refine the alignment of unsupervised bilingual word embeddings.
arXiv Detail & Related papers (2020-02-21T10:39:53Z)
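The paper's refinement procedure differs in detail, but the standard self-learning loop this line of work builds on looks like this: induce a pseudo-dictionary from mutual nearest neighbors, re-fit an orthogonal map by Procrustes, and repeat.

```python
import numpy as np

def refine(X, Y, iters=5):
    """Self-learning refinement: alternately (1) induce a pseudo-
    dictionary from mutual nearest neighbors and (2) re-fit an
    orthogonal map on it via Procrustes. X, Y are length-normalized
    embedding matrices, X already roughly mapped into Y's space."""
    W = np.eye(X.shape[1])
    for _ in range(iters):
        sim = (X @ W) @ Y.T
        fwd, bwd = sim.argmax(axis=1), sim.argmax(axis=0)
        pairs = [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]
        src = np.array([i for i, _ in pairs])
        tgt = np.array([j for _, j in pairs])
        U, _, Vt = np.linalg.svd(X[src].T @ Y[tgt])
        W = U @ Vt                  # orthogonal Procrustes solution
    return W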
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
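The bilingual CBOW extension can be sketched by how its training pairs are built from sentence-aligned corpora: each word is predicted from its monolingual window and also from the words of the aligned sentence. The function and details are illustrative assumptions, not the paper's exact objective.

```python
def bilingual_cbow_pairs(src_sent, tgt_sent, window=2):
    """Build (context, predicted word) pairs for a bilingual CBOW
    extension over sentence-aligned corpora: each source word is
    predicted both from its own monolingual window and from the words
    of the aligned target sentence, tying both vocabularies into one
    shared space."""
    pairs = []
    for i, w in enumerate(src_sent):
        ctx = src_sent[max(0, i - window):i] + src_sent[i + 1:i + 1 + window]
        pairs.append((ctx, w))             # standard monolingual pair
        pairs.append((list(tgt_sent), w))  # cross-lingual pair
    return pairs

pairs = bilingual_cbow_pairs(["the", "black", "cat"], ["le", "chat", "noir"])
```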