Semi-Supervised Learning for Bilingual Lexicon Induction
- URL: http://arxiv.org/abs/2402.07028v1
- Date: Sat, 10 Feb 2024 19:27:22 GMT
- Title: Semi-Supervised Learning for Bilingual Lexicon Induction
- Authors: Paul Garnier and Gauthier Guinet
- Abstract summary: We consider the problem of aligning two sets of continuous word representations, corresponding to two languages, to a common space in order to infer a bilingual lexicon.
Our experiments on standard benchmarks, inferring dictionaries from English to more than 20 languages, show that our approach consistently outperforms the existing state of the art.
- Score: 1.8130068086063336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of aligning two sets of continuous word
representations, corresponding to two languages, to a common space in order to
infer a bilingual lexicon. It was recently shown that such a lexicon can be
inferred without any parallel data, by aligning word embeddings trained on
monolingual data; this line of work is called unsupervised bilingual lexicon
induction. Wondering whether there is something to gain from progressively
learning several languages, we asked to what extent the knowledge of a given
set of languages can be leveraged when learning a new one, without parallel
data for the latter. In other words, while the final step remains unsupervised,
we allow access to corpora of other languages, hence the name semi-supervised.
This led us to a novel formulation that casts lexicon induction as a ranking
problem, for which we use recent tools from that area of machine learning. Our
experiments on standard benchmarks, inferring dictionaries from English to more
than 20 languages, show that our approach consistently outperforms the existing
state of the art. In addition, this new scenario yields several conclusions
that allow a better understanding of the alignment phenomenon.
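To make the ranking formulation concrete, here is a minimal sketch of the kind of pipeline the abstract describes, assuming the standard orthogonal-Procrustes alignment and CSLS retrieval from the bilingual lexicon induction literature; the paper's actual ranking machinery is richer, and the function names and toy data below are illustrative only.

```python
import numpy as np

def procrustes(X, Y):
    """Closed-form solution of min_W ||XW - Y||_F over orthogonal W."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def csls(XW, Y, k=10):
    """CSLS scores (Conneau et al., 2018): cosine corrected for hubness."""
    XW = XW / np.linalg.norm(XW, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sims = XW @ Y.T                                # (n_src, n_tgt) cosines
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sims - r_src - r_tgt                # higher = better candidate

# Toy check: recover a planted rotation from 200 seed translation pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                     # source embeddings
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))     # hidden ground-truth rotation
Y = X @ Q                                          # matching target embeddings
W = procrustes(X, Y)
ranks = np.argsort(-csls(X @ W, Y), axis=1)        # ranked candidates per word
print((ranks[:, 0] == np.arange(200)).mean())      # ~1.0: gold pair ranks first
```

Treating induction as ranking, as the abstract proposes, makes the whole score matrix the object of interest rather than only its argmax; CSLS is one widely used such score because it penalises hub words that are close to everything.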
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems are hypothesised to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by translating zero-shot from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z) - How Lexical is Bilingual Lexicon Induction? [1.3610643403050855]
We argue that the incorporation of additional lexical information into the recent retrieve-and-rank approach should improve lexicon induction.
We demonstrate the efficacy of our proposed approach on XLING, improving over the previous state of the art by an average of 2% across all language pairs.
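As a rough illustration of the retrieve-and-rank pattern this entry refers to, the sketch below retrieves top-k candidates by embedding similarity and reranks them with one extra lexical feature; normalised string similarity is our stand-in feature, not necessarily the paper's.

```python
from difflib import SequenceMatcher
import numpy as np

def retrieve_and_rank(src_vec, src_word, tgt_vecs, tgt_words, k=10, lam=0.3):
    # Stage 1 (retrieve): top-k target candidates by cosine similarity.
    cos = tgt_vecs @ src_vec / (
        np.linalg.norm(tgt_vecs, axis=1) * np.linalg.norm(src_vec))
    top = np.argsort(-cos)[:k]
    # Stage 2 (rank): blend the semantic score with a lexical feature.
    def score(i):
        lex = SequenceMatcher(None, src_word, tgt_words[i]).ratio()
        return (1 - lam) * cos[i] + lam * lex
    return [tgt_words[i] for i in sorted(top, key=score, reverse=True)]
```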
arXiv Detail & Related papers (2024-04-05T17:10:33Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models, and compare the fine-tuned models against the original multilingual LMs.
We report substantial gains on standard benchmarks.
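One simple way to probe a multilingual encoder for lexical knowledge, in the spirit of this entry, is to encode words in isolation and retrieve translations by cosine similarity; the model choice and mean-pooling below are our assumptions, and the paper's probing and fine-tuning procedure is more involved.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

@torch.no_grad()
def word_vec(word):
    out = enc(**tok(word, return_tensors="pt")).last_hidden_state
    return out[0, 1:-1].mean(dim=0)   # mean over subwords, drop [CLS]/[SEP]

src = word_vec("dog")
cands = {w: word_vec(w) for w in ["chien", "chat", "maison"]}
print(max(cands, key=lambda w: torch.cosine_similarity(src, cands[w], dim=0)))
```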
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment [49.3253280592705]
We show it is possible to produce much higher quality lexicons with methods that combine bitext mining and unsupervised word alignment.
Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 $F_1$ points averaged over 12 language pairs.
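The final lexicon-extraction step of such a pipeline can be as simple as counting aligned token pairs; in this toy sketch the sentence pairs and alignment links are given by hand, whereas the paper mines the bitext and aligns it without supervision.

```python
from collections import Counter, defaultdict

def extract_lexicon(aligned_bitext):
    """Keep each source word's most frequently aligned target word."""
    counts = defaultdict(Counter)
    for src_toks, tgt_toks, links in aligned_bitext:   # links: (i, j) pairs
        for i, j in links:
            counts[src_toks[i]][tgt_toks[j]] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

bitext = [(["the", "dog"], ["le", "chien"], [(0, 0), (1, 1)]),
          (["a", "dog"], ["un", "chien"], [(0, 0), (1, 1)])]
print(extract_lexicon(bitext))   # {'the': 'le', 'dog': 'chien', 'a': 'un'}
```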
arXiv Detail & Related papers (2021-01-01T03:12:42Z)
- Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring [41.77270308094212]
We propose an alternative mapping approach for word embeddings in languages other than English.
Rather than aligning two fixed embedding spaces, our method fixes the target-language embeddings and learns a new set of source-language embeddings that are aligned with them.
Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.
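The asymmetry this entry describes, a frozen target space and a re-learned source space, can be caricatured in a few lines: anchored source vectors are pulled toward their fixed target counterparts. A real system would simultaneously keep fitting the source embeddings to monolingual context; this toy update only shows which side moves.

```python
import numpy as np

def anchor_source(E_src, E_tgt_frozen, anchors, lr=0.05, steps=200):
    E = E_src.copy()                    # only source vectors are updated
    for _ in range(steps):
        for i, j in anchors:            # (source index, target index) pairs
            E[i] -= lr * 2.0 * (E[i] - E_tgt_frozen[j])  # grad of ||e_i - t_j||^2
    return E                            # target space is untouched throughout
```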
arXiv Detail & Related papers (2020-12-31T17:10:14Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
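A compact sketch of the joint translate-and-reconstruct objective this entry mentions, under our own simplifying assumptions (single-layer LSTMs, no attention or teacher forcing), might look as follows; dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(src_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.dec_translate = nn.LSTM(dim, dim, batch_first=True)
        self.dec_reconstruct = nn.LSTM(dim, dim, batch_first=True)
        self.to_tgt = nn.Linear(dim, tgt_vocab)   # translation logits
        self.to_src = nn.Linear(dim, src_vocab)   # reconstruction logits

    def forward(self, src_ids):
        enc_out, state = self.encoder(self.emb(src_ids))
        trans, _ = self.dec_translate(enc_out, state)
        recon, _ = self.dec_reconstruct(enc_out, state)
        return self.to_tgt(trans), self.to_src(recon)

model = TranslateAndReconstruct()
src = torch.randint(0, 1000, (2, 7))              # a toy batch of source ids
logits_t, logits_r = model(src)
# Joint loss = translation CE (needs gold target ids) + reconstruction CE:
recon_loss = nn.CrossEntropyLoss()(logits_r.transpose(1, 2), src)
```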
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, UNMT can only translate between a single language pair at a time and cannot produce results for multiple language pairs simultaneously.
In this paper, we introduce and empirically validate a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
- A Simple Approach to Learning Unsupervised Multilingual Embeddings [15.963615360741356]
Recent progress on unsupervised learning of cross-lingual embeddings in the bilingual setting has given impetus to learning a shared embedding space for several languages without supervision.
We propose a simple, two-stage framework that decouples these two sub-problems, bilingual and multilingual embedding learning, and solves them separately using existing techniques, as sketched below.
The proposed approach obtains surprisingly good performance in various tasks such as bilingual lexicon induction, cross-lingual word similarity, multilingual document classification, and multilingual dependency parsing.
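A skeletal version of that decoupling, under our reading of the entry, might be: stage 1 learns one mapping per language into a pivot space (here from seed pairs via orthogonal Procrustes, standing in for the unsupervised bilingual step), and stage 2 simply applies the maps so that every language lives in the shared space.

```python
import numpy as np

def orth_map(X, Y):
    """Stage 1: orthogonal map from one language's space to the pivot's."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def to_shared_space(embeddings, seeds, pivot="en"):
    """Stage 2: apply the per-language maps so all spaces share the pivot."""
    maps = {pivot: np.eye(embeddings[pivot].shape[1])}
    for lang, (X_lang, X_pivot) in seeds.items():   # seed vector pairs per language
        maps[lang] = orth_map(X_lang, X_pivot)
    return {lang: E @ maps[lang] for lang, E in embeddings.items()}
```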
arXiv Detail & Related papers (2020-04-10T05:54:10Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
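As a toy illustration of how contextual embeddings support language identification, the sketch below mean-pools sentence representations and classifies by nearest language centroid; the model and the centroid classifier are our assumptions rather than the paper's exact setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

@torch.no_grad()
def embed(sent):
    return enc(**tok(sent, return_tensors="pt")).last_hidden_state.mean(dim=1)[0]

train = {"en": ["the cat sleeps", "a dog barks"],
         "fr": ["le chat dort", "un chien aboie"]}
centroids = {l: torch.stack([embed(s) for s in ss]).mean(0)
             for l, ss in train.items()}
query = embed("the bird sings")
print(max(centroids,
          key=lambda l: torch.cosine_similarity(query, centroids[l], dim=0)))
```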
arXiv Detail & Related papers (2020-04-09T19:50:32Z)