Bilingual Lexicon Induction via Unsupervised Bitext Construction and
Word Alignment
- URL: http://arxiv.org/abs/2101.00148v1
- Date: Fri, 1 Jan 2021 03:12:42 GMT
- Title: Bilingual Lexicon Induction via Unsupervised Bitext Construction and
Word Alignment
- Authors: Haoyue Shi, Luke Zettlemoyer, Sida I. Wang
- Abstract summary: We show it is possible to produce much higher quality lexicons with methods that combine bitext mining and unsupervised word alignment.
Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 $F_1$ points averaged over 12 language pairs.
- Score: 49.3253280592705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bilingual lexicons map words in one language to their translations in
another, and are typically induced by learning linear projections to align
monolingual word embedding spaces. In this paper, we show it is possible to
produce much higher quality lexicons with methods that combine (1) unsupervised
bitext mining and (2) unsupervised word alignment. Directly applying a pipeline
that uses recent algorithms for both subproblems significantly improves induced
lexicon quality and further gains are possible by learning to filter the
resulting lexical entries, with both unsupervised and semi-supervised schemes.
Our final model outperforms the state of the art on the BUCC 2020 shared task
by 14 $F_1$ points averaged over 12 language pairs, while also providing a more
interpretable approach that allows for rich reasoning of word meaning in
context.
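To make the recipe concrete, here is a minimal toy sketch of the mine-then-align pipeline the abstract describes: take mined sentence pairs and word alignments (both hard-coded below, where the real pipeline would produce them with unsupervised mining and alignment models), count aligned word pairs, and keep frequent pairs as lexicon entries. The count threshold stands in for the learned filtering step described above; it is an illustration, not the authors' implementation.

```python
from collections import Counter

# Toy "mined" bitext: in the real pipeline these pairs would come from
# unsupervised bitext mining over monolingual corpora.
bitext = [
    ("the cat sleeps".split(), "le chat dort".split()),
    ("the dog sleeps".split(), "le chien dort".split()),
    ("the cat eats".split(), "le chat mange".split()),
]

# Toy word alignments (source index, target index): in the real
# pipeline these would come from an unsupervised word aligner.
alignments = [
    {(0, 0), (1, 1), (2, 2)},
    {(0, 0), (1, 1), (2, 2)},
    {(0, 0), (1, 1), (2, 2)},
]

# Count how often each (source word, target word) pair is aligned.
counts = Counter()
for (src, tgt), links in zip(bitext, alignments):
    for i, j in links:
        counts[(src[i], tgt[j])] += 1

# Keep pairs aligned at least `min_count` times as lexicon entries;
# the paper instead learns a filter (unsupervised or semi-supervised).
min_count = 2
lexicon = sorted(pair for pair, c in counts.items() if c >= min_count)
print(lexicon)  # [('cat', 'chat'), ('sleeps', 'dort'), ('the', 'le')]
```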
Related papers
- Semi-Supervised Learning for Bilingual Lexicon Induction [1.8130068086063336]
We consider the problem of aligning two sets of continuous word representations, corresponding to two languages, to a common space in order to infer a bilingual lexicon.
Our experiments on standard benchmarks, inferring dictionaries from English to more than 20 languages, show that our approach consistently outperforms the existing state of the art.
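The "common space" alignment that this line of work (and the projection-based baseline in the abstract above) builds on is often solved in closed form with orthogonal Procrustes over a seed dictionary. The NumPy sketch below illustrates that generic recipe on synthetic data; it is not the specific semi-supervised method of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_seed, n_vocab = 50, 100, 500

# Toy embeddings: the target space is a random orthogonal rotation of
# the source space plus noise, so a linear map can recover alignment.
X = rng.normal(size=(n_vocab, d))                  # source embeddings
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = X @ Q + 0.01 * rng.normal(size=(n_vocab, d))   # target embeddings

# Orthogonal Procrustes on a seed dictionary (here: the first n_seed
# rows, standing in for known translation pairs): W = U V^T, where
# U S V^T = SVD(X_seed^T Y_seed), minimizes ||X_seed W - Y_seed||_F.
U, _, Vt = np.linalg.svd(X[:n_seed].T @ Y[:n_seed])
W = U @ Vt

# Induce translations by cosine nearest neighbor in the mapped space.
mapped = X @ W
sims = mapped @ Y.T / (
    np.linalg.norm(mapped, axis=1, keepdims=True) * np.linalg.norm(Y, axis=1)
)
pred = sims.argmax(axis=1)
print("accuracy:", (pred == np.arange(n_vocab)).mean())  # ~1.0 on toy data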
arXiv Detail & Related papers (2024-02-10T19:27:22Z)
- Don't Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings [64.06041300946517]
We argue that easy-to-access cross-lingual signals should always be considered when developing unsupervised BWE methods.
We show that such cheap signals work well and that they outperform more complex unsupervised methods on distant language pairs.
Our results show that these training signals should not be neglected when building BWEs, even for distant languages.
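The summary does not spell out the signals, but identical word forms shared across the two vocabularies (numbers, names, loanwords) are a standard example of such a cheap cross-lingual signal; whether they match this paper's exact choice is an assumption. A toy sketch of harvesting them as a free seed dictionary:

```python
# A minimal sketch, assuming "cheap signals" include identical word
# forms shared across the two vocabularies (e.g. numbers, names);
# the paper's exact signals may differ.
vocab_src = {"2020", "berlin", "hund", "katze", "haus"}
vocab_tgt = {"2020", "berlin", "dog", "cat", "house"}

# Words spelled identically in both vocabularies form a free seed
# dictionary that can supervise the embedding mapping.
seed_dict = sorted((w, w) for w in vocab_src & vocab_tgt)
print(seed_dict)  # [('2020', '2020'), ('berlin', 'berlin')]
```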
arXiv Detail & Related papers (2022-05-31T12:00:55Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual sentence encoders for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Unsupervised Alignment of Distributional Word Embeddings [0.0]
Cross-domain alignment plays a key role in tasks ranging from machine translation to transfer learning.
We show that the proposed approach achieves good performance on the bilingual lexicon induction task across several language pairs.
arXiv Detail & Related papers (2022-03-09T16:39:06Z)
- Word Embedding Transformation for Robust Unsupervised Bilingual Lexicon Induction [21.782189001319935]
We propose a transformation-based method to increase the isomorphism between the embedding spaces of two languages.
Our approach can achieve competitive or superior performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-05-26T02:09:58Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that performance suffers because these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring [41.77270308094212]
We propose an alternative to conventional mapping approaches for cross-lingual word embeddings.
Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them.
Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.
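A deliberately simplified sketch of the fixed-target idea: the frozen target embeddings never move, and the source embeddings are optimized toward them at known anchor pairs. The paper itself learns the source embeddings from monolingual text with contexts anchored in the frozen target space; only the freeze-one-side optimization scaffold is shown here, with hypothetical 1:1 anchors.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 20, 50

# Frozen target-language embeddings (never updated).
target = rng.normal(size=(n, d))

# Source-language embeddings, randomly initialized and learned so that
# each source word moves toward its (toy, known) target translation.
source = rng.normal(size=(n, d))
anchors = np.arange(n)  # toy 1:1 anchor pairs: source i <-> target i

lr = 0.1
for step in range(200):
    grad = 2 * (source - target[anchors])  # d/dsource ||source - target||^2
    source -= lr * grad

err = np.linalg.norm(source - target[anchors])
print(f"residual after training: {err:.4f}")  # ~0: source aligned to fixed targets
```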
arXiv Detail & Related papers (2020-12-31T17:10:14Z)
- Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings [51.47607125262885]
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
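A common way to score candidate pairs in this mining setup is a ratio-margin criterion: cosine similarity normalized by each sentence's average similarity to its k nearest neighbors on the other side. That criterion is borrowed from the broader bitext-mining literature and is an assumption here, as is the use of synthetic embeddings in place of multilingual BERT:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 32, 200, 4

# Toy sentence embeddings; in the paper these come from multilingual
# BERT (adapted via self-training). Here they are synthetic, with
# pair i <-> i being the true parallel sentences.
src = rng.normal(size=(n, d))
tgt = src + 0.3 * rng.normal(size=(n, d))
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)

sims = src @ tgt.T  # cosine similarities

# Ratio-margin score: cosine normalized by the mean similarity of each
# sentence to its k nearest neighbors on the other side.
src_knn = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # per source sentence
tgt_knn = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # per target sentence
margin = sims / (0.5 * (src_knn[:, None] + tgt_knn[None, :]))

pred = margin.argmax(axis=1)
print("P@1:", (pred == np.arange(n)).mean())
```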
arXiv Detail & Related papers (2020-10-15T14:04:03Z)
- Refinement of Unsupervised Cross-Lingual Word Embeddings [2.4366811507669124]
Cross-lingual word embeddings aim to bridge the gap between high-resource and low-resource languages.
We propose a self-supervised method to refine the alignment of unsupervised bilingual word embeddings.
arXiv Detail & Related papers (2020-02-21T10:39:53Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
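A sketch of what a bilingual CBOW-style update over sentence-aligned data can look like: predict a word in one language from the averaged context vectors of its aligned sentence in the other language. This is a generic illustration of the idea on toy vocabularies, not the paper's exact objective or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(3)
d, lr = 16, 0.05
vocab_en = {"the": 0, "cat": 1, "sleeps": 2}
vocab_fr = {"le": 0, "chat": 1, "dort": 2}

# Context vectors for French, output vectors for English (toy sizes).
ctx = rng.normal(scale=0.1, size=(len(vocab_fr), d))
out = rng.normal(scale=0.1, size=(len(vocab_en), d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Cross-lingual CBOW steps: predict the English word "cat" from the
# average of the vectors of its sentence-aligned French translation.
target = vocab_en["cat"]
fr_sent = [vocab_fr[w] for w in ("le", "chat", "dort")]

for _ in range(100):
    h = ctx[fr_sent].mean(axis=0)                     # cross-lingual context
    err = softmax(out @ h) - np.eye(len(vocab_en))[target]
    ctx[fr_sent] -= lr * (out.T @ err) / len(fr_sent)  # update shared contexts
    out -= lr * np.outer(err, h)                       # update output vectors

print("p(cat | le chat dort) =", softmax(out @ ctx[fr_sent].mean(axis=0))[target])
```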
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.