Discovering Bilingual Lexicons in Polyglot Word Embeddings
- URL: http://arxiv.org/abs/2008.13347v1
- Date: Mon, 31 Aug 2020 03:57:50 GMT
- Title: Discovering Bilingual Lexicons in Polyglot Word Embeddings
- Authors: Ashiqur R. KhudaBukhsh, Shriphani Palakodety, Tom M. Mitchell
- Abstract summary: In this work, we utilize a single Skip-gram model trained on a multilingual corpus yielding polyglot word embeddings.
We present a novel finding that a surprisingly simple constrained nearest-neighbor sampling technique can retrieve bilingual lexicons.
Across three European language pairs, we observe that polyglot word embeddings indeed learn a rich semantic representation of words.
- Score: 32.53342453685406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bilingual lexicons and phrase tables are critical resources for modern
Machine Translation systems. Although recent results show that without any seed
lexicon or parallel data, highly accurate bilingual lexicons can be learned
using unsupervised methods, such methods rely on the existence of large, clean
monolingual corpora. In this work, we utilize a single Skip-gram model trained
on a multilingual corpus yielding polyglot word embeddings, and present a novel
finding that a surprisingly simple constrained nearest-neighbor sampling
technique in this embedding space can retrieve bilingual lexicons, even in
harsh social media data sets predominantly written in English and Romanized
Hindi and often exhibiting code switching. Our method does not require
monolingual corpora, seed lexicons, or any other such resources. Additionally,
across three European language pairs, we observe that polyglot word embeddings
indeed learn a rich semantic representation of words and substantial bilingual
lexicons can be retrieved using our constrained nearest-neighbor sampling. We
investigate potential reasons and downstream applications in settings spanning
both clean texts and noisy social media data sets, and in both resource-rich
and under-resourced language pairs.
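To make the method concrete, here is a minimal sketch of the two-step recipe the abstract describes, assuming gensim >= 4.0: train one Skip-gram model on a single mixed-language stream, then retrieve translation candidates by nearest-neighbor search constrained to the other language's vocabulary. The toy corpus, the language-tagged vocabulary, and all hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: polyglot Skip-gram + constrained nearest-neighbor
# sampling. Assumes gensim >= 4.0; corpus and vocabularies are toys.
import numpy as np
from gensim.models import Word2Vec

# Toy "polyglot" corpus: English and Romanized Hindi sentences interleaved
# in one training stream (no corpus separation, no seed lexicon).
corpus = [
    ["the", "dog", "runs", "fast"],
    ["kutta", "tez", "daudta", "hai"],
    ["the", "cat", "sleeps"],
    ["billi", "soti", "hai"],
]

# A single Skip-gram model (sg=1) over the mixed corpus yields polyglot
# embeddings: every word of every language lives in one vector space.
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1, seed=0)

def constrained_nearest_neighbors(word, target_vocab, k=3):
    """Nearest neighbors of `word`, restricted to `target_vocab` (tokens
    identified as the target language, e.g. via a language-ID step)."""
    v = model.wv[word]
    v = v / np.linalg.norm(v)
    scored = []
    for t in target_vocab:
        if t != word and t in model.wv:
            u = model.wv[t]
            scored.append((t, float(np.dot(v, u / np.linalg.norm(u)))))
    return sorted(scored, key=lambda x: -x[1])[:k]

# Hypothetical target-language vocabulary; on real data this would come
# from per-token language identification over the corpus.
hindi_vocab = {"kutta", "tez", "daudta", "hai", "billi", "soti"}
print(constrained_nearest_neighbors("dog", hindi_vocab))
```

The constraint is the key design choice: in a polyglot space, a word's unconstrained nearest neighbors are typically same-language words, so restricting candidates to tokens identified as the other language is what surfaces the bilingual lexicon.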
Related papers
- Progressive Sentiment Analysis for Code-Switched Text Data [26.71396390928905]
We focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data.
We propose a framework that takes the distinction between resource-rich and low-resource language into account.
arXiv Detail & Related papers (2022-10-25T23:13:53Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters (a minimal probing sketch follows this list).
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models, and compare the fine-tuned models against the original multilingual LMs.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation [133.7313847857935]
Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text respectively.
arXiv Detail & Related papers (2022-03-17T16:48:22Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence (a minimal sketch of this dual objective follows this list).
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Cross-Lingual Word Embeddings for Turkic Languages [1.418033127602866]
Cross-lingual word embeddings can transfer knowledge from a resource-rich language to a low-resource one.
We show how to obtain cross-lingual word embeddings for the Turkish, Uzbek, Azeri, Kazakh, and Kyrgyz languages.
arXiv Detail & Related papers (2020-05-17T18:57:23Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets (an illustrative evaluation sketch follows this list).
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
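The probing idea in the "Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders" entry can be illustrated with a small bilingual-lexicon-induction check: embed word lists from two languages with one multilingual encoder and measure how often gold translations retrieve each other. This is a hedged sketch of a generic probing setup, not that paper's exact protocol; the model name and word lists are assumptions.

```python
# Sketch: probe a multilingual encoder for cross-lingual lexical knowledge
# via translation retrieval. Model choice and word lists are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(word):
    """Mean-pool the encoder's last hidden states over the word's subwords."""
    batch = tok(word, return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch).last_hidden_state  # (1, seq_len, hidden)
    return out.mean(dim=1).squeeze(0)

english = ["dog", "house", "water"]
german = ["Hund", "Haus", "Wasser"]  # gold translations, in order
e_vecs = torch.stack([embed(w) for w in english])
g_vecs = torch.stack([embed(w) for w in german])

# Cosine retrieval: for each English word, rank all German candidates
# and check whether the gold translation comes out on top.
sims = torch.nn.functional.cosine_similarity(
    e_vecs.unsqueeze(1), g_vecs.unsqueeze(0), dim=-1)
p_at_1 = (sims.argmax(dim=1) == torch.arange(len(english))).float().mean()
print(f"P@1: {p_at_1:.2f}")
```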
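For the "Learning Contextualised Cross-lingual Word Embeddings" entry, here is a minimal PyTorch sketch of the dual-objective idea: one LSTM encoder whose outputs feed two decoders, one reconstructing the source sentence and one translating it, with the input embedding table serving as the learned cross-lingual embeddings. Decoding is deliberately simplified (no teacher forcing or attention); all sizes and names are illustrative, not the authors' architecture.

```python
# Sketch of jointly translating and reconstructing with one encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TranslateAndReconstruct(nn.Module):
    """One encoder, two decoders: reconstruct the source and translate it.
    The input embedding table is what would be kept as the cross-lingual
    word embeddings after training."""
    def __init__(self, src_vocab, tgt_vocab, dim=128):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.dec_recon = nn.LSTM(dim, dim, batch_first=True)
        self.dec_trans = nn.LSTM(dim, dim, batch_first=True)
        self.out_recon = nn.Linear(dim, src_vocab)
        self.out_trans = nn.Linear(dim, tgt_vocab)

    def forward(self, src):
        enc_out, state = self.encoder(self.embed(src))
        # Simplification: both decoders read the encoder outputs directly
        # (no teacher forcing or attention, to keep the sketch short).
        recon = self.out_recon(self.dec_recon(enc_out, state)[0])
        trans = self.out_trans(self.dec_trans(enc_out, state)[0])
        return recon, trans

model = TranslateAndReconstruct(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (4, 7))  # a batch of source sentences
tgt = torch.randint(0, 1200, (4, 7))  # their parallel translations
recon, trans = model(src)
# Joint loss over both objectives drives the shared embeddings.
loss = F.cross_entropy(recon.transpose(1, 2), src) \
     + F.cross_entropy(trans.transpose(1, 2), tgt)
loss.backward()
```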
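Finally, the Multi-SimLex entry describes an evaluation resource; a typical consumption pattern is to score each concept pair with a model and report Spearman's rho against the human ratings. The toy pairs and random vectors below are assumptions standing in for a real dataset and embedding model.

```python
# Sketch: word-similarity evaluation in the Multi-SimLex style.
import numpy as np
from scipy.stats import spearmanr

# (word1, word2, human similarity rating) -- hypothetical toy pairs.
pairs = [
    ("car", "automobile", 5.8),
    ("car", "bicycle", 3.1),
    ("car", "banana", 0.2),
]

# Random vectors stand in for a trained embedding model.
rng = np.random.default_rng(0)
vectors = {w: rng.standard_normal(50) for p in pairs for w in p[:2]}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [r for _, _, r in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman's rho: {rho:.3f}")
```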
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.