Cross-Lingual Word Embeddings for Turkic Languages
- URL: http://arxiv.org/abs/2005.08340v1
- Date: Sun, 17 May 2020 18:57:23 GMT
- Title: Cross-Lingual Word Embeddings for Turkic Languages
- Authors: Elmurod Kuriyozov, Yerai Doval, Carlos Gómez-Rodríguez
- Abstract summary: Cross-lingual word embeddings can transfer knowledge from a resource-rich language to a low-resource one.
We show how to obtain cross-lingual word embeddings for the Turkish, Uzbek, Azeri, Kazakh, and Kyrgyz languages.
- Score: 1.418033127602866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been an increasing interest in learning cross-lingual word
embeddings to transfer knowledge obtained from a resource-rich language, such
as English, to lower-resource languages for which annotated data is scarce,
such as Turkish, Russian, and many others. In this paper, we present the first
viability study of established techniques to align monolingual embedding spaces
for Turkish, Uzbek, Azeri, Kazakh and Kyrgyz, members of the Turkic family
which is heavily affected by the low-resource constraint. Those techniques are
known to require little explicit supervision, mainly in the form of bilingual
dictionaries, hence being easily adaptable to different domains, including
low-resource ones. We obtain new bilingual dictionaries and new word embeddings
for these languages and show the steps for obtaining cross-lingual word
embeddings using state-of-the-art techniques. Then, we evaluate the results
using the bilingual dictionary induction task. Our experiments confirm that the
obtained bilingual dictionaries outperform previously-available ones, and that
word embeddings from a low-resource language can benefit from resource-rich
closely-related languages when they are aligned together. Furthermore,
evaluation on an extrinsic task (Sentiment analysis on Uzbek) proves that
monolingual word embeddings can, although slightly, benefit from cross-lingual
alignments.
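To make the method concrete, here is a minimal sketch of the kind of dictionary-supervised alignment the paper studies: an orthogonal (Procrustes) map fitted on seed translation pairs, then evaluated with bilingual dictionary induction. The data below are random stand-ins for real monolingual embeddings, and the function names are ours for illustration, not the authors' code.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal (Procrustes) map W minimizing ||XW - Y||_F, where the
    rows of X and Y are the embeddings of seed-dictionary translation
    pairs. Closed-form solution: W = U @ Vt from the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def induce_lexicon(src_emb, tgt_emb, W, k=1):
    """Bilingual dictionary induction: map source vectors into the target
    space, then retrieve nearest target words by cosine similarity."""
    mapped = src_emb @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return np.argsort(-(mapped @ tgt.T), axis=1)[:, :k]

# Toy usage: random matrices stand in for, e.g., Uzbek (source) and
# Turkish (target) monolingual embeddings.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(1000, 300))
tgt_emb = rng.normal(size=(1200, 300))
src_idx = rng.integers(0, 1000, size=500)  # seed dictionary, source side
tgt_idx = rng.integers(0, 1200, size=500)  # seed dictionary, target side
W = procrustes_align(src_emb[src_idx], tgt_emb[tgt_idx])
top5 = induce_lexicon(src_emb, tgt_emb, W, k=5)  # top-5 candidates per source word
```

Toolkits such as VecMap and MUSE add refinements on top of this core step, for instance length normalization, mean centering, and CSLS retrieval to counter hubness.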
Related papers
- Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment [13.997006139875563]
In current models, cross-lingual word representations for low-resource languages are notably under-aligned with those for high-resource languages.
We introduce a novel framework that explicitly aligns words between English and eight low-resource languages, utilizing off-the-shelf word alignment models.
arXiv Detail & Related papers (2024-04-03T05:58:53Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time, starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
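As a hedged sketch of this chain idea (illustrative data structures, not the paper's implementation), each language can be mapped into the space of its already-aligned predecessor, so the target reaches the source space through related intermediate languages:

```python
import numpy as np

def align_chain(embs, dicts):
    """Sketch of chain alignment. embs: embedding matrices ordered
    source -> ... -> target; dicts[i]: (n, 2) array of translation pairs,
    column 0 indexing rows of embs[i + 1], column 1 rows of embs[i]."""
    aligned = [embs[0]]                    # the resource-rich source anchors the space
    for i, pairs in enumerate(dicts):
        X = embs[i + 1][pairs[:, 0]]       # next language's dictionary entries
        Y = aligned[i][pairs[:, 1]]        # translations, already in the shared space
        U, _, Vt = np.linalg.svd(X.T @ Y)  # Procrustes map, as in the sketch above
        aligned.append(embs[i + 1] @ (U @ Vt))  # map the full vocabulary
    return aligned
```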
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples [89.16814518860357]
The objective of this work is to explore the learning of visually grounded speech (VGS) models from a multilingual perspective.
Our key contribution in this work is to leverage the power of a high-resource language in a bilingual visually grounded speech model to improve the performance of a low-resource language.
arXiv Detail & Related papers (2023-03-30T16:34:10Z)
- When Word Embeddings Become Endangered [0.685316573653194]
We present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and translation dictionaries of resource-poor languages.
All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
arXiv Detail & Related papers (2021-03-24T15:42:53Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building bilingual word embeddings (BWEs) in which the vector space of the high-resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z)
- Discovering Bilingual Lexicons in Polyglot Word Embeddings [32.53342453685406]
In this work, we utilize a single Skip-gram model trained on a multilingual corpus yielding polyglot word embeddings.
We present a novel finding that a surprisingly simple constrained nearest-neighbor sampling technique can retrieve bilingual lexicons.
Across three European language pairs, we observe that polyglot word embeddings indeed learn a rich semantic representation of words.
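As a hedged illustration of what such constrained retrieval might look like (the function and its inputs are invented for this sketch, not taken from the paper): the constraint restricts the neighbor search to rows tagged with the other language, since unconstrained neighbors in a polyglot space are usually same-language words.

```python
import numpy as np

def constrained_nn_lexicon(emb, lang_tags, query_rows, tgt_lang, k=1):
    """Nearest-neighbor search in a shared polyglot embedding space,
    with candidates constrained to one language. emb: polyglot matrix;
    lang_tags: language label per row; query_rows: rows to translate."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    candidates = np.where(lang_tags == tgt_lang)[0]  # the constraint
    sims = unit[query_rows] @ unit[candidates].T     # cosine similarities
    return candidates[np.argsort(-sims, axis=1)[:, :k]]

# Hypothetical usage: rows 0-999 English, rows 1000-1999 German, one space.
emb = np.random.default_rng(1).normal(size=(2000, 100))
lang_tags = np.array(["en"] * 1000 + ["de"] * 1000)
pairs = constrained_nn_lexicon(emb, lang_tags, np.arange(10), "de", k=3)
```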
arXiv Detail & Related papers (2020-08-31T03:57:50Z)
- Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation [14.116412358534442]
Methods for improving neural machine translation for low-resource languages are reviewed.
Tests are carried out on three artificially restricted translation tasks and one real-world task.
Experiments show positive effects, especially for scheduled multi-task learning, denoising autoencoders, and subword sampling.
arXiv Detail & Related papers (2020-04-08T14:19:05Z)
- A Common Semantic Space for Monolingual and Cross-Lingual Meta-Embeddings [10.871587311621974]
This paper presents a new technique for creating monolingual and cross-lingual meta-embeddings.
Existing word vectors are projected to a common semantic space using linear transformations and averaging.
The resulting cross-lingual meta-embeddings also exhibit excellent cross-lingual transfer learning capabilities.
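A minimal sketch of the projection-and-averaging recipe, under assumptions (hypothetical names; least-squares maps standing in for whatever transformations the paper actually learns): each source space is projected into a common space with a linear map fitted on shared words, and the per-word projections are averaged.

```python
import numpy as np

def fit_map(src_vecs, common_vecs):
    """Least-squares linear map from one space into the common space,
    fitted on words present in both (one plausible choice of transform)."""
    W, *_ = np.linalg.lstsq(src_vecs, common_vecs, rcond=None)
    return W

def meta_embedding(word, spaces, maps, row_of):
    """Average the projections of `word` from every space containing it.
    spaces: embedding matrices; maps: their linear maps into the common
    space; row_of[i]: dict word -> row index in spaces[i] (illustrative)."""
    projs = [space[rows[word]] @ W
             for space, W, rows in zip(spaces, maps, row_of)
             if word in rows]
    return np.mean(projs, axis=0) if projs else None
```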
arXiv Detail & Related papers (2020-01-17T15:42:29Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages jointly.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
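A hedged PyTorch sketch of such a shared character-level architecture (illustrative, not the paper's exact model): every language shares one character embedding table and one encoder, which is what lets character representations learned on the high-resource languages carry over to the low-resource ones.

```python
import torch.nn as nn

class JointCharTagger(nn.Module):
    """One character vocabulary, embedding table, and BiLSTM shared across
    all languages; a language-id symbol can be prepended to each word's
    character sequence to tell the languages apart."""
    def __init__(self, n_chars, n_tags, char_dim=64, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)  # shared across languages
        self.encoder = nn.LSTM(char_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_tags)

    def forward(self, char_ids):
        # char_ids: (batch, max_word_len) character indices for each word
        states, _ = self.encoder(self.char_emb(char_ids))
        return self.classifier(states.mean(dim=1))  # tag scores per word
```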
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.