Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora
- URL: http://arxiv.org/abs/2010.14649v2
- Date: Wed, 20 Oct 2021 01:52:43 GMT
- Title: Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora
- Authors: Takashi Wada, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, Jey Han
Lau
- Abstract summary: We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
- Score: 63.5286019659504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new approach for learning contextualised cross-lingual word
embeddings based on a small parallel corpus (e.g. a few hundred sentence
pairs). Our method obtains word embeddings via an LSTM encoder-decoder model
that simultaneously translates and reconstructs an input sentence. Through
sharing model parameters among different languages, our model jointly trains
the word embeddings in a common cross-lingual space. We also propose to combine
word and subword embeddings to make use of orthographic similarities across
different languages. We base our experiments on real-world data from endangered
languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on
bilingual lexicon induction and word alignment tasks show that our model
outperforms existing methods by a large margin for most language pairs. These
results demonstrate that, contrary to common belief, an encoder-decoder
translation model is beneficial for learning cross-lingual representations even
in extremely low-resource conditions. Furthermore, our model also works well on
high-resource conditions, achieving state-of-the-art performance on a
German-English word-alignment task.
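Below is a minimal sketch (in PyTorch) of the core idea described in the abstract: a single LSTM encoder-decoder whose parameters are shared across languages, trained to both translate and reconstruct the input sentence, with word and subword embeddings combined to exploit orthographic similarity. All names, dimensions, and the exact combination scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: shared LSTM encoder-decoder that translates and reconstructs,
# with word + subword embeddings summed into one representation.
import torch
import torch.nn as nn

class SharedEncoderDecoder(nn.Module):
    def __init__(self, vocab_size, subword_vocab_size, dim=256):
        super().__init__()
        # Word and subword (e.g. character n-gram) embeddings are combined so that
        # orthographically similar words across languages share representation mass.
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.subword_emb = nn.EmbeddingBag(subword_vocab_size, dim, mode="mean")
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def embed(self, word_ids, subword_ids, subword_offsets):
        # word_ids: (batch, seq); subword_ids/offsets use EmbeddingBag's flat format,
        # one bag of subword ids per word, so the number of bags equals batch * seq.
        w = self.word_emb(word_ids)
        s = self.subword_emb(subword_ids, subword_offsets).view_as(w)
        return w + s

    def forward(self, src_emb, tgt_in_emb):
        # Encode the source, then decode conditioned on the final encoder state.
        _, (h, c) = self.encoder(src_emb)
        dec_out, _ = self.decoder(tgt_in_emb, (h, c))
        return self.out(dec_out)

def joint_loss(model, src_emb, trans_in_emb, trans_gold, recon_in_emb, recon_gold):
    # One pass translates the source, another reconstructs it; both losses are summed,
    # so the shared embeddings end up in a common cross-lingual space.
    ce = nn.CrossEntropyLoss()
    trans_logits = model(src_emb, trans_in_emb)
    recon_logits = model(src_emb, recon_in_emb)
    return (ce(trans_logits.flatten(0, 1), trans_gold.flatten())
            + ce(recon_logits.flatten(0, 1), recon_gold.flatten()))
```

After training under this joint objective, the combined word and subword embeddings (and the encoder's contextualised states) can serve as the cross-lingual representations used for bilingual lexicon induction and word alignment.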
Related papers
- Cross-Lingual Word Alignment for ASEAN Languages with Contrastive Learning [5.5119571570277826]
Cross-lingual word alignment plays a crucial role in various natural language processing tasks.
A recent study proposes a BiLSTM-based encoder-decoder model that outperforms pre-trained language models in low-resource settings.
We propose incorporating contrastive learning into the BiLSTM-based encoder-decoder framework.
arXiv Detail & Related papers (2024-07-06T11:56:41Z)
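The entry above adds contrastive learning to a BiLSTM-based encoder-decoder. As a rough illustration, a standard InfoNCE-style term over encoder states of aligned word pairs could look like the following; the paper's actual objective, temperature, and negative sampling may differ.

```python
# Sketch of an InfoNCE-style contrastive term that could be added to a BiLSTM
# encoder-decoder's loss: encoder states of aligned source/target words are
# pulled together, other words in the batch act as negatives.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(src_states, tgt_states, temperature=0.1):
    # src_states, tgt_states: (n_pairs, dim) encoder states of word pairs assumed
    # to be translations of each other.
    src = F.normalize(src_states, dim=-1)
    tgt = F.normalize(tgt_states, dim=-1)
    logits = src @ tgt.t() / temperature          # (n_pairs, n_pairs) cosine similarities
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric InfoNCE: each source state should match its own target and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```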
- Improving Multi-lingual Alignment Through Soft Contrastive Learning [9.454626745893798]
We propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model.
Given translation sentence pairs, we train a multi-lingual model such that the similarity between cross-lingual embeddings follows the similarity of the sentences measured by the mono-lingual teacher model.
arXiv Detail & Related papers (2024-05-25T09:46:07Z)
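The soft-contrastive idea above can be illustrated by matching the student's cross-lingual similarity distribution to the distribution a frozen mono-lingual teacher assigns to the target-side sentences. The loss form and temperature below are assumptions, not the paper's exact formulation.

```python
# Sketch of the "soft" alignment idea: instead of treating only exact translation
# pairs as positives, push the student's cross-lingual similarity distribution
# toward the teacher's mono-lingual similarity distribution over the same batch.
import torch.nn.functional as F

def soft_contrastive_loss(student_src, student_tgt, teacher_tgt, temperature=0.05):
    # student_src/student_tgt: (batch, dim) multilingual embeddings of a translation batch.
    # teacher_tgt: (batch, dim) frozen mono-lingual embeddings of the target sentences.
    s = F.normalize(student_src, dim=-1) @ F.normalize(student_tgt, dim=-1).t()
    t = F.normalize(teacher_tgt, dim=-1) @ F.normalize(teacher_tgt, dim=-1).t()
    # KL divergence between teacher and student similarity distributions (soft targets).
    log_p = F.log_softmax(s / temperature, dim=-1)
    q = F.softmax(t / temperature, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")
```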
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
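A generic way to probe this kind of cross-lingual lexical knowledge is nearest-neighbour bilingual lexicon induction over word embeddings extracted from the (possibly fine-tuned) multilingual encoder. The sketch below is a standard probing recipe, not the specific method of the cited paper.

```python
# Sketch of a simple probe for cross-lingual lexical knowledge: induce a bilingual
# lexicon by cosine nearest-neighbour search between source- and target-language
# word embeddings (e.g. averaged contextual vectors per word type).
import torch
import torch.nn.functional as F

def induce_lexicon(src_words, src_vecs, tgt_words, tgt_vecs, k=1):
    # src_vecs: (S, dim), tgt_vecs: (T, dim) type-level word embeddings.
    sims = F.normalize(src_vecs, dim=-1) @ F.normalize(tgt_vecs, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                  # (S, k) nearest target indices
    return {src_words[i]: [tgt_words[j] for j in row.tolist()]
            for i, row in enumerate(topk)}
```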
- Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs and fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models.
arXiv Detail & Related papers (2021-01-20T17:54:47Z)
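Alignments can be extracted from contextualised embeddings of a sentence pair by softmax-normalising a similarity matrix in both directions and keeping links that are probable under both, a recipe similar in spirit to tools such as SimAlign and awesome-align. The threshold and normalisation below are illustrative assumptions.

```python
# Sketch: read word alignments off contextual embeddings of one sentence pair by
# intersecting source-to-target and target-to-source softmax attention.
import torch

def extract_alignments(src_vecs, tgt_vecs, threshold=0.001):
    # src_vecs: (m, dim), tgt_vecs: (n, dim) contextual embeddings of one sentence pair.
    sim = src_vecs @ tgt_vecs.t()                        # (m, n) dot-product similarities
    p_src2tgt = torch.softmax(sim, dim=1)                # align each source word to targets
    p_tgt2src = torch.softmax(sim, dim=0)                # align each target word to sources
    joint = p_src2tgt * p_tgt2src                        # high only if both directions agree
    return [(i, j) for i, j in (joint > threshold).nonzero().tolist()]
```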
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
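As a rough illustration of a bilingual CBOW-style extension, a word can be predicted both from its monolingual context window and from the words of the aligned sentence in the other language, so that a shared embedding table places both vocabularies in one space. The cited method differs in its exact objective; this is a simplified sketch.

```python
# Sketch of a bilingual CBOW-style objective over sentence-aligned corpora:
# predict a word from its own context and from the aligned sentence's words.
import torch
import torch.nn as nn

class BilingualCBOW(nn.Module):
    def __init__(self, vocab_size, dim=300):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, dim)      # shared input embeddings
        self.out = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, context_ids):
        # context_ids: (batch, window) ids of context words (monolingual window
        # or the whole aligned sentence in the other language).
        return self.out(self.in_emb(context_ids).mean(dim=1))

def step(model, mono_context, mono_target, cross_context, cross_target):
    # Sum of a monolingual CBOW loss and a cross-lingual one over aligned sentence pairs.
    ce = nn.CrossEntropyLoss()
    return ce(model(mono_context), mono_target) + ce(model(cross_context), cross_target)
```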
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.