On the Language Neutrality of Pre-trained Multilingual Representations
- URL: http://arxiv.org/abs/2004.05160v4
- Date: Tue, 29 Sep 2020 18:48:19 GMT
- Title: On the Language Neutrality of Pre-trained Multilingual Representations
- Authors: Jindřich Libovický, Rudolf Rosa, Alexander Fraser
- Abstract summary: We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
- Score: 70.93503607755055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual contextual embeddings, such as multilingual BERT and
XLM-RoBERTa, have proved useful for many multilingual tasks. Previous work
probed the cross-linguality of the representations indirectly using zero-shot
transfer learning on morphological and syntactic tasks. We instead investigate
the language-neutrality of multilingual contextual embeddings directly and with
respect to lexical semantics. Our results show that contextual embeddings are
more language-neutral and, in general, more informative than aligned static
word-type embeddings, which are explicitly trained for language neutrality.
Contextual embeddings are still only moderately language-neutral by default, so
we propose two simple methods for achieving stronger language neutrality:
first, by unsupervised centering of the representation for each language and
second, by fitting an explicit projection on small parallel data. In addition, we
show how to reach state-of-the-art accuracy on language identification and
match the performance of statistical methods for word alignment of parallel
sentences without using parallel data.
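To make the two proposed methods concrete, below is a minimal NumPy sketch: unsupervised centering subtracts each language's mean embedding, and the explicit projection is fitted on embeddings of a small set of parallel sentence pairs. The function names, toy data, and the least-squares estimator are our illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def center_by_language(embeddings, languages):
    """Method 1: unsupervised centering -- subtract each language's mean."""
    centered = embeddings.copy()
    for lang in set(languages):
        mask = np.array([l == lang for l in languages])
        centered[mask] -= embeddings[mask].mean(axis=0)
    return centered

def fit_projection(src_emb, tgt_emb):
    """Method 2: explicit linear projection fitted on embeddings of a small
    set of parallel sentence pairs (least squares is one plausible choice)."""
    W, *_ = np.linalg.lstsq(src_emb, tgt_emb, rcond=None)
    return W  # apply as src_emb @ W

# Toy usage: random vectors stand in for mean-pooled contextual embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))
langs = ["en", "en", "de", "de", "cs", "cs"]
emb_centered = center_by_language(emb, langs)

par_src, par_tgt = rng.normal(size=(100, 8)), rng.normal(size=(100, 8))
W = fit_projection(par_src, par_tgt)
aligned = par_src @ W
```

Centering requires no supervision at all, only raw text per language; the projection trades a small amount of parallel data for stronger alignment.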
Related papers
- Improving Multi-lingual Alignment Through Soft Contrastive Learning [9.454626745893798]
We propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model.
Given translation sentence pairs, we train a multi-lingual model so that the similarity between cross-lingual embeddings follows the sentence similarity measured by the mono-lingual teacher model.
arXiv Detail & Related papers (2024-05-25T09:46:07Z)
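A hedged PyTorch sketch of this idea: the student's cross-lingual similarity distribution over a batch of translation pairs is pushed toward the similarity distribution computed by a frozen mono-lingual teacher. The temperature, KL objective, and tensor names are our assumptions about one plausible formulation, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(student_src, student_tgt, teacher_src, tau=0.05):
    """student_src / student_tgt: (B, D) multilingual embeddings of B
    translation pairs; teacher_src: (B, D') mono-lingual teacher embeddings
    of the source sentences. All names are illustrative."""
    # Soft targets: distribution over teacher sentence-sentence similarities.
    t = F.normalize(teacher_src, dim=-1)
    target = F.softmax((t @ t.T) / tau, dim=-1)
    # Student predictions: distribution over cross-lingual similarities.
    s = F.normalize(student_src, dim=-1) @ F.normalize(student_tgt, dim=-1).T
    log_pred = F.log_softmax(s / tau, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")

loss = soft_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768),
                             torch.randn(8, 384))
```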
- mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models? [15.90185747024602]
We propose a synthetic task, Multilingual Othello (mOthello), as a testbed to delve into two questions.
We find that models trained with naive multilingual pretraining fail to learn a language-neutral representation across all input languages.
We propose a novel approach - multilingual pretraining with unified output space - that both induces the learning of language-neutral representation and facilitates cross-lingual transfer.
arXiv Detail & Related papers (2024-04-18T18:03:08Z)
- Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment [42.624862172666624]
We propose a simple yet effective cross-lingual alignment framework exploiting pairs of translation sentences.
It aligns the internal sentence representations across different languages via multilingual contrastive learning.
Experimental results show that even with less than 0.1‰ of pre-training tokens, our alignment framework significantly boosts the cross-lingual abilities of generative language models.
arXiv Detail & Related papers (2023-11-14T11:24:08Z)
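The contrastive part of such a framework can be sketched with a standard InfoNCE loss over translation pairs using in-batch negatives; the exact objective and which internal layer is aligned are assumptions here, not details from the abstract.

```python
import torch
import torch.nn.functional as F

def translation_infonce(src, tgt, tau=0.07):
    """src, tgt: (B, D) internal sentence representations of B translation
    pairs; row i of src should match row i of tgt and repel the rest."""
    logits = (F.normalize(src, dim=-1) @ F.normalize(tgt, dim=-1).T) / tau
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetrized: align source-to-target and target-to-source directions.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))

loss = translation_infonce(torch.randn(16, 1024), torch.randn(16, 1024))
```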
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments on cross-lingual benchmarks show significant improvements over strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
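The clustering step can be illustrated in a few lines: derive one vector per language (for example, the mean of its sentence embeddings from the pre-trained model) and group the vectors with k-means. The feature source, number of clusters, and toy data below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
languages = ["en", "de", "cs", "fr", "es", "zh", "ja", "ko"]
lang_vecs = rng.normal(size=(len(languages), 16))  # stand-in representations

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(lang_vecs)
for lang, group in zip(languages, kmeans.labels_):
    print(f"{lang}: representation sprachbund {group}")
```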
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
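A rough PyTorch sketch of such a joint objective: a single LSTM encoder whose final state feeds two decoders, one reconstructing the source sentence and one producing the translation, trained with a summed cross-entropy loss. Layer sizes, the shared output projection, and teacher forcing are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TranslateReconstruct(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.dec_recon = nn.LSTM(dim, dim, batch_first=True)  # reconstruct
        self.dec_trans = nn.LSTM(dim, dim, batch_first=True)  # translate
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.embed(src))
        recon = self.out(self.dec_recon(self.embed(src), state)[0])
        trans = self.out(self.dec_trans(self.embed(tgt), state)[0])
        return recon, trans

model = TranslateReconstruct()
src = torch.randint(0, 1000, (2, 7))  # toy source token ids
tgt = torch.randint(0, 1000, (2, 9))  # toy target token ids
recon, trans = model(src, tgt)
ce = nn.CrossEntropyLoss()
loss = ce(recon.flatten(0, 1), src.flatten()) + ce(trans.flatten(0, 1), tgt.flatten())
```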
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways of quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
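One simple instance of such a quantification, sketched under our own assumptions rather than the paper's exact metrics: project a word set onto a gender direction in the embedding space and report the mean signed cosine.

```python
import numpy as np

def bias_score(word_vecs, he_vec, she_vec):
    """Mean signed cosine of each word vector with the (he - she) direction;
    values far from zero indicate a gendered skew of the word set."""
    direction = he_vec - she_vec
    direction /= np.linalg.norm(direction)
    normed = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    return float((normed @ direction).mean())

rng = np.random.default_rng(0)
occupations = rng.normal(size=(20, 300))  # toy stand-ins for word vectors
print(bias_score(occupations, rng.normal(size=300), rng.normal(size=300)))
```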
- A Common Semantic Space for Monolingual and Cross-Lingual Meta-Embeddings [10.871587311621974]
This paper presents a new technique for creating monolingual and cross-lingual meta-embeddings.
Existing word vectors are projected to a common semantic space using linear transformations and averaging.
The resulting cross-lingual meta-embeddings also exhibit excellent cross-lingual transfer learning capabilities.
arXiv Detail & Related papers (2020-01-17T15:42:29Z)
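The recipe summarized above can be sketched compactly: fit a linear map from each source embedding space into a common space over a shared vocabulary, then average the projected vectors. Least squares is one plausible way to obtain the maps; the toy dimensions are ours.

```python
import numpy as np

def fit_to_common(space, anchor):
    """Least-squares linear map from one embedding space into the common
    (anchor) space, fit on rows representing the same shared vocabulary."""
    W, *_ = np.linalg.lstsq(space, anchor, rcond=None)
    return space @ W

rng = np.random.default_rng(0)
anchor = rng.normal(size=(50, 32))                      # common space
spaces = [rng.normal(size=(50, 32)) for _ in range(3)]  # source embeddings

# Meta-embedding: average of all sources after projection to the anchor.
meta = np.mean([fit_to_common(s, anchor) for s in spaces], axis=0)
```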
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
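A schematic PyTorch reading of a bilingual CBOW step, under our assumptions: besides predicting a word from its own-language context window, the averaged words of the aligned sentence in the other language predict it as well. A real implementation would add negative sampling, omitted here for brevity.

```python
import torch
import torch.nn as nn

vocab, dim = 1000, 64
in_emb = nn.Embedding(vocab, dim)   # context (input) vectors, both languages
out_emb = nn.Embedding(vocab, dim)  # target (output) vectors

def cbow_score(context_ids, target_ids):
    ctx = in_emb(context_ids).mean(dim=1)           # (B, D) averaged context
    return (ctx * out_emb(target_ids)).sum(dim=-1)  # (B,) dot-product scores

mono_ctx = torch.randint(0, vocab, (4, 5))   # own-language context windows
cross_ctx = torch.randint(0, vocab, (4, 8))  # words of the aligned sentence
target = torch.randint(0, vocab, (4,))

# Positive-pair objective from both views; negative sampling omitted.
loss = -(torch.sigmoid(cbow_score(mono_ctx, target)).log().mean()
         + torch.sigmoid(cbow_score(cross_ctx, target)).log().mean())
```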