A Simple Approach to Learning Unsupervised Multilingual Embeddings
- URL: http://arxiv.org/abs/2004.05991v2
- Date: Mon, 20 Apr 2020 15:17:01 GMT
- Title: A Simple Approach to Learning Unsupervised Multilingual Embeddings
- Authors: Pratik Jawanpuria, Mayank Meghwanshi, Bamdev Mishra
- Abstract summary: Recent progress on unsupervised learning of cross-lingual embeddings in the bilingual setting has given impetus to learning a shared embedding space for several languages without supervision.
We propose a simple, two-stage framework in which we decouple the above two sub-problems and solve them separately using existing techniques.
The proposed approach obtains surprisingly good performance in various tasks such as bilingual lexicon induction, cross-lingual word similarity, multilingual document classification, and multilingual dependency parsing.
- Score: 15.963615360741356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress on unsupervised learning of cross-lingual embeddings in
the bilingual setting has given impetus to learning a shared embedding space for
several languages without any supervision. A popular framework to solve the
latter problem is to jointly solve the following two sub-problems: 1) learning
unsupervised word alignment between several pairs of languages, and 2) learning
how to map the monolingual embeddings of every language to a shared
multilingual space. In contrast, we propose a simple, two-stage framework in
which we decouple the above two sub-problems and solve them separately using
existing techniques. The proposed approach obtains surprisingly good
performance in various tasks such as bilingual lexicon induction, cross-lingual
word similarity, multilingual document classification, and multilingual
dependency parsing. When distant languages are involved, the proposed solution
demonstrates robustness and outperforms existing unsupervised multilingual word
embedding approaches. Overall, our experimental results encourage the development
of multi-stage models for such challenging problems.
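To make the decoupling concrete, below is a minimal sketch of such a two-stage pipeline in Python/NumPy. It is an illustration under stated assumptions, not the paper's exact recipe: the pivot language (English), the `procrustes_align` helper, and the placeholder index pairs standing in for a stage-1 induced dictionary are all hypothetical; in the paper, each stage reuses existing unsupervised techniques, which may differ from what is shown here.

```python
import numpy as np

def procrustes_align(src, tgt):
    # Orthogonal Procrustes: find the rotation W minimizing ||src @ W - tgt||_F,
    # where corresponding rows of src and tgt are translation pairs.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy monolingual embeddings (vocab_size x dim); in practice these would come
# from fastText/word2vec models trained separately on each language's corpus.
rng = np.random.default_rng(0)
dim = 50
emb = {lang: rng.standard_normal((1000, dim)) for lang in ["en", "fr", "hi"]}

# Stage 1 (assumed): induce a bilingual lexicon between every language and a
# pivot (here English) with any existing unsupervised word aligner. The
# identity index pairs below are placeholders for such an induced lexicon.
pivot = "en"
induced = {lang: (np.arange(500), np.arange(500)) for lang in emb if lang != pivot}

# Stage 2 (assumed): rotate each language into the pivot's space using the
# lexicon from stage 1, yielding a single shared multilingual space.
shared = {pivot: emb[pivot]}
for lang, (src_idx, tgt_idx) in induced.items():
    W = procrustes_align(emb[lang][src_idx], emb[pivot][tgt_idx])
    shared[lang] = emb[lang] @ W  # all words of `lang` now live in the shared space
```

With real embeddings and a real induced lexicon, nearest-neighbour search across the `shared` matrices is what tasks such as bilingual lexicon induction evaluate.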
Related papers
- Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment [42.624862172666624]
We propose a simple yet effective cross-lingual alignment framework exploiting pairs of translation sentences.
It aligns the internal sentence representations across different languages via multilingual contrastive learning.
Experimental results show that even with less than 0.1‰ of the pre-training tokens, our alignment framework significantly boosts the cross-lingual abilities of generative language models.
arXiv Detail & Related papers (2023-11-14T11:24:08Z) - Multi-level Contrastive Learning for Cross-lingual Spoken Language Understanding [90.87454350016121]
We develop novel code-switching schemes to generate hard negative examples for contrastive learning at all levels.
We develop a label-aware joint model to leverage label semantics for cross-lingual knowledge transfer.
arXiv Detail & Related papers (2022-05-07T13:44:28Z) - On Efficiently Acquiring Annotations for Multilingual Models [12.304046317362792]
We show that jointly learning across multiple languages with a single model performs substantially better than the alternatives considered.
We show that this simple approach makes the model data-efficient by letting it allocate its annotation budget to query the languages it is less certain about.
arXiv Detail & Related papers (2022-04-03T07:42:13Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and call each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo (Adversarial and Multilingual Meaning in Context).
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study the effectiveness of multilingual language models to understand their capability and adaptability in the mixed-language setting.
Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z) - CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)