Language-agnostic BERT Sentence Embedding
- URL: http://arxiv.org/abs/2007.01852v2
- Date: Tue, 8 Mar 2022 05:10:16 GMT
- Title: Language-agnostic BERT Sentence Embedding
- Authors: Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang
- Abstract summary: We investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations.
We show that introducing a pre-trained multilingual language model reduces by 80% the amount of parallel training data required to achieve good performance.
- Score: 14.241717104817713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While BERT is an effective method for learning monolingual sentence
embeddings for semantic similarity and embedding based transfer learning
(Reimers and Gurevych, 2019), BERT based cross-lingual sentence embeddings have
yet to be explored. We systematically investigate methods for learning
multilingual sentence embeddings by combining the best methods for learning
monolingual and cross-lingual representations including: masked language
modeling (MLM), translation language modeling (TLM) (Conneau and Lample, 2019),
dual encoder translation ranking (Guo et al., 2018), and additive margin
softmax (Yang et al., 2019a). We show that introducing a pre-trained
multilingual language model dramatically reduces the amount of parallel
training data required to achieve good performance by 80%. Composing the best
of these methods produces a model that achieves 83.7% bi-text retrieval
accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by
Artetxe and Schwenk (2019b), while still performing competitively on
monolingual transfer learning benchmarks (Conneau and Kiela, 2018). Parallel
data mined from CommonCrawl using our best model is shown to train competitive
NMT models for en-zh and en-de. We publicly release our best multilingual
sentence embedding model for 109+ languages at https://tfhub.dev/google/LaBSE.
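As a quick illustration of how the released embeddings might be used for the bi-text retrieval task described above, the sketch below matches English sentences to their German translations by cosine similarity. It is a sketch, not the authors' code, and it assumes the community sentence-transformers packaging of the model ("sentence-transformers/LaBSE"); the paper itself points only to the TF Hub release at https://tfhub.dev/google/LaBSE.

```python
# Minimal bi-text retrieval sketch (assumes the sentence-transformers
# packaging of LaBSE; the paper's own release is on TF Hub).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

en = ["The cat sits on the mat.", "I would like a cup of coffee."]
de = ["Ich hätte gerne eine Tasse Kaffee.", "Die Katze sitzt auf der Matte."]

# L2-normalised embeddings, so a dot product equals cosine similarity.
en_emb = model.encode(en, normalize_embeddings=True)
de_emb = model.encode(de, normalize_embeddings=True)

similarity = en_emb @ de_emb.T          # shape: (len(en), len(de))
best_match = similarity.argmax(axis=1)  # nearest German sentence per English one

for i, j in enumerate(best_match):
    print(f"{en[i]!r} -> {de[j]!r} (cos={similarity[i, j]:.3f})")
```

This mirrors the Tatoeba-style evaluation in the abstract: for each source sentence, the candidate translation with the highest cosine similarity is retrieved.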
Related papers
- Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding models [0.0]
This paper addresses the deduplication of multilingual textual data using advanced NLP tools.
We compare a two-step method, translation to English followed by embedding with mpnet, against a multilingual embedding model (distiluse).
arXiv Detail & Related papers (2024-06-19T16:48:14Z) - Tagengo: A Multilingual Chat Dataset [3.8073142980733]
We present a high-quality dataset of more than 70k prompt-response pairs in 74 languages.
We use this dataset to train a state-of-the-art open source English LLM to chat multilingually.
arXiv Detail & Related papers (2024-05-21T09:06:36Z) - CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens.
We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio.
To assess performance outside of English, we craft a novel benchmark, FrenchBench.
arXiv Detail & Related papers (2024-02-01T17:17:55Z) - PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z) - MiLMo:Minority Multilingual Pre-trained Language Model [1.6409017540235764]
This paper constructs a multilingual pre-trained model named MiLMo that performs better on minority language tasks.
By comparing the word2vec model and the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream-task research on minority languages.
arXiv Detail & Related papers (2022-12-04T09:28:17Z) - Exploring Teacher-Student Learning Approach for Multi-lingual
Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in a single language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - Towards Fully Bilingual Deep Language Modeling [1.3455090151301572]
We consider whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance in either language.
We create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models.
Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks.
arXiv Detail & Related papers (2020-10-22T12:22:50Z) - Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z) - Reusing a Pretrained Language Model on Languages with Limited Corpora
for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z) - Making Monolingual Sentence Embeddings Multilingual using Knowledge
Distillation [73.65237422910738]
We present an easy and efficient method to extend existing sentence embedding models to new languages.
This makes it possible to create multilingual versions of previously monolingual models; a minimal sketch of the distillation objective follows this list.
arXiv Detail & Related papers (2020-04-21T08:20:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.