L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence
representations using multilingual BERT
- URL: http://arxiv.org/abs/2304.11434v1
- Date: Sat, 22 Apr 2023 15:45:40 GMT
- Title: L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence
representations using multilingual BERT
- Authors: Samruddhi Deode, Janhavi Gadre, Aditi Kajale, Ananya Joshi, Raviraj
Joshi
- Abstract summary: The multilingual Sentence-BERT (SBERT) models map different languages to a common representation space.
We propose a simple yet effective approach to convert vanilla multilingual BERT models into multilingual sentence BERT models using a synthetic corpus.
We show that multilingual BERT models are inherently cross-lingual learners and that this simple baseline fine-tuning approach yields exceptional cross-lingual properties.
- Score: 0.7874708385247353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The multilingual Sentence-BERT (SBERT) models map different languages to a
common representation space and are useful for cross-language similarity and
mining tasks. We propose a simple yet effective approach to convert vanilla
multilingual BERT models into multilingual sentence BERT models using a
synthetic corpus. We simply aggregate translated NLI or STS datasets of the
low-resource target languages together and perform SBERT-like fine-tuning of
the vanilla multilingual BERT model. We show that multilingual BERT models are
inherently cross-lingual learners and that this simple baseline fine-tuning
approach, without explicit cross-lingual training, yields exceptional
cross-lingual properties. We show the efficacy of our approach on 10 major
Indic languages and also demonstrate its applicability to the non-Indic
languages German and French. Using this approach, we further present
L3Cube-IndicSBERT, the first multilingual sentence representation model built
specifically for the Indian languages Hindi, Marathi, Kannada, Telugu,
Malayalam, Tamil, Gujarati, Odia, Bengali, and Punjabi. IndicSBERT exhibits
strong cross-lingual capabilities and performs significantly better than
alternatives such as LaBSE, LASER, and paraphrase-multilingual-mpnet-base-v2 on
Indic cross-lingual and monolingual sentence similarity tasks. We also release
monolingual SBERT models for each of the languages and show that IndicSBERT
performs competitively with its monolingual counterparts. These models have
been evaluated using embedding similarity scores and classification accuracy.
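The training recipe described in the abstract (aggregate translated NLI or STS pairs across the target languages, then apply standard SBERT-style fine-tuning to a vanilla multilingual BERT) maps naturally onto the sentence-transformers library. The snippet below is a minimal sketch under assumptions, not the authors' released training code: the bert-base-multilingual-cased checkpoint, MultipleNegativesRankingLoss over translated NLI pairs, the batch size, and the toy sentence pairs are all illustrative choices.
```python
# Minimal sketch of SBERT-style fine-tuning of a vanilla multilingual BERT
# on aggregated translated NLI pairs (hyperparameters and data are illustrative).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, InputExample, losses, util

# 1. Wrap a vanilla multilingual BERT with mean pooling to obtain a sentence encoder.
word_embedding = models.Transformer("bert-base-multilingual-cased", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# 2. Aggregate translated (premise, entailed hypothesis) pairs from the target
#    languages into one training set; these two pairs are placeholders only.
train_examples = [
    InputExample(texts=["A man is playing a guitar.", "A person plays a musical instrument."]),
    InputExample(texts=["एक आदमी गिटार बजा रहा है।", "एक व्यक्ति वाद्य यंत्र बजा रहा है।"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# 3. SBERT-like fine-tuning; MNR loss uses other in-batch examples as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

# 4. Cross-lingual similarity check: embed the same sentence in two languages
#    and score the pair with cosine similarity, as in embedding-similarity evaluation.
emb = model.encode(["The weather is nice today.", "आज मौसम अच्छा है।"], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]))
```
For the STS portion of the synthetic corpus, losses.CosineSimilarityLoss with the gold similarity scores is the usual SBERT choice; likewise, the released IndicSBERT or monolingual SBERT checkpoints could be loaded directly via SentenceTransformer with the published model name in place of the freshly wrapped mBERT above.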
Related papers
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence
Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Evaluation of contextual embeddings on less-resourced languages [4.417922173735813]
This paper presents the first multilingual empirical comparison of two ELMo models and several monolingual and multilingual BERT models, using 14 tasks in nine languages.
In monolingual settings, monolingual BERT models generally dominate, with a few exceptions such as the dependency parsing task.
In cross-lingual settings, BERT models trained on only a few languages mostly do best, closely followed by massively multilingual BERT models.
arXiv Detail & Related papers (2021-07-22T12:32:27Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Towards Fully Bilingual Deep Language Modeling [1.3455090151301572]
We consider whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance in either language.
We create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models.
Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks.
arXiv Detail & Related papers (2020-10-22T12:22:50Z)
- What makes multilingual BERT multilingual? [60.9051207862378]
In this work, we provide an in-depth experimental study to supplement the existing literature on cross-lingual ability.
We compare the cross-lingual ability of non-contextualized and contextualized representation models trained on the same data.
We find that data size and context window size are crucial factors for transferability.
arXiv Detail & Related papers (2020-10-20T05:41:56Z)
- Looking for Clues of Language in Multilingual BERT to Improve Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
arXiv Detail & Related papers (2020-10-20T05:41:35Z)
- Are All Languages Created Equal in Multilingual BERT? [22.954688396858085]
Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly good cross-lingual performance on several NLP tasks.
We explore how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages.
arXiv Detail & Related papers (2020-05-18T21:15:39Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)