L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi
- URL: http://arxiv.org/abs/2211.11187v2
- Date: Tue, 22 Nov 2022 05:38:55 GMT
- Title: L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi
- Authors: Ananya Joshi, Aditi Kajale, Janhavi Gadre, Samruddhi Deode, Raviraj Joshi
- Abstract summary: This work focuses on two low-resource Indian languages, Hindi and Marathi.
We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared using machine translation.
We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi.
- Score: 0.7874708385247353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentence representation from vanilla BERT models does not work well on
sentence similarity tasks. Sentence-BERT models specifically trained on STS or
NLI datasets are shown to provide state-of-the-art performance. However,
building these models for low-resource languages is not straightforward due to
the lack of these specialized datasets. This work focuses on two low-resource
Indian languages, Hindi and Marathi. We train sentence-BERT models for these
languages using synthetic NLI and STS datasets prepared using machine
translation. We show that the strategy of NLI pre-training followed by STSb
fine-tuning is effective in generating high-performance sentence-similarity
models for Hindi and Marathi. The vanilla BERT models trained using this simple
strategy outperform the multilingual LaBSE trained using a complex training
strategy. These models are evaluated on downstream text classification and
similarity tasks. We evaluate these models on real text classification datasets
to show that embeddings obtained from synthetic-data training generalize to real
datasets as well, making this an effective training strategy for low-resource
languages. We also provide a comparative analysis of sentence embeddings from
FastText models, multilingual BERT models (mBERT, IndicBERT, XLM-RoBERTa, MuRIL),
multilingual sentence embedding models (LASER, LaBSE), and monolingual BERT
models based on L3Cube-MahaBERT and HindBERT. We release
L3Cube-MahaSBERT and HindSBERT, the state-of-the-art sentence-BERT models for
Marathi and Hindi respectively. Our work also serves as a guide to building
low-resource sentence embedding models.
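As a concrete illustration of the NLI pre-training followed by STSb fine-tuning strategy described in the abstract, the sketch below uses the sentence-transformers library. The base checkpoint name, the toy training pairs, and the hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the two-stage recipe: NLI pre-training, then STSb fine-tuning.
# Checkpoint name, example pairs, and hyperparameters are assumptions for illustration.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Wrap a monolingual Hindi BERT as a sentence encoder (mean pooling is added automatically).
model = SentenceTransformer("l3cube-pune/hindi-bert-v2")  # assumed base checkpoint

# Stage 1: NLI pre-training on (premise, entailed hypothesis) pairs from translated NLI data.
nli_examples = [
    InputExample(texts=["एक आदमी खाना खा रहा है", "एक व्यक्ति भोजन कर रहा है"]),
    # ... more machine-translated (anchor, positive) pairs ...
]
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=16)
model.fit(train_objectives=[(nli_loader, losses.MultipleNegativesRankingLoss(model))],
          epochs=1, warmup_steps=100)

# Stage 2: STSb fine-tuning on (sentence1, sentence2, similarity score in [0, 1]) pairs.
sts_examples = [
    InputExample(texts=["एक बच्चा खेल रहा है", "एक बच्चा बाहर खेल रहा है"], label=0.8),
    # ... more machine-translated scored pairs ...
]
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=16)
model.fit(train_objectives=[(sts_loader, losses.CosineSimilarityLoss(model))],
          epochs=4, warmup_steps=100)

model.save("hind-sbert-sketch")

# Sentence embeddings for the downstream similarity/classification evaluations.
emb = model.encode(["यह एक उदाहरण वाक्य है", "यह भी एक उदाहरण वाक्य है"], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]))
```

The same two-stage recipe applies to Marathi by swapping in a Marathi base checkpoint and the translated Marathi NLI/STS data.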
Related papers
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
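The model-merging entry above combines separately trained models without any additional training. As a hedged illustration of the simplest variant of that idea, the sketch below averages the weights of two checkpoints that share one architecture; the checkpoint paths are placeholders, and the referenced paper's merging methods are more sophisticated than plain averaging.

```python
# Naive weight-space merge of two same-architecture checkpoints (illustration only;
# checkpoint paths are placeholders, and real merging methods go beyond plain averaging).
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/base-llm")         # e.g. a continually pre-trained model
donor = AutoModelForCausalLM.from_pretrained("path/to/task-tuned-llm")  # e.g. a supervised fine-tuned model

donor_state = donor.state_dict()
merged_state = {}
with torch.no_grad():
    for name, p_base in base.state_dict().items():
        p_donor = donor_state[name]
        # Uniform average of floating-point weights; non-float buffers are kept from the base.
        merged_state[name] = (p_base + p_donor) / 2 if p_base.is_floating_point() else p_base

base.load_state_dict(merged_state)
base.save_pretrained("merged-low-resource-llm")
```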
- Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages [1.14219428942199]
We present L3Cube-HindBERT, a Hindi BERT model pre-trained on Hindi monolingual corpus.
We release DevBERT, a Devanagari BERT model trained on both Marathi and Hindi monolingual datasets.
arXiv Detail & Related papers (2022-11-21T13:02:52Z)
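The L3Cube-HindBERT and DevBERT entry above describes standard masked-language-model pretraining on monolingual Devanagari corpora. The sketch below shows what such continued MLM pretraining looks like with Hugging Face transformers; the starting checkpoint, corpus file, and hyperparameters are assumptions for illustration, not the authors' setup.

```python
# Hedged sketch of continued masked-language-model pretraining on a monolingual corpus.
# Starting checkpoint, corpus file name, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "google/muril-base-cased"  # assumed multilingual starting point
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical plain-text corpus, one sentence per line.
raw = load_dataset("text", data_files={"train": "hindi_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="hindbert-mlm-sketch",
                         per_device_train_batch_size=32,
                         num_train_epochs=1,
                         learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=train_set, data_collator=collator).train()
```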
- Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi [0.966840768820136]
We focus on the Marathi language and evaluate the models on datasets for hate speech detection, sentiment analysis and simple text classification in Marathi.
We use standard multilingual models such as mBERT, IndicBERT and XLM-RoBERTa and compare them with MahaBERT, MahaALBERT and MahaRoBERTa, the monolingual models for Marathi.
We show that monolingual MahaBERT-based models provide richer sentence representations than their multilingual counterparts.
arXiv Detail & Related papers (2022-04-19T05:07:58Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
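The MoEBERT entry above swaps dense feed-forward blocks for a Mixture-of-Experts layer so that each token is processed by only one expert at inference time. The sketch below is a generic top-1 routed MoE feed-forward layer, not MoEBERT's importance-guided adaptation; layer sizes and expert count are arbitrary.

```python
# Generic top-1 mixture-of-experts feed-forward layer (illustrative only).
import torch
import torch.nn as nn

class Top1MoEFeedForward(nn.Module):
    def __init__(self, hidden: int = 768, inner: int = 3072, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, inner), nn.GELU(), nn.Linear(inner, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) -> flatten tokens and route each one to its best expert.
        tokens = x.reshape(-1, x.size(-1))
        gate = torch.softmax(self.router(tokens), dim=-1)  # (tokens, experts)
        weight, choice = gate.max(dim=-1)                   # top-1 routing
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = weight[mask, None] * expert(tokens[mask])
        return out.reshape_as(x)

# Example: a batch of 2 sequences of length 8 with hidden size 768.
layer = Top1MoEFeedForward()
print(layer(torch.randn(2, 8, 768)).shape)  # torch.Size([2, 8, 768])
```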
- Self-Training Vision Language BERTs with a Unified Conditional Model [51.11025371762571]
We propose a self-training approach that allows training VL-BERTs from unlabeled image data.
We use the labeled image data to train a teacher model and use the trained model to generate pseudo captions on unlabeled image data.
By using the proposed self-training approach and only 300k unlabeled extra data, we are able to get competitive or even better performances.
arXiv Detail & Related papers (2022-01-06T11:00:52Z)
- Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks [4.955649816620742]
This paper explores sentence embedding models for BERT and ALBERT.
We take a modified BERT network with siamese and triplet network structures, called Sentence-BERT (SBERT), and replace BERT with ALBERT to create Sentence-ALBERT (SALBERT).
arXiv Detail & Related papers (2021-01-26T09:14:06Z)
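The SBERT/SALBERT entry above builds a siamese sentence encoder by pooling the token outputs of a base transformer and sharing the weights across both inputs. The sketch below shows that construction with the sentence-transformers module API; swapping the ALBERT checkpoint for a BERT one gives the SBERT variant, and the public checkpoint names used here are an assumption.

```python
# Build a Sentence-ALBERT-style encoder: transformer + mean pooling, shared across both inputs.
from sentence_transformers import SentenceTransformer, models, util

word_embedding = models.Transformer("albert-base-v2", max_seq_length=128)  # assumed checkpoint
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
salbert = SentenceTransformer(modules=[word_embedding, pooling])

# Siamese usage: both sentences pass through the same weights, then the embeddings are compared.
emb = salbert.encode(["A man is eating food.", "Someone is having a meal."], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]))
```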
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
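The UNKs Everywhere entry above targets languages whose scripts are largely unseen during pretraining, so most of their tokens map to UNK. One simple ingredient of such adaptation, shown below as a hedged sketch, is extending the tokenizer vocabulary and embedding matrix before continued training; the paper's proposed methods are more data-efficient than this baseline, and the token list here is purely illustrative.

```python
# Extend a multilingual model's vocabulary with subwords from an unseen script
# before continued training (illustrative baseline; token list is hypothetical).
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Hypothetical frequent subwords mined from a small target-language corpus.
new_script_tokens = ["ଭାଷା", "ଲେଖା"]

num_added = tokenizer.add_tokens(new_script_tokens)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are randomly initialized
print(f"Added {num_added} tokens; embedding matrix now has {len(tokenizer)} rows.")
# Continued MLM training on target-language text then learns useful values for the new rows.
```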
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It achieves state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including existing ones as well as composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)