L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi
- URL: http://arxiv.org/abs/2211.11187v2
- Date: Tue, 22 Nov 2022 05:38:55 GMT
- Title: L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi
- Authors: Ananya Joshi, Aditi Kajale, Janhavi Gadre, Samruddhi Deode, Raviraj Joshi
- Abstract summary: This work focuses on two low-resource Indian languages, Hindi and Marathi.
We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared using machine translation.
We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi.
- Score: 0.7874708385247353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentence representation from vanilla BERT models does not work well on
sentence similarity tasks. Sentence-BERT models specifically trained on STS or
NLI datasets are shown to provide state-of-the-art performance. However,
building these models for low-resource languages is not straightforward due to
the lack of these specialized datasets. This work focuses on two low-resource
Indian languages, Hindi and Marathi. We train sentence-BERT models for these
languages using synthetic NLI and STS datasets prepared using machine
translation. We show that the strategy of NLI pre-training followed by STSb
fine-tuning is effective in generating high-performance sentence-similarity
models for Hindi and Marathi. The vanilla BERT models trained using this simple
strategy outperform the multilingual LaBSE trained using a complex training
strategy. These models are evaluated on downstream text classification and
similarity tasks. We evaluate these models on real text classification datasets
to show that embeddings obtained from synthetic-data training generalize to real
datasets as well, making this an effective training strategy for low-resource
languages. We also provide a comparative analysis of sentence embeddings from
FastText models, multilingual BERT models (mBERT, IndicBERT, XLM-RoBERTa, MuRIL),
multilingual sentence embedding models (LASER, LaBSE), and monolingual BERT
models based on L3Cube-MahaBERT and HindBERT. We release
L3Cube-MahaSBERT and HindSBERT, the state-of-the-art sentence-BERT models for
Marathi and Hindi respectively. Our work also serves as a guide to building
low-resource sentence embedding models.
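As a concrete illustration of the NLI pre-training followed by STSb fine-tuning strategy described in the abstract, the sketch below uses the sentence-transformers library. The base checkpoint name, the toy training pairs, and the hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the two-stage recipe: NLI pre-training, then STSb fine-tuning.
# Checkpoint name, example pairs, and hyperparameters are assumptions for illustration.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Wrap a monolingual Hindi BERT as a sentence encoder (mean pooling is added automatically).
model = SentenceTransformer("l3cube-pune/hindi-bert-v2")  # assumed base checkpoint

# Stage 1: NLI pre-training on (premise, entailed hypothesis) pairs from translated NLI data.
nli_examples = [
    InputExample(texts=["एक आदमी खाना खा रहा है", "एक व्यक्ति भोजन कर रहा है"]),
    # ... more machine-translated (anchor, positive) pairs ...
]
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=16)
model.fit(train_objectives=[(nli_loader, losses.MultipleNegativesRankingLoss(model))],
          epochs=1, warmup_steps=100)

# Stage 2: STSb fine-tuning on (sentence1, sentence2, similarity score in [0, 1]) pairs.
sts_examples = [
    InputExample(texts=["एक बच्चा खेल रहा है", "एक बच्चा बाहर खेल रहा है"], label=0.8),
    # ... more machine-translated scored pairs ...
]
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=16)
model.fit(train_objectives=[(sts_loader, losses.CosineSimilarityLoss(model))],
          epochs=4, warmup_steps=100)

model.save("hind-sbert-sketch")

# Sentence embeddings for the downstream similarity/classification evaluations.
emb = model.encode(["यह एक उदाहरण वाक्य है", "यह भी एक उदाहरण वाक्य है"], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]))
```

The same two-stage recipe applies to Marathi by swapping in a Marathi base checkpoint and the translated Marathi NLI/STS data.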
Related papers
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
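The model-merging entry above combines separately trained models without any additional training. As a hedged illustration of the simplest variant of that idea, the sketch below averages the weights of two checkpoints that share one architecture; the checkpoint paths are placeholders, and the referenced paper's merging methods are more sophisticated than plain averaging.

```python
# Naive weight-space merge of two same-architecture checkpoints (illustration only;
# checkpoint paths are placeholders, and real merging methods go beyond plain averaging).
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/base-llm")         # e.g. a continually pre-trained model
donor = AutoModelForCausalLM.from_pretrained("path/to/task-tuned-llm")  # e.g. a supervised fine-tuned model

donor_state = donor.state_dict()
merged_state = {}
with torch.no_grad():
    for name, p_base in base.state_dict().items():
        p_donor = donor_state[name]
        # Uniform average of floating-point weights; non-float buffers are kept from the base.
        merged_state[name] = (p_base + p_donor) / 2 if p_base.is_floating_point() else p_base

base.load_state_dict(merged_state)
base.save_pretrained("merged-low-resource-llm")
```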
- Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages [1.14219428942199]
We present L3Cube-HindBERT, a Hindi BERT model pre-trained on Hindi monolingual corpus.
We release DevBERT, a Devanagari BERT model trained on both Marathi and Hindi monolingual datasets.
arXiv Detail & Related papers (2022-11-21T13:02:52Z)
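The L3Cube-HindBERT and DevBERT entry above describes standard masked-language-model pretraining on monolingual Devanagari corpora. The sketch below shows what such continued MLM pretraining looks like with Hugging Face transformers; the starting checkpoint, corpus file, and hyperparameters are assumptions for illustration, not the authors' setup.

```python
# Hedged sketch of continued masked-language-model pretraining on a monolingual corpus.
# Starting checkpoint, corpus file name, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "google/muril-base-cased"  # assumed multilingual starting point
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical plain-text corpus, one sentence per line.
raw = load_dataset("text", data_files={"train": "hindi_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="hindbert-mlm-sketch",
                         per_device_train_batch_size=32,
                         num_train_epochs=1,
                         learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=train_set, data_collator=collator).train()
```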
- Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi [0.966840768820136]
We focus on the Marathi language and evaluate the models on datasets for hate speech detection, sentiment analysis and simple text classification in Marathi.
We use standard multilingual models such as mBERT, IndicBERT and XLM-RoBERTa and compare them with MahaBERT, MahaALBERT and MahaRoBERTa, the monolingual models for Marathi.
We show that monolingual MahaBERT-based models provide richer sentence representations than their multilingual counterparts.
arXiv Detail & Related papers (2022-04-19T05:07:58Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
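The MoEBERT entry above swaps dense feed-forward blocks for a Mixture-of-Experts layer so that each token is processed by only one expert at inference time. The sketch below is a generic top-1 routed MoE feed-forward layer, not MoEBERT's importance-guided adaptation; layer sizes and expert count are arbitrary.

```python
# Generic top-1 mixture-of-experts feed-forward layer (illustrative only).
import torch
import torch.nn as nn

class Top1MoEFeedForward(nn.Module):
    def __init__(self, hidden: int = 768, inner: int = 3072, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, inner), nn.GELU(), nn.Linear(inner, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) -> flatten tokens and route each one to its best expert.
        tokens = x.reshape(-1, x.size(-1))
        gate = torch.softmax(self.router(tokens), dim=-1)  # (tokens, experts)
        weight, choice = gate.max(dim=-1)                   # top-1 routing
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = weight[mask, None] * expert(tokens[mask])
        return out.reshape_as(x)

# Example: a batch of 2 sequences of length 8 with hidden size 768.
layer = Top1MoEFeedForward()
print(layer(torch.randn(2, 8, 768)).shape)  # torch.Size([2, 8, 768])
```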
- Self-Training Vision Language BERTs with a Unified Conditional Model [51.11025371762571]
We propose a self-training approach that allows training VL-BERTs from unlabeled image data.
We use the labeled image data to train a teacher model and use the trained model to generate pseudo captions on unlabeled image data.
By using the proposed self-training approach and only 300k unlabeled extra data, we are able to get competitive or even better performances.
arXiv Detail & Related papers (2022-01-06T11:00:52Z)
- Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks [4.955649816620742]
This paper explores sentence embedding models for BERT and ALBERT.
We take a modified BERT network with siamese and triplet network structures, called Sentence-BERT (SBERT), and replace BERT with ALBERT to create Sentence-ALBERT (SALBERT).
arXiv Detail & Related papers (2021-01-26T09:14:06Z)
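The SBERT/SALBERT entry above builds a siamese sentence encoder by pooling the token outputs of a base transformer and sharing the weights across both inputs. The sketch below shows that construction with the sentence-transformers module API; swapping the ALBERT checkpoint for a BERT one gives the SBERT variant, and the public checkpoint names used here are an assumption.

```python
# Build a Sentence-ALBERT-style encoder: transformer + mean pooling, shared across both inputs.
from sentence_transformers import SentenceTransformer, models, util

word_embedding = models.Transformer("albert-base-v2", max_seq_length=128)  # assumed checkpoint
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
salbert = SentenceTransformer(modules=[word_embedding, pooling])

# Siamese usage: both sentences pass through the same weights, then the embeddings are compared.
emb = salbert.encode(["A man is eating food.", "Someone is having a meal."], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]))
```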
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
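The UNKs Everywhere entry above targets languages whose scripts are largely unseen during pretraining, so most of their tokens map to UNK. One simple ingredient of such adaptation, shown below as a hedged sketch, is extending the tokenizer vocabulary and embedding matrix before continued training; the paper's proposed methods are more data-efficient than this baseline, and the token list here is purely illustrative.

```python
# Extend a multilingual model's vocabulary with subwords from an unseen script
# before continued training (illustrative baseline; token list is hypothetical).
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Hypothetical frequent subwords mined from a small target-language corpus.
new_script_tokens = ["ଭାଷା", "ଲେଖା"]

num_added = tokenizer.add_tokens(new_script_tokens)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are randomly initialized
print(f"Added {num_added} tokens; embedding matrix now has {len(tokenizer)} rows.")
# Continued MLM training on target-language text then learns useful values for the new rows.
```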
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It achieves state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including existing ones as well as composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)