Development of Pre-Trained Transformer-based Models for the Nepali Language
- URL: http://arxiv.org/abs/2411.15734v2
- Date: Tue, 19 Aug 2025 11:00:49 GMT
- Title: Development of Pre-Trained Transformer-based Models for the Nepali Language
- Authors: Prajwal Thapa, Jinu Nyachhyon, Mridul Sharma, Bal Krishna Bal
- Abstract summary: The Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. We have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Our models outperformed the existing best model by 2 points on the Nep-gLUE benchmark, scoring 95.60, and also outperformed existing models on text generation tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models, namely BERT, RoBERTa, and GPT-2, exclusively for the Nepali language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on the Nep-gLUE benchmark, scoring 95.60, and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.
Related papers
- Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language [1.6474262142781433]
This study benchmarks multilingual, Indic, Hindi, and Nepali BERT variants to evaluate their effectiveness in Nepali topic classification. Ten pre-trained models, including mBERT, XLM-R, MuRIL, DevBERT, HindiBERT, IndicBERT, and NepBERTa, were fine-tuned and tested. Indic models, particularly MuRIL-large, achieved the highest F1-score of 90.60%, outperforming multilingual and monolingual models.
arXiv Detail & Related papers (2026-02-27T11:42:38Z)
- Towards Nepali-language LLMs: Efficient GPT training with a Nepali BPE tokenizer [0.0]
This study presents a GPT-2-based Nepali language model trained using several training strategies inspired by GPT-3. The model achieved a training loss of 3.168177, a validation loss of 3.081982, and a final perplexity of 21.80, demonstrating its capability to generate coherent Nepali news-style text.
arXiv Detail & Related papers (2025-12-16T16:53:11Z)
- NepaliGPT: A Generative Language Model for the Nepali Language [0.10995326465245928]
There is no generative language model for the Nepali language, due to which other downstream tasks, including fine-tuning, have not been explored yet. This research proposes NepaliGPT, a generative large language model tailored specifically for the Nepali language.
arXiv Detail & Related papers (2025-06-19T15:31:12Z)
- Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali [0.20999222360659603]
Domain-adaptive pre-training (DAPT) focuses on continually training a pre-trained language model to adapt it to a domain it was not originally trained on.
We use synthetic data to continue training Llama 3 8B to adapt it to the Nepali language in a 4-bit QLoRA setting.
We evaluate the adapted model on its performance, forgetting, and knowledge acquisition.
arXiv Detail & Related papers (2024-12-18T13:53:59Z)
- Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages [11.540702510360985]
We create a parallel corpus containing more than 2.8 million rows of English-to-Indic and Indic-to-Indic high-quality translation pairs across 8 Indian languages.
We finetune and evaluate NMT models using this corpus and surpass all other publicly available models at in-domain tasks.
arXiv Detail & Related papers (2024-12-12T07:40:55Z)
- Fine-Tuning Small Embeddings for Elevated Performance [0.0]
This work has taken an incomplete BERT model with six attention heads pretrained on Nepali language and finetuned it on previously unseen data.
Results demonstrate that even though the oracle is better on average, finetuning the small embeddings drastically improves results compared to the original baseline.
arXiv Detail & Related papers (2024-11-27T07:25:07Z)
- Abstractive Summarization of Low resourced Nepali language using Multilingual Transformers [0.0]
The research addresses key challenges associated with summarizing texts in Nepali by first creating a summarization dataset through web scraping.
The performance of the fine-tuned models was then assessed using ROUGE scores and human evaluation.
The 4-bit quantized mBART with LoRA model was found to be effective in generating better Nepali news headlines.
arXiv Detail & Related papers (2024-09-29T05:58:27Z)
- Benchmarking Pre-trained Large Language Models' Potential Across Urdu NLP tasks [0.9786690381850356]
Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research.
This study presents an in-depth examination of prominent LLMs, across 14 tasks using 15 Urdu datasets.
Experiments show that SOTA models surpass all the encoder-decoder pre-trained language models in all Urdu NLP tasks with zero-shot learning.
arXiv Detail & Related papers (2024-05-24T11:30:37Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Improving Domain-Specific Retrieval by NLI Fine-Tuning [64.79760042717822]
This article investigates the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
We employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data.
Our results point to the fact that NLI fine-tuning increases the performance of the models in both tasks and both languages, with the potential to improve mono- and multilingual models.
arXiv Detail & Related papers (2023-08-06T12:40:58Z)
- Sabiá: Portuguese Large Language Models [14.801853435122908]
We show that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora.
Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin.
arXiv Detail & Related papers (2023-04-16T20:11:19Z)
- Revisiting CNN for Highly Inflected Bengali and Hindi Language Modeling [0.5382679710017696]
We propose an end to end trainable memory efficient CNN architecture named CoCNN to handle specific characteristics.
In particular, we introduce two learnable convolutional sub-models at word and at sentence level that are end to end trainable.
We show that state-of-the-art (SOTA) Transformer models including pretrained BERT do not necessarily yield the best performance for Bengali and Hindi.
arXiv Detail & Related papers (2021-10-25T15:14:42Z)
- Towards Making the Most of Multilingual Pretraining for Zero-Shot Neural Machine Translation [74.158365847236]
SixT++ is a strong many-to-English NMT model that supports 100 source languages but is trained once with a parallel dataset from only six source languages.
It significantly outperforms CRISS and m2m-100, two strong multilingual NMT systems, with an average gain of 7.2 and 5.0 BLEU respectively.
arXiv Detail & Related papers (2021-10-16T10:59:39Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Pre-training Polish Transformer-based Language Models at Scale [1.0312968200748118]
We present two language models for Polish based on the popular BERT architecture.
We describe our methodology for collecting the data, preparing the corpus, and pre-training the model.
We then evaluate our models on thirteen Polish linguistic tasks, and demonstrate improvements in eleven of them.
arXiv Detail & Related papers (2020-06-07T18:48:58Z)
- Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation [54.52971020087777]
Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models.
Self-supervision improves zero-shot translation quality in multilingual models.
We get up to 33 BLEU on ro-en translation without any parallel data or back-translation.
arXiv Detail & Related papers (2020-05-11T00:20:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.