NepaliGPT: A Generative Language Model for the Nepali Language
- URL: http://arxiv.org/abs/2506.16399v1
- Date: Thu, 19 Jun 2025 15:31:12 GMT
- Title: NepaliGPT: A Generative Language Model for the Nepali Language
- Authors: Shushanta Pudasaini, Aman Shakya, Siddhartha Shrestha, Sahil Bhatta, Sunil Thapa, Sushmita Palikhe
- Abstract summary: There is no generative language model for the Nepali language, due to which other downstream tasks, including fine-tuning, have not been explored yet. This research proposes NepaliGPT, a generative large language model tailored specifically for the Nepali language.
- Score: 0.10995326465245928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: After the release of ChatGPT, Large Language Models (LLMs) have gained huge popularity in recent days and thousands of variants of LLMs have been released. However, there is no generative language model for the Nepali language, due to which other downstream tasks, including fine-tuning, have not been explored yet. To fill this research gap in the Nepali NLP space, this research proposes NepaliGPT, a generative large language model tailored specifically for the Nepali language. This research introduces an advanced corpus for the Nepali language collected from several sources, called the Devanagari Corpus. Likewise, the research introduces the first NepaliGPT benchmark dataset comprised of 4,296 question-answer pairs in the Nepali language. The proposed LLM NepaliGPT achieves the following metrics in text generation: Perplexity of 26.32245, ROUGE-1 score of 0.2604, causal coherence of 81.25%, and causal consistency of 85.41%.
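The abstract reports perplexity and ROUGE-1 without saying how they were computed. As a rough illustration only, the sketch below assumes the common definitions: perplexity as the exponential of the mean per-token negative log-likelihood (in nats), and ROUGE-1 as a unigram-overlap F1 over whitespace tokens. The function names `perplexity` and `rouge1_f1` are hypothetical, and the paper may use a different tokenizer or ROUGE variant.

```python
import math
from collections import Counter


def perplexity(token_nlls):
    """Perplexity as exp of the mean per-token negative log-likelihood (nats).

    Assumption: this is the standard definition; the paper does not state its own.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 on whitespace tokens, with clipped counts.

    Assumption: whitespace tokenization and the F1 variant are illustrative choices.
    """
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a perplexity of about 26.32 would correspond to a mean loss of roughly 3.27 nats per token.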
Related papers
- Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language [1.6474262142781433]
This study benchmarks multilingual, Indic, Hindi, and Nepali BERT variants to evaluate their effectiveness in Nepali topic classification. Ten pre-trained models, including mBERT, XLM-R, MuRIL, DevBERT, HindiBERT, IndicBERT, and NepBERTa, were fine-tuned and tested. Indic models, particularly MuRIL-large, achieved the highest F1-score of 90.60%, outperforming multilingual and monolingual models.
arXiv Detail & Related papers (2026-02-27T11:42:38Z)
- Towards Nepali-language LLMs: Efficient GPT training with a Nepali BPE tokenizer [0.0]
This study presents a GPT-2-based Nepali language model trained using several training strategies inspired by GPT-3. The model achieved a training loss of 3.168177, a validation loss of 3.081982, and a final perplexity of 21.80, demonstrating its capability to generate coherent Nepali news-style text.
arXiv Detail & Related papers (2025-12-16T16:53:11Z)
- Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali [0.20999222360659603]
Domain-adaptive pre-training (DAPT) focuses on continually training a pre-trained language model to adapt it to a domain it was not originally trained on. We use synthetic data to continue training Llama 3 8B to adapt it to the Nepali language in a 4-bit QLoRA setting. We evaluate the adapted model on its performance, forgetting, and knowledge acquisition.
arXiv Detail & Related papers (2024-12-18T13:53:59Z)
- Development of Pre-Trained Transformer-based Models for the Nepali Language [0.0]
The Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain.
We have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus.
Our models outperformed the existing best model by 2 points on Nep-gLUE benchmark, scoring 95.60 and also outperformed existing models on text generation tasks.
arXiv Detail & Related papers (2024-11-24T06:38:24Z)
- Abstractive Summarization of Low resourced Nepali language using Multilingual Transformers [0.0]
The research addresses key challenges associated with summarizing texts in Nepali by first creating a summarization dataset through web scraping.
The performance of the fine-tuned models was then assessed using ROUGE scores and human evaluation.
The 4-bit quantized mBART with LoRA model was found to be effective in generating better Nepali news headlines.
arXiv Detail & Related papers (2024-09-29T05:58:27Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Can Perplexity Predict Fine-tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali [0.0]
SentencePiece tokenization consistently yields superior results on understanding-based tasks for Nepali. Our research specifically examines sequential transformer models, providing valuable insights for language model development in low-resource languages.
arXiv Detail & Related papers (2024-04-28T05:26:12Z)
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages [3.9018931027384056]
We present "Paramanu", a family of novel language models (LM) for Indian languages.
It covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts.
The models are pretrained on a single GPU with context size of 1024 and vary in size from 13.29 million (M) to 367.5 M parameters.
arXiv Detail & Related papers (2024-01-31T17:58:10Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Native Language Identification with Large Language Models [60.80452362519818]
We show that GPT models are proficient at NLI classification, with GPT-4 setting a new performance record of 91.7% on the benchmark TOEFL11 test set in a zero-shot setting.
We also show that unlike previous fully-supervised settings, LLMs can perform NLI without being limited to a set of known classes.
arXiv Detail & Related papers (2023-12-13T00:52:15Z)
- COVID-19-related Nepali Tweets Classification in a Low Resource Setting [0.15658704610960567]
We identify the eight most common COVID-19 discussion topics among the Twitter community using the Nepali language.
We compare the performance of two state-of-the-art multi-lingual language models for Nepali tweet classification.
arXiv Detail & Related papers (2022-10-11T13:08:37Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- Understanding by Understanding Not: Modeling Negation in Language Models [81.21351681735973]
Negation is a core construction in natural language.
We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences.
We reduce the mean top-1 error rate to 4% on the negated LAMA dataset.
arXiv Detail & Related papers (2021-05-07T21:58:35Z)