UrduLM: A Resource-Efficient Monolingual Urdu Language Model
- URL: http://arxiv.org/abs/2601.17664v1
- Date: Sun, 25 Jan 2026 02:49:09 GMT
- Title: UrduLM: A Resource-Efficient Monolingual Urdu Language Model
- Authors: Syed Muhammad Ali, Hammad Sajid, Zainab Haider, Ali Muhammad Asad, Haya Fatima, Abdul Samad,
- Abstract summary: Urdu, spoken by 230 million people worldwide, lacks dedicated transformer-based language models. We present UrduLM, a pretrained Urdu monolingual language model trained in low-resource settings. In few-shot evaluations, UrduLM achieves competitive performance with multilingual models up to 30x its size.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Urdu, spoken by 230 million people worldwide, lacks dedicated transformer-based language models and curated corpora. While multilingual models provide limited Urdu support, they suffer from poor performance, high computational costs, and cultural inaccuracies due to insufficient training data. To address these challenges, we present UrduLM, a pretrained Urdu monolingual language model trained in low-resource settings. We curate a 33GB Urdu corpus from diverse sources, develop a custom BPE tokenizer that reduces tokenization overhead by at least 20-30% compared to multilingual alternatives, and pretrain a 100M-parameter decoder-only model. In few-shot evaluations, UrduLM achieves competitive performance with multilingual models up to 30x its size, reaching 66.6% accuracy on sentiment classification and BLEU scores exceeding 30 on grammar correction tasks. The complete methodology -- including corpus, tokenizer, model weights, and evaluation benchmarks -- is released openly to establish a baseline for Urdu NLP research and provide a scalable framework for other underrepresented languages.
Related papers
- Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi [9.65814816271915]
LilMoo is a 0.6-billion-parameter Hindi language model trained entirely from scratch. It is developed through a fully transparent and reproducible pipeline optimized for limited compute environments. Across comprehensive evaluation suites, LilMoo consistently outperforms comparably sized multilingual baselines.
arXiv Detail & Related papers (2026-03-03T20:31:25Z)
- Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training [3.950299047992185]
Urdu, a language spoken by over 230 million people, remains critically underrepresented in modern NLP systems. We introduce Qalb, an Urdu language model developed through a two-stage approach: continued pre-training followed by supervised fine-tuning. Our results demonstrate that continued pre-training on diverse, high-quality language data, combined with targeted instruction fine-tuning, effectively adapts foundation models to low-resource languages.
arXiv Detail & Related papers (2026-01-13T02:05:05Z)
- UrBLiMP: A Benchmark for Evaluating the Linguistic Competence of Large Language Models in Urdu [12.952822154200497]
We present the Urdu Benchmark of Linguistic Minimal Pairs (UrBLiMP). UrBLiMP comprises 5,696 minimal pairs targeting ten core syntactic phenomena. A human evaluation of UrBLiMP annotations yielded a 96.10% inter-annotator agreement.
arXiv Detail & Related papers (2025-08-01T18:16:37Z)
- UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings [0.7874708385247353]
This paper introduces UrduLLaMA 1.0, a model derived from the open-source Llama-3.1-8B-Instruct architecture. We leverage Low-Rank Adaptation (LoRA) to fine-tune the model on 41,000 Urdu instructions and approximately 50,000 English-Urdu translation pairs.
arXiv Detail & Related papers (2025-02-24T08:38:21Z)
- Automatic Speech Recognition for the Ika Language [0.0]
We fine-tune the pretrained wav2vec 2.0 Massively Multilingual Speech Models on a high-quality speech dataset compiled from New Testament Bible translations in Ika.
Our results show that fine-tuning multilingual pretrained models achieves a Word Error Rate (WER) of 0.5377 and Character Error Rate (CER) of 0.2651 with just over 1 hour of training data.
arXiv Detail & Related papers (2024-10-01T11:56:42Z)
- Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks [0.9786690381850356]
This study presents an in-depth examination of 7 prominent Large Language Models (LLMs) across 17 tasks using 22 datasets and 13.8 hours of speech, in a zero-shot setting, and compares their performance against state-of-the-art (SOTA) models. Our results emphasize that models with fewer parameters but richer language-specific data, like Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, in several tasks.
arXiv Detail & Related papers (2024-05-24T11:30:37Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model [92.91310997807936]
NLLB-200 is a set of multilingual Neural Machine Translation models that cover 202 languages.
We propose a pruning method that enables the removal of up to 80% of experts without further finetuning.
arXiv Detail & Related papers (2022-12-19T19:29:40Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.