Bhaasha, Bhasa, Zaban: A Survey for Low-Resourced Languages in South Asia -- Current Stage and Challenges
- URL: http://arxiv.org/abs/2509.11570v1
- Date: Mon, 15 Sep 2025 04:31:22 GMT
- Title: Bhaasha, Bhasa, Zaban: A Survey for Low-Resourced Languages in South Asia -- Current Stage and Challenges
- Authors: Sampoorna Poria, Xiaolei Huang,
- Abstract summary: This survey examines current efforts and challenges of NLP models for South Asian languages.<n>We present advances and gaps across 3 essential aspects: data, models, & tasks.<n>Our findings highlight substantial issues, including missing data in critical domains (e.g., health), code-mixing, and lack of standardized evaluation benchmarks.
- Score: 2.261759428153489
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Rapid developments of large language models have revolutionized many NLP tasks for English data. Unfortunately, the models and their evaluations for low-resource languages are being overlooked, especially for languages in South Asia. Although there are more than 650 languages in South Asia, many of them either have very limited computational resources or are missing from existing language models. Thus, a concrete question to be answered is: Can we assess the current stage and challenges to inform our NLP community and facilitate model developments for South Asian languages? In this survey, we have comprehensively examined current efforts and challenges of NLP models for South Asian languages by retrieving studies since 2020, with a focus on transformer-based models, such as BERT, T5, & GPT. We present advances and gaps across 3 essential aspects: data, models, & tasks, such as available data sources, fine-tuning strategies, & domain applications. Our findings highlight substantial issues, including missing data in critical domains (e.g., health), code-mixing, and lack of standardized evaluation benchmarks. Our survey aims to raise awareness within the NLP community for more targeted data curation, unify benchmarks tailored to cultural and linguistic nuances of South Asia, and encourage an equitable representation of South Asian languages. The complete list of resources is available at: https://github.com/trust-nlp/LM4SouthAsia-Survey.
Related papers
- Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation [7.383944919243126]
We propose a data augmentation technique that generates culturally plausible sentences and experiments on four low-resource Pakistani languages.<n>By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto.
arXiv Detail & Related papers (2025-04-07T15:18:34Z) - NaijaNLP: A Survey of Nigerian Low-Resource Languages [0.0]
Three languages -- Hausa, Yorub'a and Igbo -- account for about 60% of the spoken languages in Nigeria.<n>These languages are categorised as low-resource due to insufficient resources to support tasks in computational linguistics.<n>This study presents the first comprehensive review of advancements in low-resource NLP (LR-NLP) research across the three major Nigerian languages.
arXiv Detail & Related papers (2025-02-27T05:48:51Z) - SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z) - Recent Advancements and Challenges of Turkic Central Asian Language Processing [4.189204855014775]
Research in NLP for Central Asian Turkic languages faces typical low-resource language challenges.
Recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks.
arXiv Detail & Related papers (2024-07-06T08:58:26Z) - Decoding the Diversity: A Review of the Indic AI Research Landscape [0.7864304771129751]
Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan.
This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages.
arXiv Detail & Related papers (2024-06-13T19:55:20Z) - From Multiple-Choice to Extractive QA: A Case Study for English and Arabic [51.13706104333848]
We explore the feasibility of repurposing an existing multilingual dataset for a new NLP task.<n>We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic.<n>We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.<n>This survey delves into an important attribute of these datasets: the dialect of a language.<n>Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local
Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z) - Computational historical linguistics and language diversity in South
Asia [1.5293427903448025]
South Asia is home to a plethora of languages, many of which severely lack access to new language technologies.
This linguistic diversity also results in a research environment conducive to the study of comparative, contact, and historical linguistics.
We claim that data scatteredness is the primary obstacle in the development of South Asian language technology.
arXiv Detail & Related papers (2022-03-23T16:36:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.