Decoding the Diversity: A Review of the Indic AI Research Landscape
- URL: http://arxiv.org/abs/2406.09559v1
- Date: Thu, 13 Jun 2024 19:55:20 GMT
- Title: Decoding the Diversity: A Review of the Indic AI Research Landscape
- Authors: Sankalp KJ, Vinija Jain, Sreyoshi Bhaduri, Tamoghna Roy, Aman Chadha
- Abstract summary: Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan.
This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages.
- Score: 0.7864304771129751
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages. Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan, among others. These languages have a rich cultural and linguistic heritage and are spoken by over 1.5 billion people worldwide. With the tremendous market potential and growing demand for natural language processing (NLP) based applications in diverse languages, generative applications for Indic languages pose unique challenges and opportunities for research. Our paper deep dives into the recent advancements in Indic generative modeling, contributing with a taxonomy of research directions, tabulating 84 recent publications. Research directions surveyed in this paper include LLM development, fine-tuning existing LLMs, development of corpora, benchmarking and evaluation, as well as publications around specific techniques, tools, and applications. We found that researchers across the publications emphasize the challenges associated with limited data availability, lack of standardization, and the peculiar linguistic complexities of Indic languages. This work aims to serve as a valuable resource for researchers and practitioners working in the field of NLP, particularly those focused on Indic languages, and contributes to the development of more accurate and efficient LLM applications for these languages.
Related papers
- MILU: A Multi-task Indic Language Understanding Benchmark [7.652738829153342]
Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing Large Language Models in Indic languages.
We introduce MILU, a comprehensive evaluation benchmark designed to address this gap.
With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics.
arXiv Detail & Related papers (2024-11-04T19:17:17Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance [6.907734681124986]
This paper strategically identifies the need for linguistic equity by examining several knowledge editing techniques in multilingual contexts.
We evaluate the performance of models such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama across languages including English, German, French, Italian, Spanish, Hindi, Tamil, and Kannada.
arXiv Detail & Related papers (2024-06-17T01:54:27Z)
- IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages [12.514648269553104]
IndicGenBench is the largest benchmark for evaluating the generation capabilities of large language models (LLMs) on Indic languages.
It is composed of diverse generation tasks like cross-lingual summarization, machine translation, and cross-lingual question answering.
The largest PaLM-2 models perform the best on most tasks; however, there is a significant performance gap in all languages compared to English.
arXiv Detail & Related papers (2024-04-25T17:57:36Z)
- Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- An Overview of Indian Spoken Language Recognition from Machine Learning Perspective [7.27448284043116]
This work is one of the first attempts to present a comprehensive review of the Indian spoken language recognition research field.
An in-depth analysis emphasizes the unique challenges that low-resource conditions and mutual influences among languages pose for developing language identification (LID) systems in the Indian context.
arXiv Detail & Related papers (2022-11-30T11:03:51Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Including Signed Languages in Natural Language Processing [48.62744923724317]
Signed languages are the primary means of communication for many deaf and hard of hearing individuals.
This position paper calls on the NLP community to include signed languages as a research area with high social and scientific impact.
arXiv Detail & Related papers (2021-05-11T17:37:55Z)