Analysis of Indic Language Capabilities in LLMs
- URL: http://arxiv.org/abs/2501.13912v1
- Date: Thu, 23 Jan 2025 18:49:33 GMT
- Title: Analysis of Indic Language Capabilities in LLMs
- Authors: Aatman Vaidya, Tarunima Prabhakar, Denny George, Swair Shah,
- Abstract summary: This report evaluates the performance of text-in, text-out Large Language Models (LLMs) in understanding and generating Indic languages.
Hindi is the most widely represented language in models.
While model performance roughly correlates with the number of speakers for the top five languages, results beyond the top five vary across assessments.
- Abstract: This report evaluates the performance of text-in, text-out Large Language Models (LLMs) in understanding and generating Indic languages. This evaluation is used to identify and prioritize Indic languages suited for inclusion in safety benchmarks. We conduct this study by reviewing existing evaluation studies and datasets, and a set of twenty-eight LLMs that support Indic languages. We analyze the LLMs on the basis of their training data, licenses for model and data, type of access, and model developers. We also compare Indic language performance across evaluation datasets and find significant performance disparities across Indic languages. Hindi is the most widely represented language in models. While model performance roughly correlates with the number of speakers for the top five languages, results beyond the top five vary across assessments.
Related papers
- Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.
However, Whisper struggles with unseen languages, those not included in its pre-training.
We propose methods that exploit these relationships to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z)
- Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages [0.0]
This paper presents a comprehensive evaluation of tokenizers used by 12 Large Language Models (LLMs) across all 22 official languages of India.
The SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages.
This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models.
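One common way to compare tokenizers across languages, used in several multilingual tokenizer studies (though not necessarily the exact metric of the paper above), is "fertility": the average number of tokens produced per whitespace-separated word. A minimal sketch, with toy tokenizers standing in for real subword vocabularies:

```python
# Hedged sketch of the "fertility" metric: average tokens per word.
# Lower fertility generally means the tokenizer encodes a language more
# compactly. The tokenizers and sample texts below are illustrative
# placeholders, not artifacts from the study above.

def fertility(tokenize, texts):
    """Average number of tokens per whitespace-separated word across texts."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    return total_tokens / total_words

# Toy tokenizers: one token per word vs. one token per character.
word_tok = lambda s: s.split()
char_tok = lambda s: [c for c in s if not c.isspace()]

sample = ["this is a small sample", "tokenizers differ by language"]
print(fertility(word_tok, sample))  # 1.0 by construction
print(fertility(char_tok, sample))  # much higher: many tokens per word
```

In practice one would replace the toy tokenizers with real subword tokenizers and compare fertility across parallel text in each Indic language; a tokenizer with lower fertility for a language tends to give models a longer effective context and cheaper inference in that language.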
arXiv Detail & Related papers (2024-11-19T05:37:17Z)
- L3Cube-IndicQuest: A Benchmark Question Answering Dataset for Evaluating Knowledge of LLMs in Indic Context [0.4194295877935868]
We present the L3Cube-IndicQuest, a gold-standard factual question-answering benchmark dataset.
The dataset contains 200 question-answer pairs, each for English and 19 Indic languages, covering five domains specific to the Indic region.
arXiv Detail & Related papers (2024-09-13T10:48:35Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Language Ranker: A Metric for Quantifying LLM Performance Across High and Low-Resource Languages [48.40607157158246]
Large Language Models (LLMs) perform better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate.
We propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations.
Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores.
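The general idea of ranking languages by the similarity of their internal representations to English can be sketched with plain cosine similarity. The vectors below are toy stand-ins for per-language mean hidden states; this is an illustration of the technique, not the paper's implementation:

```python
# Hedged sketch of similarity-based language ranking. In practice the
# vectors would be LLM hidden states averaged over parallel text in each
# language; the three toy vectors here are invented for illustration.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_languages(reps, reference="en"):
    """Rank languages by cosine similarity of their mean representation
    to the reference language's representation (highest first)."""
    ref = reps[reference]
    scores = {lang: cosine(vec, ref)
              for lang, vec in reps.items() if lang != reference}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

reps = {
    "en": [1.0, 0.0, 0.2],
    "de": [0.9, 0.1, 0.25],  # toy "high-resource" vector, close to English
    "sw": [0.2, 0.9, 0.1],   # toy "low-resource" vector, far from English
}
print(rank_languages(reps))
```

Under this kind of metric, a higher similarity score for a language is read as a proxy for stronger LLM performance in that language, which matches the paper's reported pattern for high- versus low-resource languages.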
arXiv Detail & Related papers (2024-04-17T16:53:16Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages [0.0]
Indic languages have rich morphosyntax, grammatical genders, free linear word-order, and highly inflectional morphology.
We introduce Vyākarana: a benchmark of gender-balanced Colorless Green sentences in Indic languages for syntactic evaluation of multilingual language models.
We use the datasets from the evaluation tasks to probe five multilingual language models of varying architectures for syntax in Indic languages.
arXiv Detail & Related papers (2021-03-01T09:07:58Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME (Cross-lingual TRansfer Evaluation of Multilinguals) is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.