Eka-Eval: A Comprehensive Evaluation Framework for Large Language Models in Indian Languages
- URL: http://arxiv.org/abs/2507.01853v3
- Date: Sat, 12 Jul 2025 05:20:11 GMT
- Title: Eka-Eval: A Comprehensive Evaluation Framework for Large Language Models in Indian Languages
- Authors: Samridhi Raj Sinha, Rajvee Sheth, Abhishek Upperwal, Mayank Singh
- Abstract summary: EKA-EVAL is a unified evaluation framework that integrates more than 35 benchmarks across nine major evaluation categories. It provides 11 core capabilities through a modular architecture, seamless integration with Hugging Face and proprietary models, and plug-and-play usability. The framework is open-source and publicly available at: https://github.com/lingo-iitgn/eka-eval.
- Score: 1.1957520154275776
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that address the requirements of linguistically diverse regions, such as India, and go beyond English-centric benchmarks. We introduce EKA-EVAL, a unified evaluation framework that integrates more than 35 benchmarks (including 10 Indic benchmarks) across nine major evaluation categories. The framework provides broader coverage than existing Indian language evaluation tools, offering 11 core capabilities through a modular architecture, seamless integration with Hugging Face and proprietary models, and plug-and-play usability. As the first end-to-end suite for scalable, multilingual LLM benchmarking, the framework combines extensive benchmarks, modular workflows, and dedicated support for low-resource Indian languages to enable inclusive assessment of LLM capabilities across diverse domains. We conducted extensive comparisons against five existing baselines, demonstrating that EKA-EVAL achieves the highest participant ratings in four out of five categories. The framework is open-source and publicly available at: https://github.com/lingo-iitgn/eka-eval.
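The repository linked above documents the framework's actual interface. As a rough sketch of the Hugging Face "plug-and-play" pattern the abstract describes, the snippet below scores a causal LM on a single multiple-choice Indic QA item by comparing the total log-likelihood of each candidate answer. The model name, the sample question, and the `choice_logprob` helper are illustrative placeholders, not part of the eka-eval API.

```python
# Hypothetical sketch of multiple-choice scoring with a Hugging Face causal LM,
# in the spirit of the plug-and-play evaluation described above.
# Model name, question, and helper are placeholders -- not the eka-eval API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; swap in any Indic-capable causal LM from the Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # The token at position i is predicted by the logits at position i - 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = prompt_len - 1  # first continuation token (boundary is approximate)
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()

# "What is the capital of India?" with four candidate answers (Hindi).
question = "भारत की राजधानी क्या है?\nAnswer: "
choices = ["नई दिल्ली", "मुंबई", "कोलकाता", "चेन्नई"]
scores = [choice_logprob(question, c) for c in choices]
print("Predicted:", choices[scores.index(max(scores))])
```

A full harness would additionally batch items, aggregate accuracy over an entire benchmark, and route requests to proprietary-model backends; see the repository above for the supported workflow.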
Related papers
- PARAM-1 BharatGen 2.9B Model [14.552007884700618]
PARAM-1 is a 2.9B-parameter, decoder-only, text-only language model trained from scratch with an explicit architectural and linguistic focus on Indian diversity. It is guided by three core principles: equitable representation of Indic languages through a 25% corpus allocation; tokenization fairness via a SentencePiece tokenizer adapted to Indian morphological structures; and culturally aligned evaluation benchmarks across IndicQA, code-mixed reasoning, and socio-linguistic robustness tasks.
arXiv Detail & Related papers (2025-07-16T06:14:33Z) - Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z) - IberBench: LLM Evaluation on Iberian Languages [2.3034630097498883]
Large Language Models (LLMs) are difficult to evaluate comprehensively, particularly for languages other than English. We present IberBench, a benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations.
arXiv Detail & Related papers (2025-04-23T17:48:25Z) - WritingBench: A Comprehensive Benchmark for Generative Writing [87.48445972563631]
We present WritingBench, a benchmark designed to evaluate large language models (LLMs) across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. We propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format, and length.
arXiv Detail & Related papers (2025-03-07T08:56:20Z) - MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation [13.440594349043916]
We develop MEMERAG, a Multilingual End-to-end Meta-Evaluation RAG benchmark. Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs.
arXiv Detail & Related papers (2025-02-24T13:58:42Z) - MMTEB: Massive Multilingual Text Embedding Benchmark [85.18187649328792]
We introduce the Massive Multilingual Text Embedding Benchmark (MMTEB). MMTEB covers over 500 quality-controlled evaluation tasks across 250+ languages. We develop several highly multilingual benchmarks, which we use to evaluate a representative set of models.
arXiv Detail & Related papers (2025-02-19T10:13:43Z) - Analysis of Indic Language Capabilities in LLMs [0.3599866690398789]
This report evaluates the ability of text-in, text-out Large Language Models (LLMs) to understand and generate Indic languages. Hindi is the most widely represented language in these models. While model performance roughly correlates with the number of speakers for the top five languages, the picture varies beyond that.
arXiv Detail & Related papers (2025-01-23T18:49:33Z) - L3Cube-IndicQuest: A Benchmark Question Answering Dataset for Evaluating Knowledge of LLMs in Indic Context [0.4194295877935868]
We present L3Cube-IndicQuest, a gold-standard factual question-answering benchmark dataset.
The dataset contains 200 question-answer pairs each for English and 19 Indic languages, covering five domains specific to the Indic region.
arXiv Detail & Related papers (2024-09-13T10:48:35Z) - Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z) - OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z) - Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite [17.764840326809797]
We propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate language models in Traditional Chinese.
These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding.
In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks.
arXiv Detail & Related papers (2023-09-15T14:52:23Z) - Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR [14.15737970309719]
We show that IndicWhisper significantly improves over the ASR systems considered on the Vistaar benchmark.
IndicWhisper has the lowest WER in 39 out of the 59 benchmarks, with an average reduction of 4.1 WER.
We open-source all datasets, code and models.
arXiv Detail & Related papers (2023-05-24T17:46:03Z) - SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z) - Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z) - CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.