Related papers: Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

URL: http://arxiv.org/abs/2309.07462v2
Date: Tue, 13 Feb 2024 09:10:29 GMT
Title: Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?
Authors: Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram
Abstract summary: Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks. Their evaluation, particularly in languages beyond the top $20$, remains inadequate due to existing benchmarks and metrics limitations.
Score: 20.476500441734427
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top $20$, remains inadequate due to existing benchmarks and metrics limitations. Employing LLMs as evaluators to rank or score other models' outputs emerges as a viable solution, addressing the constraints tied to human annotators and established benchmarks. In this study, we explore the potential of LLM-based evaluators, specifically GPT-4 in enhancing multilingual evaluation by calibrating them against $20$K human judgments across three text-generation tasks, five metrics, and eight languages. Our analysis reveals a bias in GPT4-based evaluators towards higher scores, underscoring the necessity of calibration with native speaker judgments, especially in low-resource and non-Latin script languages, to ensure accurate evaluation of LLM performance across diverse languages.

Related papers

PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z)
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [60.52580061637301]
MMLU-ProX is a comprehensive benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. We evaluate 25 state-of-the-art large language models (LLMs) using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili.
arXiv Detail & Related papers (2025-03-13T15:59:20Z)
Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators [38.681443695708786]
This study provides a comprehensive analysis of the multilingual evaluation performance of 10 recent LLMs. We found that excluding the reference answer from the prompt leads to better performance across various languages. Most LLM-based evaluators show a higher correlation with human judgments in high-resource languages than in low-resource languages.
arXiv Detail & Related papers (2025-03-06T12:04:29Z)
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models [3.961168847961322]
Large language models (LLMs) are commonly used as evaluators in tasks, where they act as proxies for human preferences or judgments. Existing benchmarks primarily focus on English, offering limited insight into LLMs' effectiveness as evaluators in non-English contexts. We introduce MM-Eval, a multilingual meta-evaluation benchmark that covers 18 languages across six categories.
arXiv Detail & Related papers (2024-10-23T06:04:55Z)
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs [36.30321941154582]
Hercule is a cross-lingual evaluation model that learns to assign scores to responses based on easily available reference answers in English. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment.
arXiv Detail & Related papers (2024-10-17T09:45:32Z)
How Does Quantization Affect Multilingual LLMs? [50.867324914368524]
Quantization techniques are widely used to improve inference speed and deployment of large language models. We conduct a thorough analysis of quantized multilingual LLMs, focusing on performance across languages and at varying scales.
arXiv Detail & Related papers (2024-07-03T15:39:40Z)
PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data [12.852628521840542]
We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations. We find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages.
arXiv Detail & Related papers (2024-06-21T11:00:38Z)
Quantifying Multilingual Performance of Large Language Models Across Languages [48.40607157158246]
Large Language Models (LLMs) perform better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate. We propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations. Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores.
arXiv Detail & Related papers (2024-04-17T16:53:16Z)
METAL: Towards Multilingual Meta-Evaluation [12.852595634767901]
This study proposes a framework for an end-to-end assessment of Large Language Models (LLMs) as evaluators in multilingual scenarios. We create a dataset covering 10 languages containing native speaker judgments for the task of summarization. We compare the performance of LLM-based evaluators created using GPT-3.5-Turbo, GPT-4, and PaLM2.
arXiv Detail & Related papers (2024-04-02T06:14:54Z)
Large Language Models Are State-of-the-Art Evaluator for Grammatical Error Correction [14.822205658480813]
Large Language Models (LLMs) have been reported to outperform existing automatic evaluation metrics in some tasks. This study investigates the performance of LLMs in grammatical error correction (GEC) evaluation by employing prompts inspired by previous research.
arXiv Detail & Related papers (2024-03-26T09:43:15Z)
OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages. For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs. Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar)
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances. To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans.<n>This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning)<n>We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars. We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.