IberBench: LLM Evaluation on Iberian Languages
- URL: http://arxiv.org/abs/2504.16921v1
- Date: Wed, 23 Apr 2025 17:48:25 GMT
- Title: IberBench: LLM Evaluation on Iberian Languages
- Authors: José Ángel González, Ian Borrego Obrador, Álvaro Romo Herrero, Areg Mikael Sarvazyan, Mara Chinea-Ríos, Angelo Basile, Marc Franco-Salvador,
- Abstract summary: Large Language Models (LLMs) are difficult to evaluate comprehensively, particularly for languages other than English.<n>We present IberBench, a benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks.<n>We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations.
- Score: 2.3034630097498883
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) remain difficult to evaluate comprehensively, particularly for languages other than English, where high-quality data is often limited. Existing benchmarks and leaderboards are predominantly English-centric, with only a few addressing other languages. These benchmarks fall short in several key areas: they overlook the diversity of language varieties, prioritize fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static. With these aspects in mind, we present IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks, in languages spoken across the Iberian Peninsula and Ibero-America. IberBench integrates 101 datasets from evaluation campaigns and recent benchmarks, covering 22 task categories such as sentiment and emotion analysis, toxicity detection, and summarization. The benchmark addresses key limitations in current evaluation practices, such as the lack of linguistic diversity and static evaluation setups by enabling continual updates and community-driven model and dataset submissions moderated by a committee of experts. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations. Our findings indicate that (i) LLMs perform worse on industry-relevant tasks than in fundamental ones, (ii) performance is on average lower for Galician and Basque, (iii) some tasks show results close to random, and (iv) in other tasks LLMs perform above random but below shared task systems. IberBench offers open-source implementations for the entire evaluation pipeline, including dataset normalization and hosting, incremental evaluation of LLMs, and a publicly accessible leaderboard.
Related papers
- MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating code for the latest code generation LLMs in Russian.<n>This benchmark includes 11 evaluation tasks that span 8 programming languages.<n>We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z) - MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators [8.672875654352689]
This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating dialogue evaluation benchmarks.<n>We generate user-chatbot multilingual dialogues conditioned on varied seed contexts.<n>A strong LLM is then used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences.<n>A benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues.
arXiv Detail & Related papers (2025-05-28T18:45:42Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs)<n>Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability.<n>Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z) - MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language [16.21019515431378]
We propose MUG-Eval, a novel framework that evaluates large language models' multilingual generation capabilities.<n>We transform existing benchmarks into conversational tasks and measure the LLMs' accuracies on those tasks.<n>We evaluate 8 LLMs across 30 languages spanning high, mid, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks.
arXiv Detail & Related papers (2025-05-20T14:14:00Z) - PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels.
Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z) - MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [86.7047714187813]
MMLU-ProX is a benchmark covering 29 languages, built on an English benchmark.<n>Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons.<n>To meet efficient evaluation needs, we provide a lite version containing 658 questions per language.
arXiv Detail & Related papers (2025-03-13T15:59:20Z) - Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators [38.681443695708786]
This study provides a comprehensive analysis of the multilingual evaluation performance of 10 recent LLMs.
We found that excluding the reference answer from the prompt leads to better performance across various languages.
Most LLM-based evaluators show a higher correlation with human judgments in high-resource languages than in low-resource languages.
arXiv Detail & Related papers (2025-03-06T12:04:29Z) - EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents.
We evaluated 19 leading proprietary and open-source MLLMs within EmbodiedBench.
MLLMs excel at high-level tasks but struggle with low-level manipulation.
arXiv Detail & Related papers (2025-02-13T18:11:34Z) - Evalita-LLM: Benchmarking Large Language Models on Italian [3.3334839725239798]
Evalita-LLM is a benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks.
All tasks are native Italian, avoiding issues of translating from Italian and potential cultural biases.
The benchmark includes generative tasks, enabling more natural interaction with LLMs.
arXiv Detail & Related papers (2025-02-04T12:58:19Z) - P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding [15.93642619347214]
We introduce proverbeval, LLM evaluation benchmark for low-resource languages.<n>Native language proverb descriptions significantly improve tasks such as proverb generation.<n> monolingual evaluations consistently outperformed their cross-lingual counterparts in generation tasks.
arXiv Detail & Related papers (2024-11-07T06:34:48Z) - Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions.
Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks.
languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z) - Language Ranker: A Metric for Quantifying LLM Performance Across High and Low-Resource Languages [48.40607157158246]
Large Language Models (LLMs) perform better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate.<n>We propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations.<n>Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores.
arXiv Detail & Related papers (2024-04-17T16:53:16Z) - CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z) - Are Large Language Model-based Evaluators the Solution to Scaling Up
Multilingual Evaluation? [20.476500441734427]
Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks.
Their evaluation, particularly in languages beyond the top $20$, remains inadequate due to existing benchmarks and metrics limitations.
arXiv Detail & Related papers (2023-09-14T06:41:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.