Related papers: MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation

MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation

URL: http://arxiv.org/abs/2506.01776v2
Date: Tue, 03 Jun 2025 02:53:48 GMT
Title: MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation
Authors: Yile Liu, Ziwei Ma, Xiu Jiang, Jinglu Hu, Jing Chang, Liang Li,
Abstract summary: MaXIFE is a comprehensive evaluation benchmark designed to assess instruction-following capabilities across 23 different languages.<n>By providing a standardized tool for multilingual instruction-following evaluation, MaXIFE aims to advance research and development in natural language processing.
Score: 7.343467302769559
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rapid adoption of large language models (LLMs) in natural language processing, the ability to follow instructions has emerged as a key metric for evaluating their practical utility. However, existing evaluation methods often focus on single-language scenarios, overlooking the challenges and differences present in multilingual and cross-lingual contexts. To address this gap, we introduce MaXIFE: a comprehensive evaluation benchmark designed to assess instruction-following capabilities across 23 different languages with 1667 verifiable instruction tasks. MaXIFE integrates both Rule-Based Evaluation and Model-Based Evaluation, ensuring a balance of efficiency and accuracy. We applied MaXIFE to evaluate several leading commercial LLMs, establishing baseline results for future comparisons. By providing a standardized tool for multilingual instruction-following evaluation, MaXIFE aims to advance research and development in natural language processing.

Related papers

Multilingual Self-Taught Faithfulness Evaluators [11.200203292660758]
Self-Taught Evaluators for Multilingual Faithfulness is a framework that learns exclusively from synthetic multilingual summarization data.<n>Our framework shows improvements over existing baselines, including state-of-the-art English evaluators and machine translation-based approaches.
arXiv Detail & Related papers (2025-07-28T12:01:59Z)
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [86.7047714187813]
MMLU-ProX is a benchmark covering 29 languages, built on an English benchmark.<n>Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons.<n>To meet efficient evaluation needs, we provide a lite version containing 658 questions per language.
arXiv Detail & Related papers (2025-03-13T15:59:20Z)
M-IFEval: Multilingual Instruction-Following Evaluation [2.624902795082451]
The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria.<n>It only includes English instructions, limiting its ability to assess LLMs in other languages.<n>We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions.
arXiv Detail & Related papers (2025-02-07T06:27:04Z)
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.<n>P-MMEval delivers consistent language coverage across various datasets and provides parallel samples.<n>We conduct extensive experiments on representative multilingual model series to compare performances across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z)
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs [36.30321941154582]
Hercule is a cross-lingual evaluation model that learns to assign scores to responses based on easily available reference answers in English. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment.
arXiv Detail & Related papers (2024-10-17T09:45:32Z)
Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark [12.729687989535359]
evaluating Large Language Models (LLMs) in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts. We tackle this challenge by introducing a structured benchmark using the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy.
arXiv Detail & Related papers (2024-06-25T13:20:08Z)
LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
MELA: Multilingual Evaluation of Linguistic Acceptability [7.524375463656369]
We present the largest benchmark to date on linguistic acceptability: Multilingual Evaluation of Linguistic Acceptability -- MELA, with 46K samples covering 10 languages. In pursuit of multilingual interpretability, we conduct probing experiments with fine-tuned XLM-R. Cross-lingual transfer experiments show that transfer in acceptability judgment is non-trivial.
arXiv Detail & Related papers (2023-11-15T15:25:28Z)
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? [20.476500441734427]
Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks. Their evaluation, particularly in languages beyond the top $20$, remains inadequate due to existing benchmarks and metrics limitations.
arXiv Detail & Related papers (2023-09-14T06:41:58Z)
AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation [93.80733419450225]
This paper analyzes the current state of cross-lingual transfer learning. We extend XTREME to XTREME-R, which consists of an improved set of ten natural language understanding tasks.
arXiv Detail & Related papers (2021-04-15T12:26:12Z)
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
Cross-lingual TRansfer Evaluation of Multilinguals XTREME is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.