One ruler to measure them all: Benchmarking multilingual long-context language models
- URL: http://arxiv.org/abs/2503.01996v1
- Date: Mon, 03 Mar 2025 19:12:48 GMT
- Title: One ruler to measure them all: Benchmarking multilingual long-context language models
- Authors: Yekyung Kim, Jenna Russell, Marzena Karpinska, Mohit Iyyer
- Abstract summary: We present ONERULER, a multilingual benchmark designed to evaluate long-context language models across 26 languages. English is not the top-performing language on long-context tasks (ranked 6th out of 26), with Polish emerging as the top language. In cross-lingual scenarios where instructions and context appear in different languages, performance can fluctuate by up to 20% depending on the instruction language.
- Score: 35.75388430206553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ONERULER, a multilingual benchmark designed to evaluate long-context language models across 26 languages. ONERULER adapts the English-only RULER benchmark (Hsieh et al., 2024) by including seven synthetic tasks that test both retrieval and aggregation, including new variations of the "needle-in-a-haystack" task that allow for the possibility of a nonexistent needle. We create ONERULER through a two-step process, first writing English instructions for each task and then collaborating with native speakers to translate them into 25 additional languages. Experiments with both open-weight and closed LLMs reveal a widening performance gap between low- and high-resource languages as context length increases from 8K to 128K tokens. Surprisingly, English is not the top-performing language on long-context tasks (ranked 6th out of 26), with Polish emerging as the top language. Our experiments also show that many LLMs (particularly OpenAI's o3-mini-high) incorrectly predict the absence of an answer, even in high-resource languages. Finally, in cross-lingual scenarios where instructions and context appear in different languages, performance can fluctuate by up to 20% depending on the instruction language. We hope the release of ONERULER will facilitate future research into improving multilingual and cross-lingual long-context training pipelines.
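To make the task design concrete, here is a minimal sketch of a "needle-in-a-haystack" instance that allows for a nonexistent needle, in the spirit of the ONERULER variants described above. The filler text, needle wording, and scoring rule are illustrative assumptions, not the benchmark's actual materials.

```python
import random

# Minimal sketch of a needle-in-a-haystack instance where the needle
# may be absent. Names (FILLER, make_instance, score) are illustrative.

FILLER = "The grass is green. The sky is blue. The sun is warm."

def make_instance(context_tokens: int, needle_present: bool, seed: int = 0):
    rng = random.Random(seed)
    needle = "The special magic number for the city of Amsterdam is 42317."
    # Roughly size the haystack by repeating filler text; token count is
    # approximated here by whitespace word count.
    filler_words = FILLER.split()
    words = filler_words * (context_tokens // len(filler_words) + 1)
    words = words[:context_tokens]
    if needle_present:
        pos = rng.randrange(len(words))       # bury the needle at a random depth
        words[pos:pos] = needle.split()
    context = " ".join(words)
    question = ("What is the special magic number for Amsterdam? "
                "If no such number appears in the text, answer 'none'.")
    gold = "42317" if needle_present else "none"
    return context, question, gold

def score(prediction: str, gold: str) -> bool:
    # Credit the model if the gold answer appears in its prediction;
    # for absent needles, the model must explicitly say 'none'.
    return gold.lower() in prediction.lower()
```

Scoring the absent-needle case separately is what surfaces the failure mode the abstract reports, where models such as o3-mini-high incorrectly predict that no answer exists.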
Related papers
- CoCo-CoLa: Evaluating Language Adherence in Multilingual LLMs [1.2057938662974816]
Large Language Models (LLMs) develop cross-lingual abilities despite being trained on limited parallel data.
We introduce CoCo-CoLa, a novel metric to evaluate language adherence in multilingual LLMs (an illustrative adherence score is sketched after this entry).
arXiv Detail & Related papers (2025-02-18T03:03:53Z)
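CoCo-CoLa's actual metric is defined in the paper; as a rough illustration of what measuring "language adherence" can look like, here is a naive sentence-level score. It assumes the third-party langdetect package (pip install langdetect); nothing below is from the paper's code.

```python
# Naive illustration of a "language adherence" score: the fraction of a
# model's output sentences written in the intended target language.
# This is NOT the CoCo-CoLa metric, just a hedged stand-in.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect deterministic

def language_adherence(output: str, target_lang: str) -> float:
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    if not sentences:
        return 0.0
    hits = 0
    for sentence in sentences:
        try:
            hits += detect(sentence) == target_lang
        except Exception:  # langdetect raises on very short/ambiguous text
            pass
    return hits / len(sentences)

# e.g. a Polish prompt answered partly in English scores below 1.0:
# language_adherence("Stolica Polski to Warszawa. The capital is Warsaw.", "pl")
```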
- mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models.
We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance.
We see strong cross-lingual performance with English-based retrievers trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z)
- LinguaLIFT: An Effective Two-stage Instruction Tuning Framework for Low-Resource Language Reasoning [28.288949710191158]
Large language models (LLMs) have exhibited impressive multilingual reasoning capabilities, driven by extensive multilingual pre-training corpora and instruction fine-tuning data.
A performance gap exists between high- and low-resource language reasoning tasks due to the language imbalance in the pre-training corpus.
We propose LinguaLIFT, a two-stage instruction tuning framework for advancing low-resource language reasoning.
arXiv Detail & Related papers (2024-12-17T03:03:17Z)
- Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models [22.859955360764275]
We introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test to assess a model's ability to retrieve relevant information from long multilingual contexts.
We evaluate four state-of-the-art large language models on MLNeedle.
arXiv Detail & Related papers (2024-08-19T17:02:06Z)
- INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages [25.402797722575805]
Indic QA Benchmark is a dataset for context-grounded question answering in 11 major Indian languages.
Evaluations revealed weak performance in low-resource languages due to a strong English-language bias in the models' training data.
We also investigated the Translate Test paradigm, where inputs are translated to English for processing and the results are translated back into the source language for output (see the sketch after this entry).
arXiv Detail & Related papers (2024-07-18T13:57:16Z)
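For reference, the Translate Test paradigm mentioned above can be expressed as a three-step pipeline. The sketch below takes the MT system and the English QA model as callables; both are hypothetical placeholders, not the benchmark's code.

```python
# Sketch of the translate-test paradigm: translate the input to English,
# answer in English (where the model is strongest), translate back.
from typing import Callable

def translate_test(question: str, context: str, source_lang: str,
                   translate: Callable[[str, str, str], str],
                   answer_in_english: Callable[[str, str], str]) -> str:
    # 1) source language -> English
    q_en = translate(question, source_lang, "en")
    c_en = translate(context, source_lang, "en")
    # 2) answer the question over the translated context
    a_en = answer_in_english(q_en, c_en)
    # 3) English -> source language for the final output
    return translate(a_en, "en", source_lang)
```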
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, i.e., be crosslingual?
This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
- Revisiting non-English Text Simplification: A Unified Multilingual Benchmark [14.891068432456262]
This paper introduces the MultiSim benchmark, a collection of 27 resources in 12 distinct languages containing over 1.7 million complex-simple sentence pairs.
Our experiments using MultiSim with pre-trained multilingual language models reveal exciting performance improvements from multilingual training in non-English settings.
arXiv Detail & Related papers (2023-05-25T03:03:29Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel and large-scale multilingual conversation dataset, for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
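As a rough illustration of the prompt-tuning idea in the last entry, here is a minimal PyTorch sketch in which a small trainable prompt matrix is prepended to a frozen model's input embeddings, so only the prompt parameters are optimized. Shapes, initialization, and hyperparameters are assumptions, not details from the XSGD paper.

```python
# Minimal soft prompt-tuning sketch: trainable "alignment prompt"
# embeddings are concatenated in front of the frozen LM's input embeddings.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, prompt_len: int, embed_dim: int):
        super().__init__()
        # Small random init; only these parameters are trained.
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Training outline: freeze the base model, optimize only the prompt.
# for p in base_model.parameters(): p.requires_grad_(False)
# optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)
```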
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.