Related papers: MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages

URL: http://arxiv.org/abs/2504.10356v2
Date: Tue, 15 Apr 2025 15:02:53 GMT
Title: MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages
Authors: Dieuwke Hupkes, Nikolay Bogoychev,
Abstract summary: We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. We compute MultiLoKo scores for 11 base and chat models marketed to be multilingual and study their average performance. We find that using local vs English-translated data can result in differences of more than 20 points for the best performing models.
Score: 17.175361236651906
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. MultiLoKo consists of three partitions: a main partition consisting of 500 questions per language, separately sourced to be locally relevant to the specific language, and two translated partitions, containing human-authored translations from 30 non-English languages to English and vice versa. For comparison, we also release corresponding machine-authored translations. The data is equally distributed over two splits: a dev split and a blind, out-of-distribution test split. MultiLoKo can be used to study a variety of questions regarding the multilinguality of LLMs as well as meta-questions about multilingual benchmark creation. We compute MultiLoKo scores for 11 base and chat models marketed to be multilingual and study their average performance, their performance parity across languages, how much their ability to answer questions depends on the question language, and which languages are most difficult. None of the models we studied performs well on MultiLoKo, as indicated by low average scores as well as large differences between the best and worst scoring languages. Furthermore, we find a substantial effect of the question language, indicating sub-optimal knowledge transfer between languages. Lastly, we find that using local vs English-translated data can result in differences of more than 20 points for the best performing models and can drastically change the estimated difficulty of some languages. When using machine instead of human translations, we find a weaker effect on the ordering of language difficulty, a larger difference in model rankings, and a substantial drop in estimated performance for all models.
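
Illustration: the abstract reports low average scores and large differences between the best and worst scoring languages. The sketch below shows one simple way to compute those two quantities from per-language scores. The function name `summarize`, the score mapping, and the numbers are hypothetical and are not taken from the paper, whose exact metric definitions may differ.

```python
from statistics import mean


def summarize(scores_by_language):
    """Return (average, gap) for a {language: score} mapping.

    average: mean score over all languages.
    gap: best-language score minus worst-language score; a smaller gap
         means more even performance across languages.
    """
    values = list(scores_by_language.values())
    average = mean(values)
    gap = max(values) - min(values)
    return average, gap


# Made-up scores for illustration only; these are NOT results from the paper.
example_scores = {"english": 60.0, "dutch": 50.0, "hindi": 40.0}
average, gap = summarize(example_scores)
print(f"average={average:.1f} gap={gap:.1f}")
```

Reporting both numbers together matters because a model can have a reasonable average while still performing very unevenly across languages.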

Related papers

Do Multilingual LLMs have specialized language heads? [0.571097144710995]
This paper explores whether multilingual LLMs have specialized language attention heads for each language. It investigates the possibility of removing language-specific heads for unwanted languages without degrading performance in the targeted languages.
arXiv Detail & Related papers (2026-02-09T13:15:17Z)
Multi-lingual Functional Evaluation for Large Language Models [4.18267450389965]
We create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM) and Cross-Lingual Instruction-Following Eval (CL-IFEval). We find that some static multi-lingual benchmarks capture functional performance much more closely than others. Certain languages (e.g. Arabic, English) are the most consistently well performing across evaluations.
arXiv Detail & Related papers (2025-06-25T19:32:31Z)
Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate [36.641755706551336]
Large language models (LLMs) provide detailed and impressive responses to queries in English. But are they really consistent at responding to the same query in other languages? We propose a framework to evaluate LLMs' cross-lingual consistency based on a simple Translate then Evaluate strategy.
arXiv Detail & Related papers (2025-05-28T06:00:21Z)
Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune? [0.0]
This study proposes a method to select languages for instruction tuning in a linguistically informed way. We use a simple algorithm to choose diverse languages and test their effectiveness on various benchmarks and open-ended questions. Our results show that this careful selection generally leads to better outcomes than choosing languages at random.
arXiv Detail & Related papers (2024-10-10T10:57:24Z)
Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
Language Ranker: A Metric for Quantifying LLM Performance Across High and Low-Resource Languages [48.40607157158246]
Large Language Models (LLMs) perform better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate. We propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations. Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores.
arXiv Detail & Related papers (2024-04-17T16:53:16Z)
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants. This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM [8.858671209228536]
We focus on BLOOM's multilingual ability by evaluating its machine translation performance across several datasets. We study several aspects including prompt design, model sizes, cross-lingual transfer and the use of discursive context.
arXiv Detail & Related papers (2023-03-03T13:23:42Z)
Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages. We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC). LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language. We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models into a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages. We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.