MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages
- URL: http://arxiv.org/abs/2509.04111v2
- Date: Fri, 05 Sep 2025 09:12:03 GMT
- Title: MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages
- Authors: Dan Saattrup Smart
- Abstract summary: We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and the answers appearing verbatim in the Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and the answers appearing verbatim in the Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. The dataset and survey evaluations are freely available.
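Because the answers appear verbatim in the contexts, the dataset can be consumed with standard extractive-QA tooling. Below is a minimal sketch of loading one language split and checking the verbatim-answer property; the Hugging Face dataset id and the SQuAD-style field names are assumptions, not confirmed by the abstract.

```python
from datasets import load_dataset

# Sketch only: the dataset id and the SQuAD-style schema ("context",
# "question", "answers") are assumptions; adjust to the released dataset.
ds = load_dataset("alexandrainst/multi-wiki-qa", "en", split="train")

# The abstract states that every answer appears verbatim in its Wikipedia
# context, so a plain substring check should hold for each example.
for example in ds.select(range(100)):
    for answer in example["answers"]["text"]:
        assert answer in example["context"]
```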
Related papers
- MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages [17.175361236651906]
We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. We compute MultiLoKo scores for 11 base and chat models marketed to be multilingual and study their average performance. We find that using local vs English-translated data can result in differences of more than 20 points for the best performing models.
arXiv Detail & Related papers (2025-04-14T16:05:59Z)
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.1875460416205]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
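Pass@1 here is presumably the standard unbiased pass@k estimator of Chen et al. (2021), computed from n sampled solutions per task of which c pass the tests; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator (Chen et al., 2021): probability that at
    # least one of k samples drawn from n generations passes, given that
    # c of the n generations pass the tests.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Pass@1 reduces to the fraction of passing generations, c / n.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```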
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- An Open Multilingual System for Scoring Readability of Wikipedia [3.992677070507323]
We develop a multilingual model to score the readability of Wikipedia articles.
We create a novel multilingual dataset spanning 14 languages by matching articles from Wikipedia to simplified Wikipedia and online children's encyclopedias.
We show that our model performs well in a zero-shot scenario, yielding a ranking accuracy of more than 80% across 14 languages.
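Ranking accuracy in this zero-shot setting can be read as pairwise: for each matched article pair, the model should score the simplified version as more readable than the original. A hedged sketch, with illustrative function names not taken from the paper:

```python
def ranking_accuracy(pairs, readability_score):
    # pairs: iterable of (original_text, simplified_text) matched articles.
    # readability_score: model returning a higher score for easier text
    # (an assumption about the score's direction, not specified here).
    correct = sum(
        readability_score(simplified) > readability_score(original)
        for original, simplified in pairs
    )
    return correct / len(pairs)
```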
arXiv Detail & Related papers (2024-06-03T23:07:18Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- Python Code Generation by Asking Clarification Questions [57.63906360576212]
In this work, we introduce a novel and more realistic setup for this task.
We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions.
We collect and introduce a new dataset named CodeClarQA, containing pairs of natural language descriptions and code, together with synthetic clarification questions and answers.
arXiv Detail & Related papers (2022-12-19T22:08:36Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
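A cloze-style probe fills a masked slot in a fact-bearing template; a minimal sketch using the Hugging Face fill-mask pipeline, where the model id and template are illustrative examples rather than the paper's own probes:

```python
from transformers import pipeline

# Illustrative cloze probe: a multilingual masked LM fills the blank in a
# fact-bearing template. Model id and template are examples, not X-FACTR's.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
predictions = fill_mask("Paris is the capital of [MASK].")
print(predictions[0]["token_str"])  # top prediction, e.g. "France"
```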
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
- MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering [6.452012363895865]
This dataset supplies the widest range of languages to date for evaluating question answering.
We benchmark a variety of state-of-the-art methods and baselines for generative and extractive question answering.
Results indicate this dataset is challenging even in English, but especially in low-resource languages.
arXiv Detail & Related papers (2020-07-30T03:33:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.