EXAMS: A Multi-Subject High School Examinations Dataset for
Cross-Lingual and Multilingual Question Answering
- URL: http://arxiv.org/abs/2011.03080v1
- Date: Thu, 5 Nov 2020 20:06:50 GMT
- Title: EXAMS: A Multi-Subject High School Examinations Dataset for
Cross-Lingual and Multilingual Question Answering
- Authors: Momchil Hardalov, Todor Mihaylov, Dimitrina Zlatkova, Yoan Dinkov,
Ivan Koychev, Preslav Nakov
- Abstract summary: EXAMS is a new benchmark dataset for cross-lingual and multilingual question answering for high school examinations.
We collected more than 24,000 high-quality high school exam questions in 16 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences.
- Score: 22.926709247193724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose EXAMS -- a new benchmark dataset for cross-lingual and
multilingual question answering for high school examinations. We collected more
than 24,000 high-quality high school exam questions in 16 languages, covering 8
language families and 24 school subjects from Natural Sciences and Social
Sciences, among others.
EXAMS offers a fine-grained evaluation framework across multiple languages
and subjects, which allows precise analysis and comparison of various models.
We perform various experiments with existing top-performing multilingual
pre-trained models and we show that EXAMS offers multiple challenges that
require multilingual knowledge and reasoning in multiple domains. We hope that
EXAMS will enable researchers to explore challenging reasoning and knowledge
transfer methods and pre-trained models for school question answering in
various languages, which was not possible before. The data, code, pre-trained
models, and evaluation are available at https://github.com/mhardalov/exams-qa.
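For context, EXAMS is a multiple-choice task: each item gives a question and several candidate answers in the exam's original language, and a model must pick one. The sketch below is illustrative only and is not the authors' released code; it assumes xlm-roberta-base as a stand-in for the multilingual pre-trained models mentioned in the abstract, uses the Hugging Face transformers multiple-choice head (randomly initialized here, so it would need fine-tuning on the EXAMS training split before its predictions mean anything), and the Bulgarian question is a made-up example in the dataset's general format.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

# Assumption: xlm-roberta-base stands in for the "top-performing multilingual
# pre-trained models" from the abstract. The multiple-choice head is randomly
# initialized here and must be fine-tuned on EXAMS data before real use.
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)
model.eval()

# Hypothetical EXAMS-style item (Bulgarian): "Which of these planets is a gas giant?"
question = "Коя от изброените планети е газов гигант?"
choices = ["Марс", "Юпитер", "Венера", "Меркурий"]

# Encode one (question, choice) pair per candidate answer.
enc = tokenizer([question] * len(choices), choices,
                padding=True, truncation=True, return_tensors="pt")
# The multiple-choice head expects tensors of shape (batch, num_choices, seq_len).
batch = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**batch).logits  # shape: (1, num_choices)
predicted = choices[logits.argmax(dim=-1).item()]
print(predicted)
```

The fine-grained evaluation described in the abstract would then amount to grouping such per-question predictions by the language and subject metadata of the test items and reporting accuracy per group.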
Related papers
- CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data [31.324617466692754]
CJEval is a benchmark based on Chinese Junior High School Exam Evaluations.
It consists of 26,136 samples across four application-level educational tasks covering ten subjects.
By utilizing this benchmark, we assessed LLMs' potential applications and conducted a comprehensive analysis of their performance.
arXiv Detail & Related papers (2024-09-24T16:00:28Z)
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice formats.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models [29.31649801849329]
EXAMS-V is a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models.
It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies.
The questions come in 11 languages from 7 language families.
arXiv Detail & Related papers (2024-03-15T15:08:39Z)
- Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond [89.54151859266202]
The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework.
The challenge garnered 12 model submissions and 54 language corpora, resulting in a comprehensive benchmark encompassing 154 languages.
The findings indicate that merely scaling models is not the definitive solution for multilingual speech tasks.
arXiv Detail & Related papers (2023-10-09T08:30:01Z)
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models [76.88692952308084]
M3Exam is a benchmark for evaluating large language models (LLMs) in a multilingual, multimodal, and multilevel context.
M3Exam contains 12,317 questions in 9 diverse languages with three educational levels.
We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text.
arXiv Detail & Related papers (2023-06-08T13:21:29Z)
- M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models [35.17226595231825]
M3KE is a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark.
It is developed to measure knowledge acquired by Chinese large language models.
We have collected 20,477 questions from 71 tasks.
arXiv Detail & Related papers (2023-05-17T14:56:31Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- Delving Deeper into Cross-lingual Visual Question Answering [115.16614806717341]
We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance.
We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers.
arXiv Detail & Related papers (2022-02-15T18:22:18Z)
- MFAQ: a Multilingual FAQ Dataset [9.625301186732598]
We present the first publicly available multilingual FAQ dataset.
We collected around 6M FAQ pairs from the web, in 21 different languages.
We adopt a similar setup as Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset.
arXiv Detail & Related papers (2021-09-27T08:43:25Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME (the Cross-lingual TRansfer Evaluation of Multilingual Encoders benchmark) evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)