MRCEval: A Comprehensive, Challenging and Accessible Machine Reading Comprehension Benchmark
- URL: http://arxiv.org/abs/2503.07144v1
- Date: Mon, 10 Mar 2025 10:20:05 GMT
- Title: MRCEval: A Comprehensive, Challenging and Accessible Machine Reading Comprehension Benchmark
- Authors: Shengkun Ma, Hao Peng, Lei Hou, Juanzi Li
- Abstract summary: We introduce a novel taxonomy that categorizes the key capabilities required for reading comprehension (RC). Based on this taxonomy, we construct MRCEval, an MRC benchmark that leverages advanced Large Language Models (LLMs) as sample generators and selection judges. MRCEval is a comprehensive, challenging and accessible benchmark, covering 13 distinct RC skills with a total of 2.1K high-quality multi-choice questions.
- Score: 51.73839215956791
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Reading Comprehension (MRC) is an essential task in evaluating natural language understanding. Existing MRC datasets primarily assess specific aspects of reading comprehension (RC), lacking a comprehensive MRC benchmark. To fill this gap, we first introduce a novel taxonomy that categorizes the key capabilities required for RC. Based on this taxonomy, we construct MRCEval, an MRC benchmark that leverages advanced Large Language Models (LLMs) as both sample generators and selection judges. MRCEval is a comprehensive, challenging and accessible benchmark designed to assess the RC capabilities of LLMs thoroughly, covering 13 distinct RC skills with a total of 2.1K high-quality multi-choice questions. We perform an extensive evaluation of 28 widely used open-source and proprietary models, highlighting that MRC continues to present significant challenges even in the era of LLMs.
Related papers
- CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, 41% of which are sourced from real applications and released for the first time. We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation text, and hallucination of repetition.
arXiv Detail & Related papers (2024-12-03T07:03:25Z)
- KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA)
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use Wikipedia, a corpus on which LLMs are prevalently pre-trained, along with continuously collected emerging corpora, to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)
- Lite Unified Modeling for Discriminative Reading Comprehension [68.39862736200045]
We propose a lightweight POS-Enhanced Iterative Co-Attention Network (POI-Net) to handle diverse discriminative MRC tasks synchronously.
Our lite unified design brings significant improvement to the model in both the encoder and decoder components.
The evaluation results on four discriminative MRC benchmarks consistently indicate the general effectiveness and applicability of our model.
arXiv Detail & Related papers (2022-03-26T15:47:19Z)
- ExpMRC: Explainability Evaluation for Machine Reading Comprehension [42.483940360860096]
We propose a new benchmark called ExpMRC for evaluating the explainability of Machine Reading Comprehension systems.
We use state-of-the-art pre-trained language models to build baseline systems and adopt various unsupervised approaches to extract evidence without a human-annotated training set.
arXiv Detail & Related papers (2021-05-10T06:00:20Z)
- Reference Knowledgeable Network for Machine Reading Comprehension [43.352833140317486]
Multi-choice Machine Reading Comprehension (MRC) is a major and challenging form of MRC tasks.
We propose a novel reference-based knowledge enhancement model based on span extraction called Reference Knowledgeable Network (RekNet).
In detail, RekNet refines fine-grained critical information and defines it as Reference Span, then quotes external knowledge quadruples by the co-occurrence information of Reference Span and answer options.
arXiv Detail & Related papers (2020-12-07T14:11:33Z)
- A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics and Benchmark Datasets [5.54205518616467]
Machine Reading Comprehension (MRC) is a challenging Natural Language Processing (NLP) research field with wide real-world applications.
A lot of MRC models have already surpassed human performance on various benchmark datasets.
This shows the need for improving existing datasets, evaluation metrics, and models to move current MRC models toward "real" understanding.
arXiv Detail & Related papers (2020-06-21T19:18:54Z)
- Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond [85.53037880415734]
Machine reading comprehension (MRC) aims to teach machines to read and comprehend human languages.
With the burst of deep neural networks and the evolution of contextualized language models (CLMs), the research of MRC has experienced two significant breakthroughs.
arXiv Detail & Related papers (2020-05-13T10:58:50Z)
- Retrospective Reader for Machine Reading Comprehension [90.6069071495214]
Machine reading comprehension (MRC) is an AI challenge that requires machines to determine the correct answers to questions based on a given passage.
When unanswerable questions are involved in the MRC task, an essential verification module, called a verifier, is required in addition to the encoder.
This paper devotes itself to exploring better verifier design for the MRC task with unanswerable questions.
arXiv Detail & Related papers (2020-01-27T11:14:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.