MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering
- URL: http://arxiv.org/abs/2502.18993v1
- Date: Wed, 26 Feb 2025 09:56:51 GMT
- Title: MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering
- Authors: Teng Lin,
- Abstract summary: Multi-entity question answering (MEQA) presents significant challenges for large language models (LLMs). We introduce MEBench, a novel multi-document, multi-entity benchmark designed to evaluate LLMs' capacity to retrieve, consolidate, and reason over fragmented information. Our benchmark emphasizes the importance of completeness and factual precision of information extraction in MEQA tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multi-entity question answering (MEQA) presents significant challenges for large language models (LLMs) and retrieval-augmented generation (RAG) systems, which frequently struggle to consolidate scattered information across diverse documents. While existing methods excel at single-document comprehension, they often falter at cross-document aggregation, particularly when resolving entity-dense questions like "What is the distribution of ACM Fellows among various fields of study?", which require integrating entity-centric insights from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we introduce MEBench, a novel multi-document, multi-entity benchmark designed to systematically evaluate LLMs' capacity to retrieve, consolidate, and reason over fragmented information. Our benchmark comprises 4,780 questions, systematically categorized into three primary categories and further divided into eight distinct types, ensuring broad coverage of real-world multi-entity reasoning scenarios. Our experiments on state-of-the-art LLMs (e.g., GPT-4, Llama-3) and RAG pipelines reveal critical limitations: even advanced models achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance of completeness and factual precision of information extraction in MEQA tasks, using the Entity-Attributed F1 (EA-F1) metric for granular evaluation of entity-level correctness and attribution validity. MEBench not only highlights systemic weaknesses in current LLM frameworks but also provides a foundation for advancing robust, entity-aware QA architectures.
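The abstract does not spell out how Entity-Attributed F1 is computed. A minimal sketch, assuming EA-F1 reduces to an F1 score over predicted versus gold (entity, attribute) pairs, is given below; the function name, data layout, and example values are illustrative and not taken from the paper.

```python
from typing import Set, Tuple

def entity_attributed_f1(predicted: Set[Tuple[str, str]],
                         gold: Set[Tuple[str, str]]) -> float:
    """Hypothetical EA-F1: F1 over (entity, attribute) pairs.

    Assumes a prediction counts as correct only when both the entity and the
    attribute paired with it match the gold set; the paper's exact definition
    of attribution validity may differ.
    """
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example for the "distribution of ACM Fellows among fields" question;
# names and fields are invented for illustration.
gold = {("Alice Smith", "Machine Learning"), ("Bob Jones", "Databases")}
pred = {("Alice Smith", "Machine Learning"), ("Bob Jones", "Theory")}
print(entity_attributed_f1(pred, gold))  # 0.5
```

Scoring at the level of (entity, attribute) pairs penalizes both missing entities (lower recall) and wrongly attributed values (lower precision), which is what distinguishes such a metric from answer-level exact match.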
Related papers
- M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in documents.
Existing document understanding benchmarks often assess LVLMs using question-answer formats.
We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench)
M-DocSum-Bench comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z) - SRAG: Structured Retrieval-Augmented Generation for Multi-Entity Question Answering over Wikipedia Graph [10.297615455470133]
Multi-entity question answering (MEQA) poses significant challenges for large language models.
This paper introduces a structured RAG framework that organizes extracted entities into relational tables.
Experiments on Wikipedia-based multi-entity QA tasks demonstrate that SRAG significantly outperforms state-of-the-art long-context LLMs.
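SRAG's pipeline is not detailed in this summary; purely as a rough illustration of the idea of organizing extracted entities into relational tables and answering aggregation questions over them, the following sketch uses SQLite with an invented schema and values (not the paper's actual framework).

```python
import sqlite3

# Entity-attribute tuples, e.g. as an LLM might extract them from Wikipedia
# pages; the rows here are invented for illustration.
rows = [
    ("Alice Smith", "Machine Learning"),
    ("Bob Jones", "Databases"),
    ("Carol Lee", "Databases"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acm_fellows (name TEXT, field TEXT)")
conn.executemany("INSERT INTO acm_fellows VALUES (?, ?)", rows)

# "What is the distribution of ACM Fellows among various fields of study?"
# becomes a deterministic SQL aggregation instead of free-form generation.
query = """
    SELECT field, COUNT(*) AS n
    FROM acm_fellows
    GROUP BY field
    ORDER BY n DESC
"""
for field, count in conn.execute(query):
    print(field, count)
```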
arXiv Detail & Related papers (2025-03-03T09:37:33Z) - VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings.
We propose VisDoMRAG, a novel multimodal Retrieval-Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z) - TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with Scalable Context and Symbolic Extension [8.489816179329832]
We present TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities of large language models (LLMs) in tackling complex QA tasks over relational data.
Our benchmark incorporates diverse relational database instances sourced from real-world public datasets.
We systematically evaluate a range of LLMs, both open-source and closed-source, spanning model scales from 7 billion to 70 billion parameters.
arXiv Detail & Related papers (2024-11-29T06:48:13Z) - M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models [27.910693214922052]
We introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark for evaluating foundation models.
M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters.
Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents.
arXiv Detail & Related papers (2024-11-06T17:52:01Z) - Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z) - ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models [49.48109472893714]
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content, with myriad potential downstream tasks.
We present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs.
We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models.
arXiv Detail & Related papers (2023-11-05T16:01:40Z) - RethinkingTMSC: An Empirical Study for Target-Oriented Multimodal Sentiment Classification [70.9087014537896]
Target-oriented Multimodal Sentiment Classification (TMSC) has gained significant attention among scholars.
To investigate the causes of this problem, we perform extensive empirical evaluation and in-depth analysis of the datasets.
arXiv Detail & Related papers (2023-10-14T14:52:37Z) - Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles [136.84278943588652]
We propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event.
To facilitate this task, we outline a data collection schema for identifying diverse information and curate a dataset named DiverseSumm.
The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference.
arXiv Detail & Related papers (2023-09-17T20:28:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.