MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
- URL: http://arxiv.org/abs/2407.01523v3
- Date: Tue, 12 Nov 2024 04:37:44 GMT
- Title: MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
- Authors: Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun
- Abstract summary: This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions.
It is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens.
Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models.
- Score: 105.10376440302076
- Abstract: Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their capabilities for long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Toward comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e., page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence from multiple pages, and 22.8% are designed to be unanswerable in order to detect potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) perform worse than their LLM counterparts, which are fed lossily OCR-parsed text. These results underscore the need for future research toward more capable long-context LVLMs. Project Page: https://mayubo2333.github.io/MMLongBench-Doc
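The reported F1 scores are computed over a question set that mixes answerable and unanswerable items. As a rough illustration only, the sketch below scores predictions by exact match and treats abstaining on unanswerable questions as a correct rejection; the helper names (`Question`, `f1_over_benchmark`) and the exact-match criterion are assumptions made for this sketch, not the benchmark's official evaluation protocol.

```python
# Illustrative sketch: F1 over a QA benchmark with answerable and unanswerable
# questions. This is NOT the official MMLongBench-Doc scorer; exact-match
# comparison and the "not answerable" label are simplifying assumptions.
from dataclasses import dataclass
from typing import List

UNANSWERABLE = "not answerable"

@dataclass
class Question:
    prediction: str  # model output (free text)
    reference: str   # gold answer, or "not answerable"

def f1_over_benchmark(items: List[Question]) -> float:
    """Answerable questions act as positives: precision measures how often
    attempted answers are correct, recall measures how many answerable
    questions are answered correctly. Correct abstentions add nothing."""
    tp = fp = fn = 0
    for q in items:
        pred = q.prediction.strip().lower()
        gold = q.reference.strip().lower()
        if gold == UNANSWERABLE:
            if pred != UNANSWERABLE:
                fp += 1      # hallucinated an answer to an unanswerable question
        else:
            if pred == UNANSWERABLE:
                fn += 1      # abstained on an answerable question
            elif pred == gold:
                tp += 1      # correct answer (exact match, simplified)
            else:
                fp += 1
                fn += 1      # wrong answer hurts both precision and recall
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Toy usage:
qs = [
    Question("42.7%", "42.7%"),
    Question("Paris", "not answerable"),
    Question("not answerable", "not answerable"),
]
print(f"F1 = {f1_over_benchmark(qs):.3f}")
```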
Related papers
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts.
M3DocRAG can efficiently handle single or many documents while preserving visual information.
We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
arXiv Detail & Related papers (2024-11-07T18:29:38Z)
- MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding [66.23502779435053]
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many vision-language tasks.
Existing benchmarks either contain limited fine-grained evaluation samples mixed with other data, or are confined to object-level assessments in natural images.
We propose using document images with multi-granularity and multi-modal information to supplement natural images.
arXiv Detail & Related papers (2024-10-25T16:00:55Z)
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers.
To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.
Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model and the strongest open-source model.
arXiv Detail & Related papers (2024-06-26T17:50:11Z)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
- ContraDoc: Understanding Self-Contradictions in Documents with Large Language Models [7.428236410246183]
We introduce ContraDoc, the first human-annotated dataset to study self-contradictions in long documents across multiple domains.
We analyze the current capabilities of four state-of-the-art open-source and commercially available LLMs: GPT3.5, GPT4, PaLM2, and LLaMAv2 on this dataset.
While GPT4 performs the best and can outperform humans on this task, we find that it is still unreliable and struggles with self-contradictions that require more nuance and context.
arXiv Detail & Related papers (2023-11-15T18:23:17Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these richly structured documents.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.