UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents
- URL: http://arxiv.org/abs/2602.07038v1
- Date: Tue, 03 Feb 2026 12:04:56 GMT
- Title: UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents
- Authors: Yifan Ji, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Qian Zhang, Zhibo Yang, Junyang Lin, Yu Gu, Ge Yu, Maosong Sun
- Abstract summary: Recent Large Multimodal Models have shown promising potential for performing end-to-end KIE directly from document images. We introduce UNIKIE-BENCH, a benchmark designed to rigorously evaluate the KIE capabilities of LMMs. Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts.
- Score: 65.14244917622881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Key Information Extraction (KIE) from real-world documents remains challenging due to substantial variations in layout structures, visual quality, and task-specific information requirements. Recent Large Multimodal Models (LMMs) have shown promising potential for performing end-to-end KIE directly from document images. To enable a comprehensive and systematic evaluation across realistic and diverse application scenarios, we introduce UNIKIE-BENCH, a unified benchmark designed to rigorously evaluate the KIE capabilities of LMMs. UNIKIE-BENCH consists of two complementary tracks: a constrained-category KIE track with scenario-predefined schemas that reflect practical application needs, and an open-category KIE track that extracts any key information that is explicitly present in the document. Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts, along with pronounced performance disparities across different document types and scenarios. These findings underscore persistent challenges in grounding accuracy and layout-aware reasoning for LMM-based KIE. All code and datasets are available at https://github.com/NEUIR/UNIKIE-BENCH.
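To make the two-track distinction concrete, the sketch below contrasts a constrained-category prompt (schema given up front) with an open-category prompt (no schema) for a single document image. The `query_lmm` helper, the prompts, and the example receipt schema are illustrative assumptions for a generic chat-style LMM; they are not the benchmark's actual interface or schemas (see the repository linked above for those).

```python
import json

def query_lmm(image_path: str, prompt: str) -> str:
    """Placeholder for any chat-style LMM call that accepts an image path
    and a text prompt and returns a text response (assumption)."""
    raise NotImplementedError

def constrained_category_kie(image_path: str, schema: list[str]) -> dict:
    """Constrained-category track: the scenario predefines which key fields
    must be extracted, so the schema is spelled out in the prompt."""
    prompt = (
        "Extract the following fields from the document image and return a "
        "JSON object with exactly these keys: " + ", ".join(schema) +
        ". Use null for any field that is not present."
    )
    return json.loads(query_lmm(image_path, prompt))

def open_category_kie(image_path: str) -> dict:
    """Open-category track: no schema is given; the model must surface every
    key-value pair explicitly present in the document."""
    prompt = (
        "List every key piece of information explicitly present in the "
        "document image as a JSON object of key-value pairs."
    )
    return json.loads(query_lmm(image_path, prompt))

# Illustrative schema for a receipt-like scenario (hypothetical field names):
# fields = ["merchant_name", "date", "total_amount"]
# pred = constrained_category_kie("receipt.png", fields)
```

On the constrained track, JSON output like this can be scored field by field against the predefined schema; the open track would instead require matching arbitrary predicted keys against whatever gold keys are annotated.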
Related papers
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
- The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs [14.21269233160436]
We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10-40 documents) and surface features that are rare in the global context. This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs.
arXiv Detail & Related papers (2025-08-29T21:23:48Z)
- MMESGBench: Pioneering Multimodal Understanding and Complex Reasoning Benchmark for ESG Tasks [56.350173737493215]
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. MMESGBench is a first-of-its-kind benchmark dataset to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories.
arXiv Detail & Related papers (2025-07-25T03:58:07Z)
- Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
The Arranged and Organized Extraction (AOE) Benchmark is designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
- DocMMIR: A Framework for Document Multi-modal Information Retrieval [35.41540195822167]
We introduce DocMMIR, a novel multi-modal document retrieval framework. We construct a large-scale cross-domain multimodal benchmark, comprising 450K samples. Results show a +31% improvement in MRR@10 compared to the zero-shot baseline.
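For reference, MRR@10, the metric cited above, is the mean reciprocal rank of the first relevant result within the top ten retrieved candidates. The snippet below is a minimal, self-contained sketch of that standard definition; it is generic and not taken from the DocMMIR codebase.

```python
def mrr_at_10(ranked_lists: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean Reciprocal Rank@10: average over queries of 1/rank of the first
    relevant item in the top-10 results, or 0 if none appears there."""
    total = 0.0
    for results, gold in zip(ranked_lists, relevant):
        rr = 0.0
        for rank, doc_id in enumerate(results[:10], start=1):
            if doc_id in gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists) if ranked_lists else 0.0

# Example: one query whose first relevant document is ranked 2nd -> MRR@10 = 0.5
# mrr_at_10([["d3", "d7", "d1"]], [{"d7"}])
```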
arXiv Detail & Related papers (2025-05-25T20:58:58Z)
- REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark [16.55516587540082]
We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval. We propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing.
arXiv Detail & Related papers (2025-02-17T22:10:47Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- GraphRevisedIE: Multimodal Information Extraction with Graph-Revised Network [3.9472311338123287]
Key information extraction from visually rich documents (VRD) has been a challenging task in document intelligence.
We propose a lightweight model named GraphRevisedIE that effectively embeds multimodal features such as textual, visual, and layout features from VRD.
Extensive experiments on multiple real-world datasets show that GraphRevisedIE generalizes to documents of varied layouts and achieves comparable or better performance compared to previous KIE methods.
arXiv Detail & Related papers (2024-10-02T01:29:49Z)
- Exploring the Capabilities of Large Multimodal Models on Dense Text [58.82262549456294]
We propose the DT-VQA dataset, with 170k question-answer pairs.
In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs.
We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
arXiv Detail & Related papers (2024-05-09T07:47:25Z)