GMN: Generative Multi-modal Network for Practical Document Information
Extraction
- URL: http://arxiv.org/abs/2207.04713v1
- Date: Mon, 11 Jul 2022 08:52:36 GMT
- Title: GMN: Generative Multi-modal Network for Practical Document Information
Extraction
- Authors: Haoyu Cao, Jiefeng Ma, Antai Guo, Yiqing Hu, Hao Liu, Deqiang Jiang,
Yinsong Liu, Bo Ren
- Abstract summary: Document Information Extraction (DIE) has attracted increasing attention due to its various advanced applications in the real world.
This paper proposes Generative Multi-modal Network (GMN) for real-world scenarios to address these problems.
With the carefully designed spatial encoder and modal-aware mask module, GMN can deal with complex documents that are hard to serialize into sequential order.
- Score: 9.24332309286413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document Information Extraction (DIE) has attracted increasing attention due
to its various advanced applications in the real world. Although recent
literature has already achieved competitive results, these approaches usually
fail when dealing with complex documents with noisy OCR results or mutative
layouts. This paper proposes Generative Multi-modal Network (GMN) for
real-world scenarios to address these problems, which is a robust multi-modal
generation method without predefined label categories. With the carefully
designed spatial encoder and modal-aware mask module, GMN can deal with complex
documents that are hard to serialize into sequential order. Moreover, GMN
tolerates errors in OCR results and requires no character-level annotation,
which is vital because fine-grained annotation of numerous documents is
laborious and even requires annotators with specialized domain knowledge.
Extensive experiments show that GMN achieves new state-of-the-art performance
on several public DIE datasets and surpasses other methods by a large margin,
especially in realistic scenes.
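The abstract gives no code, but the spatial-encoder idea can be illustrated with a minimal sketch: embed each OCR token's bounding box with sinusoidal features so the model sees 2D layout directly, rather than forcing tokens into a serialized reading order. The function names, dimensions, and scale constant below are illustrative assumptions, not GMN's actual implementation.

```python
import math

def sinusoidal(value, dim, scale=1000.0):
    """Sinusoidal embedding of a scalar coordinate (dim must be even)."""
    emb = []
    for i in range(dim // 2):
        freq = value / (scale ** (2 * i / dim))
        emb.extend([math.sin(freq), math.cos(freq)])
    return emb

def spatial_embedding(box, dim_per_coord=4):
    """Embed a token bounding box (x0, y0, x1, y1) by concatenating
    per-coordinate sinusoidal embeddings, so layout is encoded without
    any assumed left-to-right, top-to-bottom token order."""
    return [v for coord in box for v in sinusoidal(coord, dim_per_coord)]

# Two OCR tokens at different layout positions get distinct embeddings.
e1 = spatial_embedding((10, 20, 50, 40))
e2 = spatial_embedding((300, 20, 360, 40))
print(len(e1))   # 16 = 4 coordinates x 4 dims each
print(e1 != e2)  # True
```

In a full model such features would be summed or concatenated with text and visual token embeddings before the Transformer layers; the sketch only shows why no serialization step is needed.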
Related papers
- UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents [65.14244917622881]
Recent Large Multimodal Models have shown promising potential for performing end-to-end KIE directly from document images.
We introduce UNIKIE-BENCH, a benchmark designed to rigorously evaluate the KIE capabilities of LMMs.
Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts.
arXiv Detail & Related papers (2026-02-03T12:04:56Z) - Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation [61.47019392413271]
WinnowRAG is designed to systematically filter out noisy documents while preserving valuable content.
WinnowRAG operates in two stages: In Stage I, we perform query-aware clustering to group similar documents and form distinct topic clusters.
In Stage II, we perform winnowing, wherein a critic LLM evaluates the outputs of multiple agents and iteratively separates useful documents from noisy ones.
arXiv Detail & Related papers (2025-11-01T20:08:13Z) - Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery.
Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations.
Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z) - Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline [56.790045049514326]
Two major forms of deception dominate: human-crafted misinformation and AI-generated content.
We propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception.
UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines.
arXiv Detail & Related papers (2025-09-30T09:26:32Z) - SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction [29.174133313633817]
Domain-specific Visually Rich Document Understanding (VRDU) presents significant challenges due to the complexity and sensitivity of documents in fields such as medicine, finance, and material science.
Existing Large (Multimodal) Language Models (LLMs/MLLMs) achieve promising results but face limitations such as hallucinations, inadequate domain adaptation, and reliance on extensive fine-tuning datasets.
This paper introduces SynDoc, a novel framework that combines discriminative and generative models to address these challenges.
arXiv Detail & Related papers (2025-09-27T12:01:52Z) - Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks [56.350173737493215]
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency.
MMESGBench is a first-of-its-kind benchmark dataset to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents.
MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories.
arXiv Detail & Related papers (2025-07-25T03:58:07Z) - Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents.
This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents.
Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z) - M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in the document.
Existing document understanding benchmarks often assess LVLMs using question-answer formats.
We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench).
M-DocSum-Bench comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Self-adaptive Multimodal Retrieval-Augmented Generation [0.0]
We propose a new approach called Self-adaptive Multimodal Retrieval-Augmented Generation (SAM-RAG).
SAM-RAG not only dynamically filters relevant documents based on the input query, including image captions when needed, but also verifies the quality of both the retrieved documents and the output.
Extensive experimental results show that SAM-RAG surpasses existing state-of-the-art methods in both retrieval accuracy and response generation.
arXiv Detail & Related papers (2024-10-15T06:39:35Z) - KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z) - All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
A multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO).
AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning.
Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z) - On Task-personalized Multimodal Few-shot Learning for Visually-rich
Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z) - GenKIE: Robust Generative Multimodal Document Key Information Extraction [24.365711528919313]
Key information extraction from scanned documents has gained increasing attention because of its applications in various domains.
We propose a novel generative end-to-end model, named GenKIE, to address the KIE task.
One notable advantage of the generative model is that it enables automatic correction of OCR errors.
arXiv Detail & Related papers (2023-10-24T19:12:56Z) - DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z) - PDSum: Prototype-driven Continuous Summarization of Evolving
Multi-document Sets Stream [33.68263291948121]
We propose a new summarization problem, Evolving Multi-Document sets stream Summarization (EMDS).
We introduce a novel unsupervised algorithm PDSum with the idea of prototype-driven continuous summarization.
PDSum builds a lightweight prototype of each multi-document set and exploits it to adapt to new documents.
arXiv Detail & Related papers (2023-02-10T23:43:46Z) - MuRAG: Multimodal Retrieval-Augmented Generator for Open Question
Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z) - End-to-End Information Extraction by Character-Level Embedding and
Multi-Stage Attentional U-Net [0.9137554315375922]
We propose a novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of the document.
We show that our model outperforms the baseline U-Net architecture by a large margin while using 40% fewer parameters.
arXiv Detail & Related papers (2021-06-02T05:42:51Z) - Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z) - Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.