GMN: Generative Multi-modal Network for Practical Document Information
Extraction
- URL: http://arxiv.org/abs/2207.04713v1
- Date: Mon, 11 Jul 2022 08:52:36 GMT
- Authors: Haoyu Cao, Jiefeng Ma, Antai Guo, Yiqing Hu, Hao Liu, Deqiang Jiang,
Yinsong Liu, Bo Ren
- Abstract summary: Document Information Extraction (DIE) has attracted increasing attention due to its various advanced applications in the real world.
This paper proposes Generative Multi-modal Network (GMN) for real-world scenarios to address these problems.
With the carefully designed spatial encoder and modal-aware mask module, GMN can deal with complex documents that are hard to serialize into sequential order.
- Score: 9.24332309286413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document Information Extraction (DIE) has attracted increasing attention due
to its various advanced applications in the real world. Although recent
literature has already achieved competitive results, these approaches usually
fail when dealing with complex documents with noisy OCR results or mutative
layouts. This paper proposes Generative Multi-modal Network (GMN) for
real-world scenarios to address these problems, which is a robust multi-modal
generation method without predefined label categories. With the carefully
designed spatial encoder and modal-aware mask module, GMN can deal with complex
documents that are hard to serialize into sequential order. Moreover, GMN
tolerates errors in OCR results and requires no character-level annotation,
which is vital because fine-grained annotation of numerous documents is
laborious and even requires annotators with specialized domain knowledge.
Extensive experiments show that GMN achieves new state-of-the-art performance
on several public DIE datasets and surpasses other methods by a large margin,
especially in realistic scenes.
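The abstract's "spatial encoder" is not detailed in this listing; as a minimal illustrative sketch (table sizes, names, and the quantization scheme are assumptions, not GMN's actual architecture), layout-aware document models commonly sum each OCR token's text embedding with embeddings of its quantized bounding-box coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, GRID, DIM = 100, 1000, 16  # toy sizes, chosen for illustration

# Separate lookup tables for token ids and quantized x/y box coordinates.
tok_emb = rng.normal(size=(VOCAB, DIM))
x_emb = rng.normal(size=(GRID, DIM))
y_emb = rng.normal(size=(GRID, DIM))

def embed(token_id, box, page_w, page_h):
    """Sum text and 2D spatial embeddings for one OCR token.

    box = (x0, y0, x1, y1) in pixels; coordinates are quantized onto a
    fixed GRID so documents of any page size share one embedding table.
    """
    x0, y0, x1, y1 = box
    qx0 = int(x0 / page_w * (GRID - 1))
    qy0 = int(y0 / page_h * (GRID - 1))
    qx1 = int(x1 / page_w * (GRID - 1))
    qy1 = int(y1 / page_h * (GRID - 1))
    return (tok_emb[token_id]
            + x_emb[qx0] + x_emb[qx1]
            + y_emb[qy0] + y_emb[qy1])

vec = embed(token_id=7, box=(50, 120, 180, 150), page_w=600, page_h=800)
```

Because position is folded into the representation, the same token at two different places on the page yields two different vectors, which is what lets the model cope with documents that have no natural reading order.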
Related papers
- Self-adaptive Multimodal Retrieval-Augmented Generation [0.0]
We propose a new approach called Self-adaptive Multimodal Retrieval-Augmented Generation (SAM-RAG)
SAM-RAG not only dynamically filters relevant documents based on the input query, including image captions when needed, but also verifies the quality of both the retrieved documents and the output.
Extensive experimental results show that SAM-RAG surpasses existing state-of-the-art methods in both retrieval accuracy and response generation.
arXiv Detail & Related papers (2024-10-15T06:39:35Z)
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs)
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long-context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z)
- All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
A multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO).
AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning.
Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- GenKIE: Robust Generative Multimodal Document Key Information Extraction [24.365711528919313]
Key information extraction from scanned documents has gained increasing attention because of its applications in various domains.
We propose a novel generative end-to-end model, named GenKIE, to address the KIE task.
One notable advantage of the generative model is that it enables automatic correction of OCR errors.
arXiv Detail & Related papers (2023-10-24T19:12:56Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- PDSum: Prototype-driven Continuous Summarization of Evolving Multi-document Sets Stream [33.68263291948121]
We propose a new summarization problem, Evolving Multi-Document sets stream Summarization (EMDS)
We introduce a novel unsupervised algorithm PDSum with the idea of prototype-driven continuous summarization.
PDSum builds a lightweight prototype of each multi-document set and exploits it to adapt to new documents.
arXiv Detail & Related papers (2023-02-10T23:43:46Z)
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG)
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
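As a toy sketch of the retrieval step only (not MuRAG's actual implementation; the memory contents, dimensions, and names here are placeholders), nearest-neighbor lookup over a dense non-parametric memory can be written as a cosine-similarity search, after which the retrieved entries would be prepended to the generator's input:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy non-parametric memory: each row stands in for the dense embedding
# of one (image, caption) pair produced by a pretrained encoder.
memory = rng.normal(size=(5, 8))
captions = [f"doc-{i}" for i in range(5)]

def retrieve(query_vec, k=2):
    """Return the captions of the top-k memory entries by cosine similarity."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(query_vec)
    scores = memory @ query_vec / norms
    top = np.argsort(scores)[::-1][:k]
    return [captions[i] for i in top]

# Querying with an embedding already in the memory returns it first.
query = memory[3]
print(retrieve(query))
```

In a real system the memory holds millions of entries and the argsort is replaced by an approximate nearest-neighbor index, but the interface is the same.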
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
- End-to-End Information Extraction by Character-Level Embedding and Multi-Stage Attentional U-Net [0.9137554315375922]
We propose a novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of the document.
We show that our model outperforms the baseline U-Net architecture by a large margin while using 40% fewer parameters.
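A 2D character-grid input of this kind can be sketched as follows (the grid resolution and the OCR input format are illustrative assumptions, not the paper's exact scheme): each OCR character is placed into a coarse cell so a U-Net-style model can read the page like an image.

```python
def char_grid(tokens, page_w, page_h, rows=4, cols=20):
    """Build a rows x cols character grid from OCR output.

    tokens: list of (text, (x0, y0, x1, y1)) pairs in page pixels
    (an illustrative format). '.' marks an empty cell; each token's
    characters run left-to-right from its quantized position.
    """
    grid = [["." for _ in range(cols)] for _ in range(rows)]
    for text, (x0, y0, x1, y1) in tokens:
        r = min(rows - 1, int((y0 + y1) / 2 / page_h * rows))
        c = int(x0 / page_w * cols)
        for ch in text:
            if c >= cols:
                break  # token overruns the right edge of the grid
            grid[r][c] = ch
            c += 1
    return grid

g = char_grid([("TOTAL", (10, 70, 120, 90)), ("9.99", (400, 70, 470, 90))],
              page_w=500, page_h=100, rows=4, cols=20)
print("\n".join("".join(row) for row in g))
```

The printed grid keeps "TOTAL" and "9.99" on the same row with a gap between them, preserving the spatial relationship a plain text serialization would lose.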
arXiv Detail & Related papers (2021-06-02T05:42:51Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs)
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
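The length flexibility of the CTC decoder family mentioned above comes from emitting one label per input frame and then collapsing; a minimal greedy CTC decode (illustrative, not the paper's implementation) merges repeats and drops the blank symbol:

```python
BLANK = "-"  # the CTC blank; here written as '-' for readability

def ctc_greedy_decode(frame_labels):
    """Collapse per-frame argmax symbols (e.g. from a BiLSTM encoder)
    into an output string: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in frame_labels:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# 10 frames decode to 5 characters; repeats within a run are merged,
# and a blank between the two 'l' runs preserves the double letter.
print(ctc_greedy_decode(list("hhe-ll-l-o")))  # prints "hello"
```

Because decoding is a per-frame scan, the same model handles input sequences of any length, unlike decoders tied to a fixed attention context.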
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.