GenKIE: Robust Generative Multimodal Document Key Information Extraction
- URL: http://arxiv.org/abs/2310.16131v1
- Date: Tue, 24 Oct 2023 19:12:56 GMT
- Title: GenKIE: Robust Generative Multimodal Document Key Information Extraction
- Authors: Panfeng Cao, Ye Wang, Qiang Zhang, Zaiqiao Meng
- Abstract summary: Key information extraction from scanned documents has gained increasing attention because of its applications in various domains.
We propose a novel generative end-to-end model, named GenKIE, to address the KIE task.
One notable advantage of the generative model is that it enables automatic correction of OCR errors.
- Score: 24.365711528919313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Key information extraction (KIE) from scanned documents has gained increasing
attention because of its applications in various domains. Although promising
results have been achieved by some recent KIE approaches, they are usually
built on discriminative models, which cannot handle optical character
recognition (OCR) errors and require laborious token-level labelling.
In this paper, we propose a novel generative end-to-end model, named GenKIE, to
address the KIE task. GenKIE is a sequence-to-sequence multimodal generative
model that utilizes multimodal encoders to embed visual, layout and textual
features and a decoder to generate the desired output. Well-designed prompts
are leveraged to incorporate the label semantics as weakly supervised
signals and to guide the generation of the key information. One notable advantage
of the generative model is that it enables automatic correction of OCR errors.
Moreover, token-level granular annotation is not required. Extensive experiments
on multiple public real-world datasets show that GenKIE effectively generalizes
over different types of documents and achieves state-of-the-art results. Our
experiments also validate the model's robustness against OCR errors, making
GenKIE highly applicable in real-world scenarios.
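To make the described architecture concrete, below is a minimal, illustrative PyTorch sketch of a generative multimodal KIE model: OCR token embeddings are fused with projected bounding-box (layout) features, concatenated with visual patch embeddings, encoded jointly, and decoded from a label-semantic prompt. All class names, dimensions, and the fusion scheme are assumptions for illustration; GenKIE itself builds on a pretrained multimodal encoder-decoder, which this toy model does not reproduce.

```python
# Minimal sketch of a GenKIE-style generative multimodal KIE model.
# Everything here (names, dimensions, fusion scheme) is an illustrative
# assumption; the actual model uses a pretrained multimodal backbone.
import torch
import torch.nn as nn

class GenerativeKIE(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, nhead=8, layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Layout modality: each OCR token has a normalized bounding box
        # (x0, y0, x1, y1) projected into the embedding space and added
        # to its text embedding.
        self.box_proj = nn.Linear(4, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ocr_ids, ocr_boxes, visual_feats, prompt_ids):
        # Fuse text + layout, then prepend visual patch embeddings so the
        # encoder attends over all three modalities jointly.
        text = self.tok_emb(ocr_ids) + self.box_proj(ocr_boxes)
        src = torch.cat([visual_feats, text], dim=1)
        # Decoder input is a prompt such as "company is ?"; its label
        # semantics act as the weakly supervised signal for generation.
        tgt = self.tok_emb(prompt_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=mask)
        return self.lm_head(hidden)  # next-token logits

# Toy usage: one document, 10 OCR tokens, 4 visual patches, 6 prompt tokens.
model = GenerativeKIE()
logits = model(
    ocr_ids=torch.randint(0, 32000, (1, 10)),
    ocr_boxes=torch.rand(1, 10, 4),      # normalized to [0, 1]
    visual_feats=torch.randn(1, 4, 256), # assumed image-backbone output
    prompt_ids=torch.randint(0, 32000, (1, 6)),
)
print(logits.shape)  # torch.Size([1, 6, 32000])
```

At inference time one would decode autoregressively from the prompt and read off the generated value; generating text rather than tagging tokens is what lets such a model emit corrected strings that never appeared verbatim in the OCR output.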
Related papers
- CableInspect-AD: An Expert-Annotated Anomaly Detection Dataset [14.246172794156987]
CableInspect-AD is a high-quality dataset created and annotated by domain experts from Hydro-Québec, a Canadian public utility.
This dataset includes high-resolution images with challenging real-world anomalies, covering defects with varying severity levels.
We present a comprehensive evaluation protocol based on cross-validation to assess models' performances.
arXiv Detail & Related papers (2024-09-30T14:50:13Z)
- Generative Multi-modal Models are Good Class-Incremental Learners [51.5648732517187]
We propose a novel generative multi-modal model (GMM) framework for class-incremental learning.
Our approach directly generates labels for images using an adapted generative model.
Under the few-shot CIL setting, our approach improves accuracy by at least 14% over all current state-of-the-art methods, with significantly less forgetting.
arXiv Detail & Related papers (2024-03-27T09:21:07Z)
- Unified Generation, Reconstruction, and Representation: Generalized Diffusion with Adaptive Latent Encoding-Decoding [90.77521413857448]
Deep generative models are anchored in three core capabilities -- generating new instances, reconstructing inputs, and learning compact representations.
We introduce generalized Encoding-Decoding Diffusion Probabilistic Models (EDDPMs).
EDDPMs generalize the Gaussian noising-denoising in standard diffusion by introducing parameterized encoding-decoding.
Experiments on text, proteins, and images demonstrate the flexibility to handle diverse data and tasks.
arXiv Detail & Related papers (2024-02-29T10:08:57Z)
- GenFace: A Large-Scale Fine-Grained Face Forgery Benchmark and Cross Appearance-Edge Learning [50.7702397913573]
The rapid advancement of photorealistic generators has reached a critical juncture where authentic and manipulated images are increasingly indistinguishable.
Although there have been a number of publicly available face forgery datasets, the forged faces are mostly generated using GAN-based synthesis technology.
We propose a large-scale, diverse, and fine-grained high-fidelity dataset, namely GenFace, to facilitate the advancement of deepfake detection.
arXiv Detail & Related papers (2024-02-03T03:13:50Z)
- Sequence-to-Sequence Pre-training with Unified Modality Masking for Visual Document Understanding [3.185382039518151]
GenDoc is a sequence-to-sequence document understanding model pre-trained with unified masking across three modalities.
The proposed model utilizes an encoder-decoder architecture, which allows for increased adaptability to a wide range of downstream tasks.
arXiv Detail & Related papers (2023-05-16T15:25:19Z)
- GMN: Generative Multi-modal Network for Practical Document Information Extraction [9.24332309286413]
Document Information Extraction (DIE) has attracted increasing attention due to its various advanced applications in the real world.
This paper proposes Generative Multi-modal Network (GMN) for real-world scenarios to address these problems.
With the carefully designed spatial encoder and modal-aware mask module, GMN can deal with complex documents that are hard to serialize into sequential order.
arXiv Detail & Related papers (2022-07-11T08:52:36Z)
- Toward Certified Robustness Against Real-World Distribution Shifts [65.66374339500025]
We train a generative model to learn perturbations from data and define specifications with respect to the output of the learned model.
A unique challenge arising from this setting is that existing verifiers cannot tightly approximate sigmoid activations.
We propose a general meta-algorithm for handling sigmoid activations which leverages classical notions of counter-example-guided abstraction refinement.
arXiv Detail & Related papers (2022-06-08T04:09:13Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, CTC-based architectures can handle inputs of arbitrary length (a minimal sketch of such a recognizer appears after this list).
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than simply appending it to the text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
- Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward [42.925345819778656]
We present ASGARD, a novel framework for Abstractive Summarization with Graph-Augmentation and semantic-driven RewarD.
We propose the use of dual encoders, a sequential document encoder and a graph-structured encoder, to maintain the global context and local characteristics of entities.
Results show that our models produce significantly higher ROUGE scores than a variant without the knowledge graph as input on both the New York Times and CNN/Daily Mail datasets.
arXiv Detail & Related papers (2020-05-03T18:23:06Z)
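As a companion to the "Rethinking Text Line Recognition Models" entry above, here is a minimal sketch of the CTC decoder family it mentions, assuming a BiLSTM encoder over per-frame visual features. The dimensions, class names, and training setup are illustrative assumptions, not that paper's configuration.

```python
# Illustrative CTC-style text line recognizer: a BiLSTM encoder over the
# width-wise frame sequence, trained with CTC loss. CTC places no fixed
# bound on the input length, which is why such models handle arbitrarily
# long lines. All names and sizes here are assumptions for illustration.
import torch
import torch.nn as nn

class CTCLineRecognizer(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, num_classes=80):
        super().__init__()
        # Bidirectional LSTM encoder over per-frame visual features.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        # Per-frame class logits; index 0 is reserved for the CTC blank.
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames):            # frames: (batch, time, feat_dim)
        out, _ = self.encoder(frames)
        return self.head(out)             # (batch, time, num_classes)

model = CTCLineRecognizer()
ctc = nn.CTCLoss(blank=0)
frames = torch.randn(2, 50, 64)                     # 2 lines, 50 frames each
log_probs = model(frames).log_softmax(-1).transpose(0, 1)  # CTC wants (T, N, C)
targets = torch.randint(1, 80, (2, 12))             # label ids, blank excluded
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 50),
           target_lengths=torch.full((2,), 12))
loss.backward()
```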
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.