PICK: Processing Key Information Extraction from Documents using
Improved Graph Learning-Convolutional Networks
- URL: http://arxiv.org/abs/2004.07464v3
- Date: Sat, 18 Jul 2020 08:13:53 GMT
- Title: PICK: Processing Key Information Extraction from Documents using
Improved Graph Learning-Convolutional Networks
- Authors: Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao
- Abstract summary: Key Information Extraction from documents remains a challenge.
We introduce PICK, a framework that is effective and robust in handling complex document layouts for KIE.
Our method outperforms baseline methods by significant margins.
- Score: 5.210482046387142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer vision with state-of-the-art deep learning models has recently
achieved huge success in the field of Optical Character Recognition (OCR),
including text detection and recognition tasks. However, Key Information
Extraction (KIE) from documents, the downstream task of OCR with a large number
of real-world use scenarios, remains a challenge because documents have not only
textual features extracted by OCR systems but also semantic visual features that
are not fully exploited and that play a critical role in KIE. Little work has
been devoted to making full and efficient use of both the textual and visual
features of documents. In this paper, we introduce PICK, a framework that is
effective and robust in handling complex document layouts for KIE by combining
graph learning with the graph convolution operation, yielding a richer semantic
representation that contains the textual and visual features and the global
layout without ambiguity. Extensive experiments on real-world datasets show that
our method outperforms baseline methods by significant margins. Our code is
available at https://github.com/wenwenyu/PICK-pytorch.
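To make the combination of graph learning and graph convolution concrete, the sketch below builds one node per detected text segment by concatenating its textual and visual embeddings, learns a soft adjacency matrix from pairwise node interactions, and applies a single graph-convolution step. This is a minimal illustration only: the module name, feature dimensions, and the difference-based edge scorer are assumptions made for this sketch, not the released PICK-pytorch implementation.

    # Minimal sketch of a graph-learning + graph-convolution fusion step.
    # Shapes, names, and the difference-based edge scorer are illustrative
    # assumptions, not the authors' implementation (see the linked repository).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphLearningConv(nn.Module):
        def __init__(self, text_dim=256, visual_dim=256, hidden_dim=256):
            super().__init__()
            node_dim = text_dim + visual_dim
            # Scores each pair of nodes to learn a soft adjacency matrix.
            self.edge_scorer = nn.Linear(node_dim, 1)
            # Graph-convolution weight applied after neighbor aggregation.
            self.gcn_weight = nn.Linear(node_dim, hidden_dim)

        def forward(self, text_feat, visual_feat):
            # text_feat, visual_feat: (num_segments, dim) embeddings per text segment.
            nodes = torch.cat([text_feat, visual_feat], dim=-1)          # (N, node_dim)
            # Graph learning: soft adjacency from pairwise node differences.
            diff = nodes.unsqueeze(1) - nodes.unsqueeze(0)               # (N, N, node_dim)
            adj = F.softmax(self.edge_scorer(diff).squeeze(-1), dim=-1)  # (N, N)
            # Graph convolution: aggregate neighbors, then transform.
            aggregated = adj @ nodes                                     # (N, node_dim)
            return F.relu(self.gcn_weight(aggregated))                   # (N, hidden_dim)

    # Example: 12 text segments, each with 256-d textual and 256-d visual embeddings.
    layer = GraphLearningConv()
    fused = layer(torch.randn(12, 256), torch.randn(12, 256))            # (12, 256)

In a full KIE pipeline a downstream decoder would then label each segment with an entity type; that stage is omitted here.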
Related papers
- See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding.
To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z)
- VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding [18.609441902943445]
VisFocus is an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt.
We pair the architecture enhancements with a novel pre-training task, using language masking on a snippet of the document text fed to the visual encoder.
Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance.
arXiv Detail & Related papers (2024-07-17T14:16:46Z)
- Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration [26.408343160223517]
We propose a novel end-to-end document understanding model called SeRum.
SeRum converts image understanding and recognition tasks into a local decoding process of the visual tokens of interest.
We show that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks.
arXiv Detail & Related papers (2023-09-03T10:14:34Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT-4V and Gemini, on various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution [30.438041837029875]
We propose a robust visual information extraction system (VIES) for real-world scenarios.
VIES is a unified end-to-end trainable framework for simultaneous text detection, recognition and information extraction.
We construct a fully-annotated dataset called EPHOIE, which is the first Chinese benchmark for both text spotting and visual information extraction.
arXiv Detail & Related papers (2021-01-24T11:05:24Z)
- TRIE: End-to-End Text Reading and Information Extraction for Document Understanding [56.1416883796342]
We propose a unified end-to-end text reading and information extraction network.
Multimodal visual and textual features of text reading are fused for information extraction (a rough fusion sketch follows this list).
Our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.
arXiv Detail & Related papers (2020-05-27T01:47:26Z)
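To illustrate the multimodal fusion mentioned in the TRIE entry above, here is a minimal sketch that gates together the textual and visual features of each recognized text segment and predicts a per-segment label. The gating scheme, feature dimensions, and label count are assumptions made for this sketch, not that paper's actual architecture.

    # Minimal sketch of fusing visual and textual features of recognized text
    # for per-segment tagging. The gating scheme, shapes, and label count are
    # illustrative assumptions, not TRIE's actual design.
    import torch
    import torch.nn as nn

    class MultimodalFusionTagger(nn.Module):
        def __init__(self, text_dim=256, visual_dim=256, num_labels=5):
            super().__init__()
            self.proj_visual = nn.Linear(visual_dim, text_dim)
            # Gate decides, per dimension, how much to trust text vs. vision.
            self.gate = nn.Linear(text_dim + visual_dim, text_dim)
            self.classifier = nn.Linear(text_dim, num_labels)  # e.g. key-field labels

        def forward(self, text_feat, visual_feat):
            # text_feat:   (num_segments, text_dim)   features from text recognition
            # visual_feat: (num_segments, visual_dim) features from the image backbone
            gate = torch.sigmoid(self.gate(torch.cat([text_feat, visual_feat], dim=-1)))
            fused = gate * text_feat + (1 - gate) * self.proj_visual(visual_feat)
            return self.classifier(fused)  # per-segment label logits

    # Example: 8 recognized text segments with 256-d textual and visual features each.
    tagger = MultimodalFusionTagger()
    logits = tagger(torch.randn(8, 256), torch.randn(8, 256))  # (8, 5)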