Enhancing Document Information Analysis with Multi-Task Pre-training: A
Robust Approach for Information Extraction in Visually-Rich Documents
- URL: http://arxiv.org/abs/2310.16527v1
- Date: Wed, 25 Oct 2023 10:22:30 GMT
- Title: Enhancing Document Information Analysis with Multi-Task Pre-training: A
Robust Approach for Information Extraction in Visually-Rich Documents
- Authors: Tofik Ali and Partha Pratim Roy
- Abstract summary: The model is pre-trained and subsequently fine-tuned for various document image analysis tasks.
The proposed model achieved impressive results across all tasks, with an accuracy of 95.87% on the RVL-CDIP dataset for document classification.
- Score: 8.49076413640561
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a deep learning model tailored for document information
analysis, emphasizing document classification, entity relation extraction, and
document visual question answering. The proposed model leverages
transformer-based models to encode all the information present in a document
image, including textual, visual, and layout information. The model is
pre-trained and subsequently fine-tuned for various document image analysis
tasks. The proposed model incorporates three additional tasks during the
pre-training phase, including reading order identification of different layout
segments in a document image, layout segments categorization as per PubLayNet,
and generation of the text sequence within a given layout segment (text block).
The model also incorporates a collective pre-training scheme where losses of
all the tasks under consideration, including pre-training and fine-tuning tasks
with all datasets, are considered. Additional encoder and decoder blocks are
added to the RoBERTa network to generate results for all tasks. The proposed
model achieved impressive results across all tasks, with an accuracy of 95.87%
on the RVL-CDIP dataset for document classification, F1 scores of 0.9306,
0.9804, 0.9794, and 0.8742 on the FUNSD, CORD, SROIE, and Kleister-NDA datasets
respectively for entity relation extraction, and an ANLS score of 0.8468 on the
DocVQA dataset for visual question answering. The results highlight the
effectiveness of the proposed model in understanding and interpreting complex
document layouts and content, making it a promising tool for document analysis
tasks.
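For readers unfamiliar with the ANLS metric quoted for DocVQA above, the following is a minimal sketch of how Average Normalized Levenshtein Similarity is commonly computed. It is not taken from the paper. The function names `levenshtein` and `anls` are illustrative, and the 0.5 threshold and the lowercasing and whitespace stripping are assumptions that follow the usual DocVQA evaluation convention.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best similarity against any reference answer.

    Similarity is 1 - NL, where NL is the edit distance divided by the longer
    string length; it is set to 0 when NL reaches the threshold (assumed 0.5).
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under these assumptions, a prediction that differs from a reference only in case scores 1.0, while a prediction that is more than half wrong by edit distance scores 0.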
Related papers
- DLAFormer: An End-to-End Transformer For Document Layout Analysis [7.057192434574117]
We propose an end-to-end transformer-based approach for document layout analysis, called DLAFormer.
We treat various DLA sub-tasks as relation prediction problems and consolidate these relation prediction labels into a unified label space.
We introduce a novel set of type-wise queries to enhance the physical meaning of content queries in DETR.
arXiv Detail & Related papers (2024-05-20T03:34:24Z) - Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z) - LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding [0.0]
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents.
Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure.
Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
arXiv Detail & Related papers (2024-03-21T09:25:24Z) - On Task-personalized Multimodal Few-shot Learning for Visually-rich
Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z) - Unveiling Document Structures with YOLOv5 Layout Detection [0.0]
This research investigates the utilization of YOLOv5, a cutting-edge computer vision model, for the purpose of rapidly identifying document layouts and extracting unstructured data.
The main objective is to create an autonomous system that can effectively recognize document layouts and extract unstructured data.
arXiv Detail & Related papers (2023-09-29T07:45:10Z) - OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text
Documents [122.55393759474181]
We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents.
We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content.
We train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks.
arXiv Detail & Related papers (2023-06-21T14:01:01Z) - DUBLIN -- Document Understanding By Language-Image Network [37.42637168606938]
We propose DUBLIN, which is pretrained on web pages using three novel objectives.
We show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset.
We also achieve competitive performance on RVL-CDIP document classification.
arXiv Detail & Related papers (2023-05-23T16:34:09Z) - Visual Information Extraction in the Wild: Practical Dataset and
End-to-end Solution [48.693941280097974]
We propose a large-scale dataset consisting of camera images for visual information extraction (VIE).
We propose a novel framework for end-to-end VIE that combines the stages of OCR and information extraction in an end-to-end learning fashion.
We evaluate the existing end-to-end methods for VIE on the proposed dataset and observe that the performance of these methods has a distinguishable drop from SROIE to our proposed dataset due to the larger variance of layout and entities.
arXiv Detail & Related papers (2023-05-12T14:11:47Z) - DocILE Benchmark for Document Information Localization and Extraction [7.944448547470927]
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition.
It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training.
arXiv Detail & Related papers (2023-02-11T11:32:10Z) - One-shot Key Information Extraction from Document with Deep Partial
Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z) - LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document
Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z)