Improving Information Extraction on Business Documents with Specific
Pre-Training Tasks
- URL: http://arxiv.org/abs/2309.05429v1
- Date: Mon, 11 Sep 2023 13:05:23 GMT
- Authors: Thibault Douzon, Stefan Duffner, Christophe Garcia and Jérémy Espinas
- Abstract summary: Transformer-based Language Models are widely used in Natural Language Processing tasks.
We introduce two new pre-training tasks that force the model to learn better-contextualized representations of the scanned documents.
We also introduce a new post-processing algorithm to decode BIESO tags in Information Extraction that performs better with complex entities.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based Language Models are widely used in Natural Language
Processing tasks. Thanks to their pre-training, they have been
successfully adapted to Information Extraction in business documents. However,
most pre-training tasks proposed in the literature for business documents are
too generic and not sufficient to learn more complex structures. In this paper,
we use LayoutLM, a language model pre-trained on a collection of business
documents, and introduce two new pre-training tasks that further improve its
capacity to extract relevant information. The first is aimed at better
understanding the complex layout of documents, and the second focuses on
numeric values and their order of magnitude. These tasks force the model to
learn better-contextualized representations of the scanned documents. We
further introduce a new post-processing algorithm to decode BIESO tags in
Information Extraction that performs better with complex entities. Our method
significantly improves extraction performance on both public (from 93.88 to
95.50 F1 score) and private (from 84.35 to 84.84 F1 score) datasets composed of
expense receipts, invoices, and purchase orders.
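The paper does not spell out its BIESO post-processing here, but the idea of decoding BIESO tags into entities can be illustrated with a minimal greedy sketch. All names below (`decode_bieso`, the `PREFIX-LABEL` tag format) are assumptions for illustration, not the authors' implementation:

```python
def decode_bieso(tokens, tags):
    """Group (token, tag) pairs into labeled entities using BIESO tags.

    Tags look like "B-TOTAL", "I-TOTAL", "E-TOTAL", "S-DATE", or "O"
    (Begin, Inside, End, Single, Outside). This greedy decoder extends
    an entity on I- tags of the same label and closes it on E-, S-, O,
    or a label change.
    """
    entities = []
    span, label = [], None

    def flush():
        nonlocal span, label
        if span:
            entities.append((label, " ".join(span)))
        span, label = [], None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, _, lab = tag.partition("-")
        if prefix == "S":
            flush()
            entities.append((lab, token))
        elif prefix == "B" or lab != label:
            flush()  # a new entity starts; close any open one
            span, label = [token], lab
        else:  # I- or E- continuing the current entity
            span.append(token)
            if prefix == "E":
                flush()
    flush()
    return entities
```

A decoder along these lines is what makes multi-token entities (e.g. an invoice total split across several tokens) recoverable from per-token predictions; the paper's contribution is a variant that handles such complex entities more robustly.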
Related papers
- Instruction-tuned Language Models are Better Knowledge Learners [106.38526595116961]
We propose pre-instruction-tuning (PIT), a method that instruction-tunes on questions prior to training on documents.
Extensive experiments and ablation studies demonstrate that pre-instruction-tuning significantly enhances the ability of LLMs to absorb knowledge from new documents.
arXiv Detail & Related papers (2024-02-20T09:20:32Z)
- Document-Level In-Context Few-Shot Relation Extraction via Pre-Trained Language Models [29.94694305204144]
We present a novel framework for document-level in-context few-shot relation extraction.
We evaluate our framework using DocRED, the largest publicly available dataset for document-level relation extraction.
arXiv Detail & Related papers (2023-10-17T09:10:27Z)
- In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Data-Efficient Information Extraction from Form-Like Documents [14.567098292973075]
A key challenge is that form-like documents can be laid out in virtually infinitely many ways.
Data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document-types.
arXiv Detail & Related papers (2022-01-07T19:16:49Z)
- Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models [23.42593796135709]
We study the problem of information extraction from visually rich documents (VRDs).
We present a model that combines the power of large pre-trained language models and graph neural networks to efficiently encode both textual and visual information in business documents.
arXiv Detail & Related papers (2020-05-22T06:04:50Z)
- Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem.
Given a query (e.g., a question), return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
arXiv Detail & Related papers (2020-02-10T16:44:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.