Data-Efficient Information Extraction from Form-Like Documents
- URL: http://arxiv.org/abs/2201.02647v1
- Date: Fri, 7 Jan 2022 19:16:49 GMT
- Title: Data-Efficient Information Extraction from Form-Like Documents
- Authors: Beliz Gunel and Navneet Potti and Sandeep Tata and James B. Wendt and
Marc Najork and Jing Xie
- Abstract summary: The key challenge is that form-like documents can be laid out in virtually infinitely many ways.
Data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document types.
- Score: 14.567098292973075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automating information extraction from form-like documents at scale is a
pressing need due to its potential impact on automating business workflows
across many industries like financial services, insurance, and healthcare. The
key challenge is that form-like documents in these business workflows can be
laid out in virtually infinitely many ways; hence, a good solution to this
problem should generalize to documents with unseen layouts and languages. A
solution to this problem requires a holistic understanding of both the textual
segments and the visual cues within a document, which is non-trivial. While the
natural language processing and computer vision communities are starting to
tackle this problem, there has not been much focus on (1) data-efficiency, and
(2) ability to generalize across different document types and languages.
In this paper, we show that when we have only a small number of labeled
documents for training (~50), a straightforward transfer learning approach from
a considerably structurally-different larger labeled corpus yields up to a 27
F1 point improvement over simply training on the small corpus in the target
domain. We improve on this with a simple multi-domain transfer learning
approach, which is currently in production use, and show that this yields up to
a further 8 F1 point improvement. We make the case that data efficiency is
critical to enable information extraction systems to scale to handle hundreds
of different document types, and learning good representations is critical to
accomplishing this.
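The transfer-learning recipe above (pretrain on a large, structurally different labeled corpus, then fine-tune on ~50 target-domain documents) can be sketched as follows. This is a toy illustration using an invented `FieldClassifier` perceptron over made-up feature vectors, not the authors' actual extraction model or data.

```python
# Minimal sketch of the transfer-learning recipe: pretrain on a large
# "source" corpus, then continue training on a small "target" corpus.
# FieldClassifier and the feature vectors below are illustrative assumptions.

class FieldClassifier:
    """Tiny linear (perceptron-style) classifier over candidate features."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if score > 0 else 0

    def train(self, data, epochs=10, lr=0.1):
        for _ in range(epochs):
            for x, y in data:
                err = y - self.predict(x)
                if err:
                    self.w = [wi + lr * err * xi
                              for wi, xi in zip(self.w, x)]

# "Source" corpus: plentiful labeled examples from a different document type.
source = [([1.0, 0.0, 1.0], 1), ([0.0, 1.0, 0.0], 0)] * 20
# "Target" corpus: only a handful of labeled examples (~50 docs in the paper).
target = [([1.0, 0.2, 0.9], 1), ([0.1, 1.0, 0.1], 0)] * 5

model = FieldClassifier(n_features=3)
model.train(source)            # pretrain on the large source corpus
model.train(target, epochs=3)  # fine-tune on the small target-domain set

print(model.predict([0.9, 0.1, 0.8]))  # → 1
```

The key point mirrored here is that the target-domain pass starts from the source-trained weights rather than from scratch, which is what makes the small labeled set sufficient.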
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework, where multiple related tasks are learned jointly and benefit from a shared representation space.
We show that MTL can be successful even for classification tasks with little or non-overlapping annotations.
We propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching.
arXiv Detail & Related papers (2024-01-02T14:18:11Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- Improving Information Extraction on Business Documents with Specific Pre-Training Tasks [1.9331361036118608]
Transformer-based Language Models are widely used in Natural Language Processing related tasks.
We introduce two new pre-training tasks that force the model to learn better-contextualized representations of the scanned documents.
We also introduce a new post-processing algorithm to decode BIESO tags in Information Extraction that performs better with complex entities.
arXiv Detail & Related papers (2023-09-11T13:05:23Z)
- Multimodal Document Analytics for Banking Process Automation [4.541582055558865]
The paper contributes original empirical evidence on the effectiveness and efficiency of multimodal models for document processing in the banking business.
It offers practical guidance on how to unlock this potential in day-to-day operations.
arXiv Detail & Related papers (2023-07-21T18:29:04Z)
- An Augmentation Strategy for Visually Rich Documents [13.428304945684621]
We propose a novel data augmentation technique to improve performance when training data is scarce.
Our technique, which we call FieldSwap, works by swapping out the key phrases of a source field with the key phrases of a target field.
We demonstrate that this approach can yield 1-7 F1 point improvements in extraction performance.
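The FieldSwap idea described above can be sketched as a simple string-level augmentation: rewrite an example's key phrase for one field with the key phrases of another field, relabeling the result. The phrase lists and document snippets below are invented for illustration and are not from the paper.

```python
# Hedged sketch of a FieldSwap-style augmentation: swap the key phrases of
# a source field with those of a target field to synthesize new examples.
# key_phrases and the examples are illustrative assumptions.

def field_swap(examples, source_field, target_field, key_phrases):
    """Create synthetic `target_field` examples from `source_field` ones by
    replacing the source key phrase with each target key phrase."""
    augmented = []
    for text, field in examples:
        if field != source_field:
            continue
        for src_phrase in key_phrases[source_field]:
            if src_phrase in text:
                for tgt_phrase in key_phrases[target_field]:
                    augmented.append(
                        (text.replace(src_phrase, tgt_phrase), target_field)
                    )
    return augmented

key_phrases = {
    "invoice_date": ["Invoice Date"],
    "due_date": ["Due Date", "Payment Due"],
}
examples = [("Invoice Date: 2022-01-07", "invoice_date")]

print(field_swap(examples, "invoice_date", "due_date", key_phrases))
# → [('Due Date: 2022-01-07', 'due_date'),
#    ('Payment Due: 2022-01-07', 'due_date')]
```

Each swap turns one scarce labeled example into several, which is what drives the reported 1-7 F1 point gains when training data is limited.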
arXiv Detail & Related papers (2022-12-20T07:44:25Z)
- Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness [0.0]
We outline the requirements, design, and implementation choices of our document conversion service and reflect on the challenges we faced.
Our best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.
arXiv Detail & Related papers (2022-06-01T22:30:30Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Scaling Systematic Literature Reviews with Machine Learning Pipelines [57.82662094602138]
Systematic reviews entail the extraction of data from scientific documents.
We construct a pipeline that automates each of these aspects, and experiment with many human-time vs. system quality trade-offs.
We find that we can get surprising accuracy and generalisability of the whole pipeline system with only 2 weeks of human-expert annotation.
arXiv Detail & Related papers (2020-10-09T16:19:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.