DocumentNet: Bridging the Data Gap in Document Pre-Training
- URL: http://arxiv.org/abs/2306.08937v3
- Date: Thu, 26 Oct 2023 16:23:15 GMT
- Title: DocumentNet: Bridging the Data Gap in Document Pre-Training
- Authors: Lijun Yu, Jin Miao, Xiaoyu Sun, Jiayi Chen, Alexander G. Hauptmann, Hanjun Dai, Wei Wei
- Abstract summary: We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
- Score: 78.01647768018485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document understanding tasks, in particular, Visually-rich Document Entity
Retrieval (VDER), have gained significant attention in recent years thanks to
their broad applications in enterprise AI. However, publicly available data
have been scarce for these tasks due to strict privacy constraints and high
annotation costs. To make things worse, the non-overlapping entity spaces from
different datasets hinder the knowledge transfer between document types. In
this paper, we propose a method to collect massive-scale and weakly labeled
data from the web to benefit the training of VDER models. The collected
dataset, named DocumentNet, does not depend on specific document types or
entity sets, making it universally applicable to all VDER tasks. The current
DocumentNet consists of 30M documents spanning nearly 400 document types
organized in a four-level ontology. Experiments on a set of broadly adopted
VDER tasks show significant improvements when DocumentNet is incorporated into
the pre-training for both classic and few-shot learning settings. With the
recent emergence of large language models (LLMs), DocumentNet provides a large
data source to extend their multi-modal capabilities for VDER.
Related papers
- DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models [66.91204604417912]
This study aims to enhance generalizability of small VDU models by distilling knowledge from LLMs.
We present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge.
Experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach.
arXiv Detail & Related papers (2024-10-04T00:53:32Z)
- BuDDIE: A Business Document Dataset for Multi-task Information Extraction [18.440587946049845]
BuDDIE is the first multi-task dataset of 1,665 real-world business documents.
Our dataset consists of publicly available business entity documents from US state government websites.
arXiv Detail & Related papers (2024-04-05T10:26:42Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- IncDSI: Incrementally Updatable Document Retrieval [35.5697863674097]
IncDSI is a method to add documents in real time without retraining the model on the entire dataset.
We formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters.
Our approach is competitive with re-training the model on the whole dataset.
arXiv Detail & Related papers (2023-07-19T07:20:30Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Timestamping Documents and Beliefs [1.4467794332678539]
Document dating is a challenging problem which requires inference over the temporal structure of the document.
In this paper we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach.
We also propose AD3: Attentive Deep Document Dater, an attention-based document dating system.
arXiv Detail & Related papers (2021-06-09T02:12:18Z)
- Identifying Documents In-Scope of a Collection from Web Archives [37.34941845795198]
We study both machine learning and deep learning models and "bag of words" (BoW) features extracted from the entire document or from specific portions of the document.
We focus our evaluation on three datasets that we created from three different Web archives.
Our experimental results show that the BoW classifiers that focus only on specific portions of the documents (rather than the full text) outperform all compared methods on all three datasets.
arXiv Detail & Related papers (2020-09-02T16:22:23Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)