DoSA: A System to Accelerate Annotations on Business Documents with
Human-in-the-Loop
- URL: http://arxiv.org/abs/2211.04934v1
- Date: Wed, 9 Nov 2022 15:04:07 GMT
- Title: DoSA: A System to Accelerate Annotations on Business Documents with
Human-in-the-Loop
- Authors: Neelesh K Shukla, Msp Raja, Raghu Katikeri, Amit Vaid
- Abstract summary: DoSA (Document Specific Automated Annotations) helps annotators generate initial annotations automatically using our novel bootstrap approach.
An open-source, ready-to-use implementation is made available on GitHub.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Business documents come in a variety of structures, formats, and
information needs, which makes information extraction a challenging task.
Given these variations, a document-generic model that works well across all
document types and use cases seems far-fetched; document-specific models, in
turn, need customized document-specific labels. We introduce DoSA (Document
Specific Automated Annotations), which helps annotators generate initial
annotations automatically through a novel bootstrap approach that leverages
document-generic datasets and models. These initial annotations can then be
reviewed by a human for correctness. An initial document-specific model can
be trained, and its inferences used as feedback to generate further automated
annotations. These, in turn, can be reviewed by a human-in-the-loop for
correctness, and a new, improved model can be trained using the current model
as the pre-trained model before the next iteration. In this paper, our scope
is limited to form-like documents due to the limited availability of generic
annotated datasets, but the idea can be extended to a variety of other
documents as more datasets are built. An open-source, ready-to-use
implementation is available on GitHub: https://github.com/neeleshkshukla/DoSA.
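To make the iterative bootstrap concrete, here is a minimal sketch of the human-in-the-loop cycle the abstract describes. All function names (generate_initial_annotations, human_review, train_model, run_inference) are hypothetical placeholders for illustration, not the actual DoSA API.

```python
# Minimal sketch of the DoSA-style human-in-the-loop bootstrap loop.
# All function names below are hypothetical placeholders, not the DoSA API.

def generate_initial_annotations(documents, generic_model):
    """Bootstrap: propose document-specific labels using a document-generic model."""
    return [generic_model(doc) for doc in documents]

def human_review(annotations):
    """Placeholder: an annotator accepts or corrects each proposed annotation."""
    return annotations  # corrected annotations would be returned here

def train_model(annotations, pretrained=None):
    """Placeholder: fine-tune a document-specific model, optionally warm-started."""
    ...

def run_inference(model, documents):
    """Placeholder: use the current model to propose annotations on the documents."""
    ...

def bootstrap(documents, generic_model, iterations=3):
    # 1. The generic model proposes initial annotations; a human reviews them.
    annotations = human_review(generate_initial_annotations(documents, generic_model))
    model = train_model(annotations)
    for _ in range(iterations):
        # 2. The current model's inferences become the next round of proposals.
        proposals = run_inference(model, documents)
        annotations = human_review(proposals)
        # 3. Retrain, using the current model as the pre-trained starting point.
        model = train_model(annotations, pretrained=model)
    return model
```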
Related papers
- Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents [31.434507306952458]
We propose KNN-former, which incorporates a new kind of bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities.
We also use combinatorial matching to address the one-to-one mapping property that exists in many documents.
Our method is highly efficient compared to existing approaches in terms of the number of trainable parameters.
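As a rough illustration of the KNN attention bias idea, the sketch below builds an additive mask from the KNN graph of entity positions and applies it to attention scores. It is an assumption-laden reconstruction (PyTorch, hard masking of non-neighbors, bounding-box centers as positions), not the authors' code.

```python
import torch

def knn_attention_mask(centers: torch.Tensor, k: int) -> torch.Tensor:
    """Additive attention mask from the KNN graph of entity positions.

    centers: (n, 2) bounding-box centers of n document entities (k < n).
    Returns (n, n): 0.0 where column j is among row i's k nearest
    neighbors (self included), -inf elsewhere.
    """
    dist = torch.cdist(centers, centers)               # (n, n) pairwise distances
    knn_idx = dist.topk(k + 1, largest=False).indices  # +1 because self is nearest
    mask = torch.full_like(dist, float("-inf"))
    mask.scatter_(1, knn_idx, 0.0)                     # unmask self + k neighbors
    return mask

# Usage inside single-head attention over entity embeddings (q, kmat: (n, d)):
# scores = (q @ kmat.T) / d ** 0.5 + knn_attention_mask(centers, k=8)
# attn = scores.softmax(dim=-1)
```

A learnable distance-dependent bias could replace the hard -inf mask; the paper's exact formulation may differ.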
arXiv Detail & Related papers (2024-05-08T10:10:38Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents, which have rich structure.
We propose PDFTriage, which enables models to retrieve context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- IncDSI: Incrementally Updatable Document Retrieval [35.5697863674097]
IncDSI is a method to add documents in real time without retraining the model on the entire dataset.
We formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters.
Our approach is competitive with re-training the model on the whole dataset.
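The constrained-optimization formulation can be illustrated with a toy version: optimize only a new docid embedding so the new document wins retrieval for its query by a margin while staying close to its initialization, with everything else frozen. The shapes, the hinge relaxation, and the function name below are assumptions for illustration, not the IncDSI implementation.

```python
import torch

def add_document(W: torch.Tensor, q_new: torch.Tensor,
                 margin: float = 0.1, steps: int = 100, lr: float = 0.1) -> torch.Tensor:
    """Toy sketch: add one document to a DSI-style retriever without retraining.

    W: (num_docs, d) frozen docid classifier weights of the existing index.
    q_new: (d,) embedding of a query that should retrieve the new document.
    Returns a new weight row v such that v @ q_new beats every existing
    document's score by `margin`, while v stays close to its initialization.
    """
    v = q_new.clone().requires_grad_(True)   # initialize from the query itself
    opt = torch.optim.Adam([v], lr=lr)
    rival = (W @ q_new).max()                # best existing score (constant)
    for _ in range(steps):
        opt.zero_grad()
        # Hinge term enforces the retrieval constraint; the L2 term keeps the
        # change minimal, mirroring "minimal changes to network parameters".
        loss = torch.relu(margin + rival - v @ q_new) + 0.01 * (v - q_new).pow(2).sum()
        loss.backward()
        opt.step()
    return v.detach()

# The new row is appended to W; nothing else in the model changes:
# W = torch.cat([W, add_document(W, q_new).unsqueeze(0)], dim=0)
```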
arXiv Detail & Related papers (2023-07-19T07:20:30Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method for collecting massive-scale, weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model that handles different document formats within a single architecture.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Does Recommend-Revise Produce Reliable Annotations? An Analysis on Missing Instances in DocRED [60.39125850987604]
We show that the recommend-revise scheme results in false negative samples and an obvious bias towards popular entities and relations.
The relabeled dataset is released to serve as a more reliable test set for document RE models.
arXiv Detail & Related papers (2022-04-17T11:29:01Z)
- Synthetic Document Generator for Annotation-free Layout Recognition [15.657295650492948]
We describe a synthetic document generator that automatically produces realistic documents with labels for spatial positions, extents and categories of layout elements.
We empirically illustrate that a deep layout detection model trained purely on the synthetic documents can match the performance of a model that uses real documents.
arXiv Detail & Related papers (2021-11-11T01:58:44Z)
- DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z)