Radically Lower Data-Labeling Costs for Visually Rich Document
Extraction Models
- URL: http://arxiv.org/abs/2210.16391v1
- Date: Fri, 28 Oct 2022 20:10:16 GMT
- Title: Radically Lower Data-Labeling Costs for Visually Rich Document
Extraction Models
- Authors: Yichao Zhou, James B. Wendt, Navneet Potti, Jing Xie, Sandeep Tata
- Abstract summary: We propose Selective Labeling to simplify the labeling task.
We show through experiments that selective labeling can reduce the cost of acquiring labeled data by $10\times$ with a negligible loss in accuracy.
- Score: 13.16696804867477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key bottleneck in building automatic extraction models for visually rich
documents like invoices is the cost of acquiring the several thousand
high-quality labeled documents that are needed to train a model with acceptable
accuracy. We propose Selective Labeling to simplify the labeling task to
provide "yes/no" labels for candidate extractions predicted by a model trained
on partially labeled documents. We combine this with a custom active learning
strategy to find the predictions that the model is most uncertain about. We
show through experiments on document types drawn from 3 different domains that
selective labeling can reduce the cost of acquiring labeled data by $10\times$
with a negligible loss in accuracy.
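The abstract's core idea, asking humans only to verify the candidate extractions the model is most uncertain about, can be sketched as below. This is an illustrative reconstruction, not the authors' code: the field names, confidence scores, and the `select_for_verification` helper are hypothetical, and entropy-based uncertainty is just one plausible choice of acquisition function.

```python
import math

def entropy(p):
    """Binary entropy of a model confidence score (peaks at p = 0.5)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_for_verification(candidates, budget):
    """Pick the `budget` candidate extractions the model is least sure about.

    `candidates` is a list of (field, predicted_value, confidence) tuples;
    the most uncertain ones are routed to a human for a yes/no judgment,
    while the rest can be auto-accepted or auto-rejected.
    """
    ranked = sorted(candidates, key=lambda c: entropy(c[2]), reverse=True)
    return ranked[:budget]

# Hypothetical invoice-field candidates with model confidences.
candidates = [
    ("invoice_date", "2022-10-28", 0.97),
    ("total_amount", "1,204.50", 0.55),
    ("po_number", "PO-8841", 0.62),
]
to_review = select_for_verification(candidates, budget=2)
# The two candidates nearest 50% confidence go to the annotator.
```

The cost saving comes from replacing full-document annotation with these binary verification clicks on a small, targeted subset.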
Related papers
- Auto-Labeling Data for Object Detection [20.557988700343373]
This paper addresses the problem of training standard object detection models without any ground truth labels.
We generate application-specific pseudo "ground truth" labels using vision-language foundation models.
We find that our approach is a viable alternative to standard labeling in that it maintains competitive performance on multiple datasets.
arXiv Detail & Related papers (2025-06-03T01:27:56Z)
- An Efficient Deep Learning-Based Approach to Automating Invoice Document Validation [0.0]
We propose to automate the validation of machine written invoices using document layout analysis and object detection techniques.
We introduce a novel dataset consisting of manually annotated real-world invoices and a multi-criteria validation process.
arXiv Detail & Related papers (2025-03-15T21:33:00Z)
- Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation [0.2302001830524133]
We propose Task Aware Instruction-based Labelling (TAIL), a method for synthetic label generation in unlabeled VRD corpora.
We fine-tune a multimodal Visually Rich Document Understanding Model (VRDU) on TAIL labels using response-based knowledge distillation.
We show that the resulting model performs on par with or better than state-of-the-art LMMs on the internal expense documents of a large multinational organization.
arXiv Detail & Related papers (2024-11-22T14:16:09Z)
- Fine-tuning Vision Classifiers On A Budget [1.688687464836377]
We show that using a simple naive-Bayes model to estimate the true labels allows us to label more data on a fixed budget without compromising label or fine-tuning quality.
We present experiments on a dataset of industrial images that demonstrates that our method, called Ground Truth Extension (GTX), enables fine-tuning ML models using fewer human labels.
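The naive-Bayes label estimation mentioned above can be illustrated with a minimal sketch. This is a generic reconstruction of the idea, not the GTX implementation: the annotator names, per-annotator accuracies, and the `naive_bayes_label` helper are hypothetical, and annotators are (naively) assumed to err independently and uniformly over the wrong classes.

```python
def naive_bayes_label(votes, accuracy, classes=("ok", "defect"), prior=0.5):
    """Estimate the true label of one item from noisy annotator votes.

    `votes` maps annotator -> chosen label; `accuracy` maps annotator ->
    estimated probability of labeling correctly. Returns a normalized
    posterior over `classes`.
    """
    scores = {}
    for c in classes:
        p = prior
        for worker, label in votes.items():
            acc = accuracy[worker]
            if label == c:
                p *= acc  # this vote agrees with hypothesis c
            else:
                p *= (1 - acc) / (len(classes) - 1)  # error, spread uniformly
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Two fairly reliable annotators say "defect"; one weak one says "ok".
posterior = naive_bayes_label(
    {"w1": "defect", "w2": "defect", "w3": "ok"},
    {"w1": 0.9, "w2": 0.8, "w3": 0.6},
)
```

Because the posterior weighs each vote by the annotator's estimated reliability, fewer redundant human labels are needed per item than with plain majority voting, which is the budget saving the entry describes.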
arXiv Detail & Related papers (2024-09-30T17:54:38Z)
- Large Language Model-guided Document Selection [23.673690115025913]
Large Language Model (LLM) pre-training consumes an ever-growing compute budget.
Recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs.
We explore a promising direction for scalable general-domain document selection.
arXiv Detail & Related papers (2024-06-07T04:52:46Z)
- ScarceNet: Animal Pose Estimation with Scarce Annotations [74.48263583706712]
ScarceNet is a pseudo label-based approach to generate artificial labels for the unlabeled images.
We evaluate our approach on the challenging AP-10K dataset, where our approach outperforms existing semi-supervised approaches by a large margin.
arXiv Detail & Related papers (2023-03-27T09:15:53Z)
- Ground Truth Inference for Weakly Supervised Entity Matching [76.6732856489872]
We propose a simple but powerful labeling model for weak supervision tasks.
We then tailor the labeling model specifically to the task of entity matching.
We show that our labeling model results in a 9% higher F1 score on average than the best existing method.
arXiv Detail & Related papers (2022-11-13T17:57:07Z)
- Contextual Active Model Selection [10.925932167673764]
We present an approach to actively select pre-trained models while minimizing labeling costs.
The objective is to adaptively select the best model to make a prediction while limiting label requests.
We propose CAMS, a contextual active model selection algorithm that relies on two novel components.
arXiv Detail & Related papers (2022-07-13T08:22:22Z)
- Eliciting and Learning with Soft Labels from Every Annotator [31.10635260890126]
We focus on efficiently eliciting soft labels from individual annotators.
We demonstrate that learning with our labels achieves comparable model performance to prior approaches.
arXiv Detail & Related papers (2022-07-02T12:03:00Z)
- Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition [98.25592165484737]
We propose a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL).
CMPL achieves $17.6\%$ and $25.1\%$ Top-1 accuracy on Kinetics-400 and UCF-101 using only the RGB modality and $1\%$ labeled data, respectively.
arXiv Detail & Related papers (2021-12-17T18:59:41Z)
- Learning with Noisy Labels by Targeted Relabeling [52.0329205268734]
Crowdsourcing platforms are often used to collect datasets for training deep neural networks.
We propose an approach which reserves a fraction of annotations to explicitly relabel highly probable labeling errors.
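The reserve-and-relabel idea can be sketched as a simple budget split. This is a hedged illustration, not the paper's algorithm: the `targeted_relabel` helper, the agreement scores, and the 20% reserve fraction are all hypothetical, with low inter-annotator agreement standing in for "highly probable labeling error".

```python
def targeted_relabel(items, reserve_frac=0.2):
    """Flag the least-agreed-upon items for a second round of labeling.

    `items` is a list of (item_id, agreement) pairs, where agreement is
    the fraction of initial annotators who chose the majority label.
    The reserved share of the budget goes to the items most likely
    to be mislabeled, i.e. those with the lowest agreement.
    """
    n_relabel = max(1, int(len(items) * reserve_frac))
    by_risk = sorted(items, key=lambda it: it[1])  # lowest agreement first
    return [item_id for item_id, _ in by_risk[:n_relabel]]

flagged = targeted_relabel(
    [("img1", 1.0), ("img2", 0.4), ("img3", 0.67), ("img4", 0.95), ("img5", 0.5)]
)
```

Spending part of the budget this way trades breadth (labeling more new items) for depth (fixing the labels most likely to hurt training).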
arXiv Detail & Related papers (2021-10-15T20:37:29Z)
- Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets [90.61266099147053]
We investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images.
We propose modifications and best practices aimed at minimizing human labeling effort.
Simulated experiments on a 125k-image subset of ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average.
arXiv Detail & Related papers (2021-04-26T16:29:32Z)
- Active Learning from Crowd in Document Screening [76.9545252341746]
We focus on building a set of machine learning classifiers that evaluate documents, and then screen them efficiently.
We propose a multi-label active learning screening specific sampling technique -- objective-aware sampling.
We demonstrate that objective-aware sampling significantly outperforms the state of the art active learning sampling strategies.
arXiv Detail & Related papers (2020-11-11T16:17:28Z)
- Active Learning for Noisy Data Streams Using Weak and Strong Labelers [3.9370369973510746]
We consider a novel weak and strong labeler problem inspired by humans' natural ability for labeling.
We propose an on-line active learning algorithm that consists of four steps: filtering, adding diversity, informative sample selection, and labeler selection.
We derive a decision function that measures the information gain by combining the informativeness of individual samples and model confidence.
arXiv Detail & Related papers (2020-10-27T09:18:35Z)
- Few-shot Learning for Multi-label Intent Detection [59.66787898744991]
State-of-the-art work estimates label-instance relevance scores and uses a threshold to select multiple associated intent labels.
Experiments on two datasets show that the proposed model significantly outperforms strong baselines in both one-shot and five-shot settings.
arXiv Detail & Related papers (2020-10-11T14:42:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.