Radically Lower Data-Labeling Costs for Visually Rich Document
Extraction Models
- URL: http://arxiv.org/abs/2210.16391v1
- Date: Fri, 28 Oct 2022 20:10:16 GMT
- Title: Radically Lower Data-Labeling Costs for Visually Rich Document
Extraction Models
- Authors: Yichao Zhou, James B. Wendt, Navneet Potti, Jing Xie, Sandeep Tata
- Abstract summary: We propose Selective Labeling to simplify the labeling task.
We show through experiments that selective labeling can reduce the cost of acquiring labeled data by $10\times$ with a negligible loss in accuracy.
- Score: 13.16696804867477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key bottleneck in building automatic extraction models for visually rich
documents like invoices is the cost of acquiring the several thousand
high-quality labeled documents that are needed to train a model with acceptable
accuracy. We propose Selective Labeling to simplify the labeling task to
provide "yes/no" labels for candidate extractions predicted by a model trained
on partially labeled documents. We combine this with a custom active learning
strategy to find the predictions that the model is most uncertain about. We
show through experiments on document types drawn from 3 different domains that
selective labeling can reduce the cost of acquiring labeled data by $10\times$
with a negligible loss in accuracy.
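The abstract's core idea, asking humans only to verify the candidate extractions the model is most uncertain about, can be sketched as below. This is an illustrative reconstruction, not the authors' code: the field names, confidence scores, and the `select_for_verification` helper are hypothetical, and entropy-based uncertainty is just one plausible choice of acquisition function.

```python
import math

def entropy(p):
    """Binary entropy of a model confidence score (peaks at p = 0.5)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_for_verification(candidates, budget):
    """Pick the `budget` candidate extractions the model is least sure about.

    `candidates` is a list of (field, predicted_value, confidence) tuples;
    the most uncertain ones are routed to a human for a yes/no judgment,
    while the rest can be auto-accepted or auto-rejected.
    """
    ranked = sorted(candidates, key=lambda c: entropy(c[2]), reverse=True)
    return ranked[:budget]

# Hypothetical invoice-field candidates with model confidences.
candidates = [
    ("invoice_date", "2022-10-28", 0.97),
    ("total_amount", "1,204.50", 0.55),
    ("po_number", "PO-8841", 0.62),
]
to_review = select_for_verification(candidates, budget=2)
# The two candidates nearest 50% confidence go to the annotator.
```

The cost saving comes from replacing full-document annotation with these binary verification clicks on a small, targeted subset.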
Related papers
- Auto-Labeling Data for Object Detection [20.557988700343373]
This paper addresses the problem of training standard object detection models without any ground truth labels.
We generate application-specific pseudo "ground truth" labels using vision-language foundation models.
We find that our approach is a viable alternative to standard labeling in that it maintains competitive performance on multiple datasets.
arXiv Detail & Related papers (2025-06-03T01:27:56Z)
- An Efficient Deep Learning-Based Approach to Automating Invoice Document Validation [0.0]
We propose to automate the validation of machine written invoices using document layout analysis and object detection techniques.
We introduce a novel dataset consisting of manually annotated real-world invoices and a multi-criteria validation process.
arXiv Detail & Related papers (2025-03-15T21:33:00Z)
- Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation [0.2302001830524133]
We propose Task Aware Instruction-based Labelling (TAIL), a method for synthetic label generation in unlabeled VRD corpora.
We fine-tune a multimodal Visually Rich Document Understanding Model (VRDU) on TAIL labels using response-based knowledge distillation.
We show that the resulting model performs on par with or better than state-of-the-art LMMs on the internal expense documents of a large multinational organization.
arXiv Detail & Related papers (2024-11-22T14:16:09Z)
- Fine-tuning Vision Classifiers On A Budget [1.688687464836377]
We show that using a simple naive-Bayes model to estimate the true labels allows us to label more data on a fixed budget without compromising label or fine-tuning quality.
We present experiments on a dataset of industrial images that demonstrates that our method, called Ground Truth Extension (GTX), enables fine-tuning ML models using fewer human labels.
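The naive-Bayes label estimation mentioned above can be illustrated with a minimal sketch. This is a generic reconstruction of the idea, not the GTX implementation: the annotator names, per-annotator accuracies, and the `naive_bayes_label` helper are hypothetical, and annotators are (naively) assumed to err independently and uniformly over the wrong classes.

```python
def naive_bayes_label(votes, accuracy, classes=("ok", "defect"), prior=0.5):
    """Estimate the true label of one item from noisy annotator votes.

    `votes` maps annotator -> chosen label; `accuracy` maps annotator ->
    estimated probability of labeling correctly. Returns a normalized
    posterior over `classes`.
    """
    scores = {}
    for c in classes:
        p = prior
        for worker, label in votes.items():
            acc = accuracy[worker]
            if label == c:
                p *= acc  # this vote agrees with hypothesis c
            else:
                p *= (1 - acc) / (len(classes) - 1)  # error, spread uniformly
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Two fairly reliable annotators say "defect"; one weak one says "ok".
posterior = naive_bayes_label(
    {"w1": "defect", "w2": "defect", "w3": "ok"},
    {"w1": 0.9, "w2": 0.8, "w3": 0.6},
)
```

Because the posterior weighs each vote by the annotator's estimated reliability, fewer redundant human labels are needed per item than with plain majority voting, which is the budget saving the entry describes.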
arXiv Detail & Related papers (2024-09-30T17:54:38Z)
- Large Language Model-guided Document Selection [23.673690115025913]
Large Language Model (LLM) pre-training consumes an ever-growing compute budget.
Recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs.
We explore a promising direction for scalable general-domain document selection.
arXiv Detail & Related papers (2024-06-07T04:52:46Z)
- ScarceNet: Animal Pose Estimation with Scarce Annotations [74.48263583706712]
ScarceNet is a pseudo label-based approach to generate artificial labels for the unlabeled images.
We evaluate our approach on the challenging AP-10K dataset, where our approach outperforms existing semi-supervised approaches by a large margin.
arXiv Detail & Related papers (2023-03-27T09:15:53Z)
- Ground Truth Inference for Weakly Supervised Entity Matching [76.6732856489872]
We propose a simple but powerful labeling model for weak supervision tasks.
We then tailor the labeling model specifically to the task of entity matching.
We show that our labeling model results in a 9% higher F1 score on average than the best existing method.
arXiv Detail & Related papers (2022-11-13T17:57:07Z)
- Contextual Active Model Selection [10.925932167673764]
We present an approach to actively select pre-trained models while minimizing labeling costs.
The objective is to adaptively select the best model to make a prediction while limiting label requests.
We propose CAMS, a contextual active model selection algorithm that relies on two novel components.
arXiv Detail & Related papers (2022-07-13T08:22:22Z)
- Eliciting and Learning with Soft Labels from Every Annotator [31.10635260890126]
We focus on efficiently eliciting soft labels from individual annotators.
We demonstrate that learning with our labels achieves comparable model performance to prior approaches.
arXiv Detail & Related papers (2022-07-02T12:03:00Z)
- Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition [98.25592165484737]
We propose a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL).
CMPL achieves $17.6\%$ and $25.1\%$ Top-1 accuracy on Kinetics-400 and UCF-101 using only the RGB modality and $1\%$ labeled data, respectively.
arXiv Detail & Related papers (2021-12-17T18:59:41Z)
- Learning with Noisy Labels by Targeted Relabeling [52.0329205268734]
Crowdsourcing platforms are often used to collect datasets for training deep neural networks.
We propose an approach which reserves a fraction of annotations to explicitly relabel highly probable labeling errors.
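The reserve-and-relabel idea can be sketched as a simple budget split. This is a hedged illustration, not the paper's algorithm: the `targeted_relabel` helper, the agreement scores, and the 20% reserve fraction are all hypothetical, with low inter-annotator agreement standing in for "highly probable labeling error".

```python
def targeted_relabel(items, reserve_frac=0.2):
    """Flag the least-agreed-upon items for a second round of labeling.

    `items` is a list of (item_id, agreement) pairs, where agreement is
    the fraction of initial annotators who chose the majority label.
    The reserved share of the budget goes to the items most likely
    to be mislabeled, i.e. those with the lowest agreement.
    """
    n_relabel = max(1, int(len(items) * reserve_frac))
    by_risk = sorted(items, key=lambda it: it[1])  # lowest agreement first
    return [item_id for item_id, _ in by_risk[:n_relabel]]

flagged = targeted_relabel(
    [("img1", 1.0), ("img2", 0.4), ("img3", 0.67), ("img4", 0.95), ("img5", 0.5)]
)
```

Spending part of the budget this way trades breadth (labeling more new items) for depth (fixing the labels most likely to hurt training).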
arXiv Detail & Related papers (2021-10-15T20:37:29Z)
- Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets [90.61266099147053]
We investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images.
We propose modifications and best practices aimed at minimizing human labeling effort.
Simulated experiments on a 125k-image subset of ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average.
arXiv Detail & Related papers (2021-04-26T16:29:32Z)
- Active Learning from Crowd in Document Screening [76.9545252341746]
We focus on building a set of machine learning classifiers that evaluate documents, and then screen them efficiently.
We propose a multi-label active learning screening specific sampling technique -- objective-aware sampling.
We demonstrate that objective-aware sampling significantly outperforms the state of the art active learning sampling strategies.
arXiv Detail & Related papers (2020-11-11T16:17:28Z)
- Active Learning for Noisy Data Streams Using Weak and Strong Labelers [3.9370369973510746]
We consider a novel weak and strong labeler problem inspired by humans' natural ability for labeling.
We propose an on-line active learning algorithm that consists of four steps: filtering, adding diversity, informative sample selection, and labeler selection.
We derive a decision function that measures the information gain by combining the informativeness of individual samples and model confidence.
arXiv Detail & Related papers (2020-10-27T09:18:35Z)
- Few-shot Learning for Multi-label Intent Detection [59.66787898744991]
State-of-the-art work estimates label-instance relevance scores and uses a threshold to select multiple associated intent labels.
Experiments on two datasets show that the proposed model significantly outperforms strong baselines in both one-shot and five-shot settings.
arXiv Detail & Related papers (2020-10-11T14:42:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.