Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation
- URL: http://arxiv.org/abs/2411.14957v2
- Date: Mon, 25 Nov 2024 09:47:20 GMT
- Title: Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation
- Authors: Aniket Bhattacharyya, Anurag Tripathi
- Abstract summary: We propose Task Aware Instruction-based Labelling (TAIL), a method for synthetic label generation in VRD corpuses without labels.
We fine-tune a multimodal Visually Rich Document Understanding Model (VRDU) on TAIL labels using response-based knowledge distillation.
We show that the resulting model performs on par with or better than a state-of-the-art LMM on the internal expense documents of a large multinational organization.
- Score: 0.2302001830524133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Invoices and receipts submitted by employees are visually rich documents (VRDs) with textual, visual, and layout information. To protect against the risk of fraud and abuse, it is crucial for organizations to efficiently extract the desired information from submitted receipts. This supports the assessment of key factors such as the appropriateness of the expense claim, adherence to spending and transaction policies, and the validity of the receipt, as well as downstream anomaly detection at various levels. These documents are heterogeneous, spanning multiple formats and languages, uploaded with varying image quality, and often lacking ground truth labels for the efficient training of models. In this paper we propose Task Aware Instruction-based Labelling (TAIL), a method for synthetic label generation in VRD corpora without labels, and fine-tune a multimodal Visually Rich Document Understanding Model (VRDU) on TAIL labels using response-based knowledge distillation, without using the teacher model's weights or training dataset, to conditionally generate annotations in the appropriate format. Using a benchmark external dataset where ground truth labels are available, we demonstrate through empirical studies the conditions under which our approach performs on par with Claude 3 Sonnet. We then show that the resulting model performs on par with or better than the state-of-the-art LMM (large multimodal model) Claude 3 Sonnet on the internal expense documents of a large multinational organization, while being 85% less costly and ~5X faster, and that it outperforms layout-aware baselines by more than 10% in Average Normalized Levenshtein Similarity (ANLS) scores owing to its ability to reason over and extract information from rare formats. Finally, we illustrate the use of our approach in overpayment prevention.
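The headline metric in the abstract is ANLS. As a quick illustration, the sketch below computes Average Normalized Levenshtein Similarity for a batch of extracted field values; the 0.5 threshold and the max-length normalization follow the common ANLS definition used in document-understanding benchmarks, and the function names and example strings are hypothetical rather than taken from the paper.

```python
# Minimal ANLS sketch (assumed definition: score = 1 - NL if NL < 0.5, else 0,
# where NL is edit distance normalized by the longer string). Illustrative only.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def anls(predictions: list[str], references: list[str], threshold: float = 0.5) -> float:
    """Average of per-field similarities, zeroing out matches above the NL threshold."""
    scores = []
    for pred, ref in zip(predictions, references):
        pred, ref = pred.strip().lower(), ref.strip().lower()
        nl = levenshtein(pred, ref) / max(len(pred), len(ref), 1)
        scores.append(1.0 - nl if nl < threshold else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical example: an abbreviated vendor name still earns partial credit.
print(anls(["ACME Corp", "12.50"], ["ACME Corporation", "12.50"]))
```

Under this definition a near-miss, such as an abbreviated vendor name, still earns partial credit, which is why ANLS is more forgiving than exact-match accuracy on heterogeneous receipt formats.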
Related papers
- DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion [5.342168661302001]
We propose a novel framework for controllable synthetic document generation using Vision-Language Models (VLMs).
Our approach generates visually plausible and semantically consistent synthetic documents that follow the distribution of an existing source dataset.
We show that our framework achieves on average 87% of the performance of the full real-world dataset.
arXiv Detail & Related papers (2026-02-25T11:52:13Z) - Information Extraction from Visually Rich Documents using LLM-based Organization of Documents into Independent Textual Segments [0.25289250870065627]
Specialized non-LLM NLP-based solutions typically involve training models using both textual and geometric information.
We propose BLOCKIE, a novel LLM-based approach that organizes VRDs into localized, reusable semantic textual segments.
Our approach outperforms the state-of-the-art on public VRD benchmarks by 1-3% in F1 scores.
arXiv Detail & Related papers (2025-05-18T15:49:17Z) - Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance [30.879299174443812]
In this work, we forgo real documents and annotations and use large language models to generate synthetic documents.
Our experiments on the MS MARCO and BEIR benchmarks show that our proposed approach outperforms conventional training with InfoNCE by a large margin.
arXiv Detail & Related papers (2025-03-29T22:33:22Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Zero-Shot Learning Over Large Output Spaces: Utilizing Indirect Knowledge Extraction from Large Language Models [3.908992369351976]
Extreme Zero-shot XMC (EZ-XMC) is a special setting of XMC wherein no supervision is provided.
Traditional state-of-the-art methods extract pseudo labels from the document title or segments.
We propose a framework to train a small bi-encoder model via feedback from a large language model (LLM).
arXiv Detail & Related papers (2024-06-13T16:26:37Z) - Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels [61.97359362447732]
Learning from noisy labels is an important and long-standing problem in machine learning for real applications.
In this paper, we reformulate the label-noise problem from a generative-model perspective.
Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets.
arXiv Detail & Related papers (2023-05-31T03:01:36Z) - Ground Truth Inference for Weakly Supervised Entity Matching [76.6732856489872]
We propose a simple but powerful labeling model for weak supervision tasks.
We then tailor the labeling model specifically to the task of entity matching.
We show that our labeling model results in a 9% higher F1 score on average than the best existing method.
arXiv Detail & Related papers (2022-11-13T17:57:07Z) - Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models [13.16696804867477]
We propose Selective Labeling to simplify the labeling task.
We show through experiments that selective labeling can reduce the cost of acquiring labeled data by 10x with a negligible loss in accuracy.
arXiv Detail & Related papers (2022-10-28T20:10:16Z) - Eliciting and Learning with Soft Labels from Every Annotator [31.10635260890126]
We focus on efficiently eliciting soft labels from individual annotators.
We demonstrate that learning with our labels achieves comparable model performance to prior approaches.
arXiv Detail & Related papers (2022-07-02T12:03:00Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z) - Active Learning for Noisy Data Streams Using Weak and Strong Labelers [3.9370369973510746]
We consider a novel weak and strong labeler problem inspired by humans' natural ability for labeling.
We propose an online active learning algorithm that consists of four steps: filtering, adding diversity, informative sample selection, and labeler selection.
We derive a decision function that measures the information gain by combining the informativeness of individual samples and model confidence.
arXiv Detail & Related papers (2020-10-27T09:18:35Z) - Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z) - An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels [49.036212158261215]
Large-scale Multi-label Text Classification (LMTC) has a wide range of Natural Language Processing (NLP) applications.
Current state-of-the-art LMTC models employ Label-Wise Attention Networks (LWANs).
We show that hierarchical methods based on Probabilistic Label Trees (PLTs) outperform LWANs.
We propose a new state-of-the-art method which combines BERT with LWANs.
arXiv Detail & Related papers (2020-10-04T18:55:47Z) - Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data [61.789797281676606]
We propose a novel meta-learning latent variable approach, called MetaBridge.
It can learn transferable knowledge from a subset of categories with limited labeled data.
It can capture the uncertainty of never-seen categories with unlabeled data.
arXiv Detail & Related papers (2020-06-15T21:31:05Z) - Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models [23.42593796135709]
We study the problem of information extraction from visually rich documents (VRDs).
We present a model that combines the power of large pre-trained language models and graph neural networks to efficiently encode both textual and visual information in business documents.
arXiv Detail & Related papers (2020-05-22T06:04:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.