LLM for Barcodes: Generating Diverse Synthetic Data for Identity Documents
- URL: http://arxiv.org/abs/2411.14962v1
- Date: Fri, 22 Nov 2024 14:21:18 GMT
- Title: LLM for Barcodes: Generating Diverse Synthetic Data for Identity Documents
- Authors: Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Karan Gupta, Priyaranjan Pattnayak,
- Abstract summary: We introduce a new approach to synthetic data generation that uses LLMs to create contextually rich and realistic data without relying on predefined field.
Our approach simplifies the process of dataset creation, eliminating the need for extensive domain knowledge.
This scalable, privacy-first solution is a big step forward in advancing machine learning for automated document processing and identity verification.
- Score: 2.697503433221448
- License:
- Abstract: Accurate barcode detection and decoding in Identity documents is crucial for applications like security, healthcare, and education, where reliable data extraction and verification are essential. However, building robust detection models is challenging due to the lack of diverse, realistic datasets an issue often tied to privacy concerns and the wide variety of document formats. Traditional tools like Faker rely on predefined templates, making them less effective for capturing the complexity of real-world identity documents. In this paper, we introduce a new approach to synthetic data generation that uses LLMs to create contextually rich and realistic data without relying on predefined field. Using the vast knowledge LLMs have about different documents and content, our method creates data that reflects the variety found in real identity documents. This data is then encoded into barcode and overlayed on templates for documents such as Driver's licenses, Insurance cards, Student IDs. Our approach simplifies the process of dataset creation, eliminating the need for extensive domain knowledge or predefined fields. Compared to traditional methods like Faker, data generated by LLM demonstrates greater diversity and contextual relevance, leading to improved performance in barcode detection models. This scalable, privacy-first solution is a big step forward in advancing machine learning for automated document processing and identity verification.
Related papers
- DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models [66.91204604417912]
This study aims to enhance generalizability of small VDU models by distilling knowledge from LLMs.
We present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge.
Experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach.
arXiv Detail & Related papers (2024-10-04T00:53:32Z) - Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$2$)
GR$2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z) - DocXPand-25k: a large and diverse benchmark dataset for identity documents analysis [0.0]
Identity document (ID) image analysis has become essential for many online services, like bank account opening or insurance subscription.
There are only a few available to benchmark ID analysis methods, mainly because of privacy restrictions, security requirements and legal reasons.
We present the DocXPand-25k dataset, which consists of 24,994 richly labeled IDs images.
arXiv Detail & Related papers (2024-07-30T08:55:27Z) - Synthetic dataset of ID and Travel Document [1.9296797946506603]
This paper presents a new synthetic dataset of ID and travel documents, called SIDTD.
The SIDTD dataset is created to help training and evaluating forged ID documents detection systems.
arXiv Detail & Related papers (2024-01-03T18:06:28Z) - On Task-personalized Multimodal Few-shot Learning for Visually-rich
Document Entity Retrieval [59.25292920967197]
Few-shot document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z) - DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z) - Identity Documents Authentication based on Forgery Detection of
Guilloche Pattern [2.606834301724095]
An authentication model for identity documents based on forgery detection of guilloche patterns is proposed.
Experiments are conducted in order to analyze and identify the most proper parameters to achieve higher authentication performance.
arXiv Detail & Related papers (2022-06-22T11:37:10Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z) - MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document
Analysis [48.35030471041193]
MIDV-2020 consists of 1000 video clips, 2000 scanned images, and 1000 photos of 1000 unique mock identity documents.
With 72409 annotated images in total, to the date of publication the proposed dataset is the largest publicly available identity documents dataset.
arXiv Detail & Related papers (2021-07-01T12:14:17Z) - Generating Synthetic Handwritten Historical Documents With OCR
Constrained GANs [2.3808546906079178]
We present a framework to generate synthetic historical documents with precise ground truth using nothing more than a collection of unlabeled historical images.
We demonstrate a high-quality synthesis that makes it possible to generate large labeled historical document datasets with precise ground truth.
arXiv Detail & Related papers (2021-03-15T09:39:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.