Related papers: Synthetic dataset of ID and Travel Document

URL: http://arxiv.org/abs/2401.01858v1
Date: Wed, 3 Jan 2024 18:06:28 GMT
Title: Synthetic dataset of ID and Travel Document
Authors: Carlos Boned and Maxime Talarmain and Nabil Ghanmi and Guillaume Chiron and Sanket Biswas and Ahmad Montaser Awal and Oriol Ramos Terrades
Abstract summary: This paper presents a new synthetic dataset of ID and travel documents, called SIDTD. The SIDTD dataset is created to help train and evaluate forged ID document detection systems.
Score: 1.9296797946506603
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: This paper presents a new synthetic dataset of ID and travel documents, called SIDTD. The SIDTD dataset is created to help train and evaluate forged ID document detection systems. Such a dataset has become a necessity because ID documents contain personal information and a public dataset of real documents cannot be released. Moreover, forged documents are scarce compared to legitimate ones, and the way they are generated varies from one fraudster to another, resulting in a class with high intra-class variability. In this paper, we train state-of-the-art models on this dataset and compare their performance to that achieved on larger, but private, datasets. The creation of this dataset will help the document image analysis community progress in the task of ID document verification.

Related papers

SynID: Passport Synthetic Dataset for Presentation Attack Detection [7.1212970088491385]
Increase is driven by several factors, including the rise of remote work, online purchasing, migration, and advancements in synthetic images. This work proposes a new passport dataset generated from a hybrid method that combines synthetic data and open-access information.
arXiv Detail & Related papers (2025-05-12T13:24:54Z)
LLM for Barcodes: Generating Diverse Synthetic Data for Identity Documents [2.697503433221448]
We introduce a new approach to synthetic data generation that uses LLMs to create contextually rich and realistic data without relying on predefined fields. Our approach simplifies the process of dataset creation, eliminating the need for extensive domain knowledge. This scalable, privacy-first solution is a big step forward in advancing machine learning for automated document processing and identity verification.
arXiv Detail & Related papers (2024-11-22T14:21:18Z)
DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models [66.91204604417912]
This study aims to enhance generalizability of small VDU models by distilling knowledge from LLMs. We present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge. Experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach.
arXiv Detail & Related papers (2024-10-04T00:53:32Z)
Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$^2$). GR$^2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training. Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$^2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z)
IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection [25.980165854663145]
IDNet is a benchmark dataset designed to advance privacy-preserving fraud detection efforts. It comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods.
arXiv Detail & Related papers (2024-08-03T07:05:40Z)
DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents [0.0]
Document semantic segmentation can facilitate document analysis tasks, including OCR, form classification, and document editing. Several synthetic datasets have been developed to distinguish handwriting from printed text, but they fall short in class variety and document diversity. We propose the most comprehensive document semantic segmentation pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources. Our customized dataset exhibits superior performance on the NAFSS benchmark, demonstrating its promise as a tool for further research.
arXiv Detail & Related papers (2024-04-30T04:53:10Z)
DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models. The collected dataset, named DocumentNet, does not depend on specific document types or entity sets. Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation [49.940525611640346]
The Document Augmentation for dense Retrieval (DAR) framework augments the representations of documents with their interpolation and perturbation. We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z)
MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document Analysis [48.35030471041193]
MIDV-2020 consists of 1000 video clips, 2000 scanned images, and 1000 photos of 1000 unique mock identity documents. With 72409 annotated images in total, the proposed dataset is, as of the date of publication, the largest publicly available identity document dataset.
arXiv Detail & Related papers (2021-07-01T12:14:17Z)
SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level. We introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.