Synthetic dataset of ID and Travel Document
- URL: http://arxiv.org/abs/2401.01858v1
- Date: Wed, 3 Jan 2024 18:06:28 GMT
- Title: Synthetic dataset of ID and Travel Document
- Authors: Carlos Boned, Maxime Talarmain, Nabil Ghanmi, Guillaume Chiron, Sanket Biswas, Ahmad Montaser Awal, Oriol Ramos Terrades
- Abstract summary: This paper presents a new synthetic dataset of ID and travel documents, called SIDTD.
The SIDTD dataset was created to help train and evaluate systems that detect forged ID documents.
- Score: 1.9296797946506603
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents a new synthetic dataset of ID and travel documents,
called SIDTD. The SIDTD dataset was created to help train and evaluate
systems that detect forged ID documents. Such a dataset has become a necessity because
ID documents contain personal information, so a public dataset of real
documents cannot be released. Moreover, forged documents are scarce compared
to legitimate ones, and the way they are produced varies from one fraudster to
another, resulting in a class with high intra-class variability. In this paper we
train state-of-the-art models on this dataset and compare their performance to that
achieved on larger, but private, datasets. The creation of this
dataset will help the document image analysis community make progress on the task
of ID document verification.
Related papers
- Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents [31.434507306952458]
We propose KNN-former, which incorporates a new kind of bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities.
We also use combinatorial matching to address the one-to-one mapping property that exists in many documents.
Our method is highly efficient compared to existing approaches in terms of the number of trainable parameters.
arXiv Detail & Related papers (2024-05-08T10:10:38Z)
- DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents [0.0]
Document semantic segmentation can facilitate document analysis tasks, including OCR, form classification, and document editing.
Several synthetic datasets have been developed to distinguish handwriting from printed text, but they fall short in class variety and document diversity.
We propose the most comprehensive document semantic segmentation pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources.
Our customized dataset exhibits superior performance on the NAFSS benchmark, making it a promising tool for further research.
arXiv Detail & Related papers (2024-04-30T04:53:10Z)
- ACID: Abstractive, Content-Based IDs for Document Retrieval with Language Models [69.86170930261841]
We introduce ACID, in which each document's ID is composed of abstractive keyphrases generated by a large language model.
We show that using ACID improves top-10 and top-20 accuracy by 15.6% and 14.4% (relative).
Our results demonstrate the effectiveness of human-readable, natural-language IDs in generative retrieval with LMs.
arXiv Detail & Related papers (2023-11-14T23:28:36Z)
- IncDSI: Incrementally Updatable Document Retrieval [32.89218578877908]
IncDSI is a method to add documents in real time without retraining the model on the entire dataset.
We formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters.
Our approach is competitive with re-training the model on the whole dataset.
arXiv Detail & Related papers (2023-07-19T07:20:30Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and a system demonstration verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation [49.940525611640346]
The Document Augmentation for dense Retrieval (DAR) framework augments the representations of documents with their interpolations and perturbations.
We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z)
- MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document Analysis [48.35030471041193]
MIDV-2020 consists of 1000 video clips, 2000 scanned images, and 1000 photos of 1000 unique mock identity documents.
With 72409 annotated images in total, the proposed dataset is, as of its date of publication, the largest publicly available identity document dataset.
arXiv Detail & Related papers (2021-07-01T12:14:17Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.