MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document
Analysis
- URL: http://arxiv.org/abs/2107.00396v1
- Date: Thu, 1 Jul 2021 12:14:17 GMT
- Authors: Konstantin Bulatov, Ekaterina Emelianova, Daniil Tropin, Natalya
Skoryukina, Yulia Chernyshova, Alexander Sheshkus, Sergey Usilin, Zuheng
Ming, Jean-Christophe Burie, Muhammad Muzzamil Luqman, Vladimir V. Arlazarov
- Abstract summary: MIDV-2020 consists of 1000 video clips, 2000 scanned images, and 1000 photos of 1000 unique mock identity documents.
With 72409 annotated images in total, the proposed dataset was, as of the date of publication, the largest publicly available identity document dataset.
- Score: 48.35030471041193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identity document recognition is an important sub-field of document
analysis, which deals with the tasks of robust document detection, type
identification, text field recognition, as well as identity fraud prevention
and document authenticity validation given photos, scans, or video frames of an
identity document capture. A significant amount of research has been published on
this topic in recent years; however, a chief difficulty for such research is the
scarcity of datasets, because the subject matter is protected by security
requirements. The few available datasets of identity documents lack
diversity of document types, capturing conditions, or variability of document
field values. In addition, the published datasets were typically designed only
for a subset of document recognition problems, not for complex identity
document analysis. In this paper, we present MIDV-2020, a dataset that consists
of 1000 video clips, 2000 scanned images, and 1000 photos of 1000 unique mock
identity documents, each with unique text field values and unique artificially
generated faces, with rich annotation. For the presented benchmark dataset,
baselines are provided for tasks such as document location and identification,
text field recognition, and face detection. With 72409 annotated images in
total, the proposed dataset is, as of the date of publication, the largest publicly
available identity document dataset with variable artificially generated data,
and we believe that it will prove invaluable for the advancement of the field of
document analysis and recognition. The dataset is available for download at
ftp://smartengines.com/midv-2020 and http://l3i-share.univ-lr.fr.
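As a minimal sketch of how the FTP location named in the abstract might be inspected, the snippet below uses only Python's standard library. It assumes the server accepts anonymous logins, which the abstract does not state, and the helper names `split_ftp_url` and `list_dataset` are hypothetical, introduced only for illustration.

```python
from ftplib import FTP
from urllib.parse import urlparse

# Download location quoted in the abstract.
DATASET_URL = "ftp://smartengines.com/midv-2020"


def split_ftp_url(url):
    """Split an ftp:// URL into (host, path); hypothetical helper."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parts.hostname, parts.path or "/"


def list_dataset(url=DATASET_URL, timeout=30):
    """Return the entry names found at the dataset root.

    Assumption: the server permits anonymous login. This is not
    confirmed by the abstract and may not hold.
    """
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # anonymous login (assumed)
        return ftp.nlst(path)
```

Calling `list_dataset()` opens a network connection, so it is left out of the snippet itself; the directory layout it would return is not described in the abstract.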
Related papers
- LLM for Barcodes: Generating Diverse Synthetic Data for Identity Documents [2.697503433221448]
We introduce a new approach to synthetic data generation that uses LLMs to create contextually rich and realistic data without relying on predefined fields.
Our approach simplifies the process of dataset creation, eliminating the need for extensive domain knowledge.
This scalable, privacy-first solution is a big step forward in advancing machine learning for automated document processing and identity verification.
arXiv Detail & Related papers (2024-11-22T14:21:18Z)
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection [25.980165854663145]
IDNet is a benchmark dataset designed to advance privacy-preserving fraud detection efforts.
It comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes.
We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods.
arXiv Detail & Related papers (2024-08-03T07:05:40Z)
- DocXPand-25k: a large and diverse benchmark dataset for identity documents analysis [0.0]
Identity document (ID) image analysis has become essential for many online services, like bank account opening or insurance subscription.
Only a few datasets are available to benchmark ID analysis methods, mainly because of privacy restrictions, security requirements and legal reasons.
We present the DocXPand-25k dataset, which consists of 24,994 richly labeled ID images.
arXiv Detail & Related papers (2024-07-30T08:55:27Z)
- Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format.
We first craft the Wiki-SS dataset, a corpus of 1.3M Wikipedia web page screenshots, to answer the questions from the Natural Questions dataset.
In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
- Synthetic dataset of ID and Travel Document [1.9296797946506603]
This paper presents a new synthetic dataset of ID and travel documents, called SIDTD.
The SIDTD dataset was created to help train and evaluate forged ID document detection systems.
arXiv Detail & Related papers (2024-01-03T18:06:28Z)
- Document Layout Annotation: Database and Benchmark in the Domain of
Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels.
We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration.
The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
arXiv Detail & Related papers (2023-06-12T08:21:50Z)
- DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
- Source Printer Identification from Document Images Acquired using
Smartphone [14.889347839830092]
We propose to learn a single CNN model from the fusion of letter images and their printer-specific noise residuals.
The proposed method achieves 98.42% document classification accuracy using images of the letter 'e' under a 5x2 cross-validation approach.
arXiv Detail & Related papers (2020-03-27T18:59:32Z)