MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document
Analysis
- URL: http://arxiv.org/abs/2107.00396v1
- Date: Thu, 1 Jul 2021 12:14:17 GMT
- Authors: Konstantin Bulatov, Ekaterina Emelianova, Daniil Tropin, Natalya
Skoryukina, Yulia Chernyshova, Alexander Sheshkus, Sergey Usilin, Zuheng
Ming, Jean-Christophe Burie, Muhammad Muzzamil Luqman, Vladimir V. Arlazarov
- Abstract summary: MIDV-2020 consists of 1000 video clips, 2000 scanned images, and 1000 photos of 1000 unique mock identity documents.
With 72409 annotated images in total, the proposed dataset is, as of the date of publication, the largest publicly available identity documents dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identity documents recognition is an important sub-field of document
analysis, which deals with tasks of robust document detection, type
identification, text fields recognition, as well as identity fraud prevention
and document authenticity validation given photos, scans, or video frames of an
identity document capture. A significant amount of research has been published on
this topic in recent years; however, a chief difficulty for such research is
the scarcity of datasets, due to the subject matter being protected by security
requirements. The few datasets of identity documents that are available lack
diversity of document types, capturing conditions, or variability of document
field values. In addition, the published datasets were typically designed only
for a subset of document recognition problems, not for a complex identity
document analysis. In this paper, we present MIDV-2020, a dataset that consists
of 1000 video clips, 2000 scanned images, and 1000 photos of 1000 unique mock
identity documents, each with unique text field values and unique artificially
generated faces, with rich annotation. Baselines are provided for the presented
benchmark dataset on tasks such as document location and identification,
text fields recognition, and face detection. With 72409 annotated images in
total, the proposed dataset is, as of the date of publication, the largest publicly
available identity documents dataset with variable artificially generated data,
and we believe that it will prove invaluable for advancement of the field of
document analysis and recognition. The dataset is available for download at
ftp://smartengines.com/midv-2020 and http://l3i-share.univ-lr.fr.
Related papers
- Synthetic dataset of ID and Travel Document [1.9296797946506603]
This paper presents a new synthetic dataset of ID and travel documents, called SIDTD.
The SIDTD dataset is created to help train and evaluate forged ID document detection systems.
(2024-01-03T18:06:28Z)
- FATURA: A Multi-Layout Invoice Image Dataset for Document Analysis and
Understanding [8.855033708082832]
We introduce FATURA, a pivotal resource for researchers in the field of document analysis and understanding.
FATURA is a highly diverse dataset featuring multi-layout, annotated invoice document images.
We provide comprehensive benchmarks for various document analysis and understanding tasks and conduct experiments under diverse training and evaluation scenarios.
(2023-11-20T15:51:14Z)
- ACID: Abstractive, Content-Based IDs for Document Retrieval with
Language Models [69.86170930261841]
We introduce ACID, in which each document's ID is composed of abstractive keyphrases generated by a large language model.
We show that using ACID improves top-10 and top-20 accuracy by 15.6% and 14.4% (relative).
Our results demonstrate the effectiveness of human-readable, natural-language IDs in generative retrieval with LMs.
(2023-11-14T23:28:36Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
(2023-06-15T08:21:15Z)
- Document Layout Annotation: Database and Benchmark in the Domain of
Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels.
We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration.
The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
(2023-06-12T08:21:50Z)
- Identity Documents Authentication based on Forgery Detection of
Guilloche Pattern [2.606834301724095]
An authentication model for identity documents based on forgery detection of guilloche patterns is proposed.
Experiments are conducted in order to analyze and identify the most suitable parameters to achieve higher authentication performance.
(2022-06-22T11:37:10Z)
- Combining Deep Learning and Reasoning for Address Detection in
Unstructured Text Documents [0.0]
We propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents.
We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images.
(2022-02-07T12:32:00Z)
- DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
(2020-06-01T16:04:30Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
(2020-05-01T17:30:10Z)
- Source Printer Identification from Document Images Acquired using
Smartphone [14.889347839830092]
We propose to learn a single CNN model from the fusion of letter images and their printer-specific noise residuals.
The proposed method achieves 98.42% document classification accuracy using images of the letter 'e' under a 5x2 cross-validation approach.
(2020-03-27T18:59:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.