M$^{6}$Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout,
Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout
Analysis
- URL: http://arxiv.org/abs/2305.08719v2
- Date: Sun, 21 May 2023 14:22:39 GMT
- Title: M$^{6}$Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout,
Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout
Analysis
- Authors: Hiuyi Cheng, Peirong Zhang, Sihang Wu, Jiaxin Zhang, Qiyuan Zhu,
Zecheng Xie, Jing Li, Kai Ding, and Lianwen Jin
- Abstract summary: This paper introduces a large and diverse document layout analysis dataset called $M^{6}Doc$.
We propose a transformer-based document layout analysis method called TransDLANet.
We conduct a comprehensive evaluation of $M^{6}Doc$ with various layout analysis methods and demonstrate its effectiveness.
- Score: 23.924144353511984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document layout analysis is a crucial prerequisite for document
understanding, including document retrieval and conversion. Most public
datasets currently contain only PDF documents and lack realistic documents.
Models trained on these datasets may not generalize well to real-world
scenarios. Therefore, this paper introduces a large and diverse document layout
analysis dataset called $M^{6}Doc$. The $M^6$ designation represents six
properties: (1) Multi-Format (including scanned, photographed, and PDF
documents); (2) Multi-Type (such as scientific articles, textbooks, books, test
papers, magazines, newspapers, and notes); (3) Multi-Layout (rectangular,
Manhattan, non-Manhattan, and multi-column Manhattan); (4) Multi-Language
(Chinese and English); (5) Multi-Annotation Category (74 types of annotation
labels with 237,116 annotation instances in 9,080 manually annotated pages);
and (6) Modern documents. Additionally, we propose a transformer-based document
layout analysis method called TransDLANet, which leverages an adaptive element
matching mechanism that enables query embedding to better match ground truth to
improve recall, and constructs a segmentation branch for more precise document
image instance segmentation. We conduct a comprehensive evaluation of
$M^{6}Doc$ with various layout analysis methods and demonstrate its
effectiveness. TransDLANet achieves state-of-the-art performance on $M^{6}Doc$
with 64.5% mAP. The $M^{6}Doc$ dataset will be available at
https://github.com/HCIILAB/M6Doc.
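The counts quoted in the abstract give a rough sense of how densely the pages are annotated. The following is a minimal sketch that uses only the three figures stated above (9,080 pages, 237,116 instances, 74 label types); the constant names and the `density` helper are illustrative and are not part of the dataset or its tooling. The results are plain averages and say nothing about how instances are actually distributed across pages or categories.

```python
# Figures quoted in the abstract; everything else here is illustrative.
NUM_PAGES = 9_080
NUM_INSTANCES = 237_116
NUM_CATEGORIES = 74


def density(instances: int, units: int) -> float:
    """Return the average number of instances per unit (page or category)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return instances / units


instances_per_page = density(NUM_INSTANCES, NUM_PAGES)
instances_per_category = density(NUM_INSTANCES, NUM_CATEGORIES)

print(f"Average annotation instances per page: {instances_per_page:.1f}")
print(f"Average annotation instances per category: {instances_per_category:.0f}")
```

On average this comes to about 26 annotated elements per page, which is consistent with the fine-grained, 74-label annotation scheme the abstract describes.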
Related papers
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts.
M3DocRAG can efficiently handle single or many documents while preserving visual information.
We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
arXiv Detail & Related papers (2024-11-07T18:29:38Z)
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- FATURA: A Multi-Layout Invoice Image Dataset for Document Analysis and Understanding [8.855033708082832]
We introduce FATURA, a pivotal resource for researchers in the field of document analysis and understanding.
FATURA is a highly diverse dataset featuring multi-layout, annotated invoice document images.
We provide comprehensive benchmarks for various document analysis and understanding tasks and conduct experiments under diverse training and evaluation scenarios.
arXiv Detail & Related papers (2023-11-20T15:51:14Z)
- A Multi-Modal Multilingual Benchmark for Document Image Classification [21.7518357653137]
We introduce two newly curated multilingual datasets WIKI-DOC and MULTIEUR-DOCLEX.
We study popular visually-rich document understanding (Document AI) models in previously untested settings in document image classification.
Experimental results show limitations of multilingual Document AI models on cross-lingual transfer across typologically distant languages.
arXiv Detail & Related papers (2023-10-25T04:35:06Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- Vision Grid Transformer for Document Layout Analysis [26.62857594455592]
We present VGT, a two-stream Vision Grid Transformer, in which Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding.
Experiment results have illustrated that the proposed VGT model achieves new state-of-the-art results on document layout analysis tasks.
arXiv Detail & Related papers (2023-08-29T02:09:56Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis [2.9923891863939938]
Document layout analysis is a key requirement for high-quality PDF document conversion.
Deep-learning models have proven to be very effective at layout detection and segmentation.
We present DocLayNet, a new, publicly available, document-layout annotation dataset.
arXiv Detail & Related papers (2022-06-02T14:25:12Z)
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval [87.11836738011007]
This paper proposes a multi-view document representation learning framework.
It aims to produce multi-view embeddings to represent documents and enforce them to align with different queries.
Experiments show our method outperforms recent works and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-03-16T03:36:38Z)
- DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.