DiT: Self-supervised Pre-training for Document Image Transformer
- URL: http://arxiv.org/abs/2203.02378v1
- Date: Fri, 4 Mar 2022 15:34:46 GMT
- Title: DiT: Self-supervised Pre-training for Document Image Transformer
- Authors: Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei
- Abstract summary: We propose DiT, a self-supervised pre-trained Document Image Transformer model.
We leverage DiT as the backbone network in a variety of vision-based Document AI tasks.
Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results.
- Score: 85.78807512344463
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Image Transformer has recently achieved significant progress for natural
image understanding, either using supervised (ViT, DeiT, etc.) or
self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we
propose DiT, a self-supervised pre-trained Document Image Transformer model
using large-scale unlabeled text images for Document AI tasks, which is
essential since no supervised counterpart exists due to the lack of
human-labeled document images. We leverage DiT as the backbone network in a variety
of vision-based Document AI tasks, including document image classification,
document layout analysis, and table detection. Experimental results show
that the self-supervised pre-trained DiT model achieves new
state-of-the-art results on these downstream tasks, e.g., document image
classification (91.11 $\rightarrow$ 92.69), document layout analysis (91.0
$\rightarrow$ 94.9), and table detection (94.23 $\rightarrow$ 96.55). The code
and pre-trained models are publicly available at \url{https://aka.ms/msdit}.
Related papers
- Vision Grid Transformer for Document Layout Analysis [26.62857594455592]
We present VGT, a two-stream Vision Grid Transformer, in which Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding.
Experimental results show that the proposed VGT model achieves new state-of-the-art results on document layout analysis tasks.
arXiv Detail & Related papers (2023-08-29T02:09:56Z) - DocMAE: Document Image Rectification via Self-supervised Representation Learning [144.44748607192147]
We present DocMAE, a novel self-supervised framework for document image rectification.
We first mask random patches of the background-excluded document images and then reconstruct the missing pixels.
With such a self-supervised learning approach, the network is encouraged to learn the intrinsic structure of deformed documents.
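The masked-patch objective described above can be sketched as follows. This is a minimal, hypothetical illustration of masking random patches and reconstructing only the hidden pixels; the flattened-patch shapes and the tiny MLP stand in for the actual DocMAE architecture, which the abstract does not detail.

```python
import torch
import torch.nn as nn

def random_patch_mask(num_patches: int, mask_ratio: float) -> torch.Tensor:
    """Boolean mask: True marks a patch to hide from the model."""
    num_masked = int(num_patches * mask_ratio)
    perm = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask

patch_dim = 16 * 16 * 3          # 16x16 RGB patches, flattened
num_patches = 196                # e.g. a 224x224 image in 16x16 patches
model = nn.Sequential(           # illustrative stand-in for an encoder/decoder
    nn.Linear(patch_dim, 256), nn.GELU(), nn.Linear(256, patch_dim)
)

patches = torch.randn(num_patches, patch_dim)   # dummy document image patches
mask = random_patch_mask(num_patches, mask_ratio=0.75)

inp = patches.clone()
inp[mask] = 0.0                  # zero out the masked patches
recon = model(inp)
# Reconstruction loss is computed on the masked patches only,
# so the model must infer missing content from visible context.
loss = ((recon[mask] - patches[mask]) ** 2).mean()
loss.backward()
```

Restricting the loss to masked positions is what pushes the network toward learning the intrinsic structure of the document rather than copying visible pixels.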
arXiv Detail & Related papers (2023-04-20T14:27:15Z) - Unifying Vision, Text, and Layout for Universal Document Processing [105.36490575974028]
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z) - LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z) - LiT: Zero-Shot Transfer with Locked-image Text Tuning [68.78877201319811]
"Locked-image Text tuning" (LiT-tuning) teaches a text model to read out good representations from a pre-trained image model for new tasks.
A LiT-tuned model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval.
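The locked-image setup described above can be illustrated with a CLIP-style contrastive step in which the image tower is frozen and only the text tower is trained. The tiny linear encoders, feature sizes, and temperature below are hypothetical placeholders, not the actual LiT models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(2048, 256)   # stand-in for a pre-trained image tower
text_encoder = nn.Linear(512, 256)     # stand-in for the trainable text tower
for p in image_encoder.parameters():
    p.requires_grad = False            # "lock" the image tower

images = torch.randn(8, 2048)          # dummy image features, paired by index
texts = torch.randn(8, 512)            # dummy text features for the same pairs

img = F.normalize(image_encoder(images), dim=-1)
txt = F.normalize(text_encoder(texts), dim=-1)
logits = txt @ img.t() / 0.07          # cosine similarities over a temperature
labels = torch.arange(8)               # the i-th text matches the i-th image
loss = F.cross_entropy(logits, labels) # contrastive matching loss
loss.backward()                        # gradients flow only into the text tower
```

Because the image tower's parameters are frozen, the text model is the one that learns to "read out" the fixed image representations, which is what enables zero-shot transfer to new vision tasks.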
arXiv Detail & Related papers (2021-11-15T18:53:48Z) - Multiple Document Datasets Pre-training Improves Text Line Detection With Deep Neural Networks [2.5352713493505785]
We introduce a fully convolutional network for the document layout analysis task.
Our method, Doc-UFCN, relies on a U-shaped model trained from scratch for detecting objects in historical documents.
We show that Doc-UFCN outperforms state-of-the-art methods on various datasets.
arXiv Detail & Related papers (2020-12-28T09:48:33Z) - Self-Supervised Representation Learning on Document Images [8.927538538637783]
We show that patch-based pre-training performs poorly on document images because of their different structural properties and poor intra-sample semantic information.
We propose two context-aware alternatives to improve performance on the Tobacco-3482 image classification task.
arXiv Detail & Related papers (2020-04-18T10:14:06Z) - LayoutLM: Pre-training of Text and Layout for Document Image Understanding [108.12766816023783]
We propose LayoutLM to jointly model interactions between text and layout information across scanned document images.
This is the first time that text and layout are jointly learned in a single framework for document-level pre-training.
It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42).
arXiv Detail & Related papers (2019-12-31T14:31:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.