StrucTexTv2: Masked Visual-Textual Prediction for Document Image
Pre-training
- URL: http://arxiv.org/abs/2303.00289v1
- Date: Wed, 1 Mar 2023 07:32:51 GMT
- Title: StrucTexTv2: Masked Visual-Textual Prediction for Document Image
Pre-training
- Authors: Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo,
Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang
- Abstract summary: StrucTexTv2 is an effective document image pre-training framework.
It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling.
It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
- Score: 64.37272287179661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present StrucTexTv2, an effective document image
pre-training framework, by performing masked visual-textual prediction. It
consists of two self-supervised pre-training tasks: masked image modeling and
masked language modeling, based on text region-level image masking. The
proposed method randomly masks some image regions according to the bounding box
coordinates of text words. The objectives of our pre-training tasks are
reconstructing the pixels of masked image regions and the corresponding masked
tokens simultaneously. Hence the pre-trained encoder captures more textual
semantics than masked image modeling, which usually predicts only the masked
image patches. Compared to masked multi-modal modeling methods for document
image understanding, which rely on both the image and text modalities,
StrucTexTv2 takes image-only input and can therefore handle more application
scenarios without OCR pre-processing. Extensive experiments on mainstream
benchmarks of document image understanding demonstrate the effectiveness of
StrucTexTv2. It achieves competitive or even new state-of-the-art performance
in various downstream tasks such as image classification, layout analysis,
table structure recognition, document OCR, and information extraction under the
end-to-end scenario.
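
To make the two pre-training objectives concrete, below is a minimal PyTorch-style sketch of text region-level masking with joint pixel reconstruction and masked-token prediction. It is illustrative only, not the authors' implementation: the tiny encoder, head sizes, BERT-like vocabulary size, and the word boxes and token ids (which the real method obtains from pre-training annotations) are placeholder assumptions.

```python
# Illustrative sketch of StrucTexTv2-style pre-training objectives (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_text_regions(image, boxes, mask_ratio=0.3):
    """Zero out a random subset of word-level boxes (x0, y0, x1, y1) in pixel coordinates."""
    masked = image.clone()
    n_mask = max(1, int(len(boxes) * mask_ratio))
    picked = torch.randperm(len(boxes))[:n_mask].tolist()
    for i in picked:
        x0, y0, x1, y1 = boxes[i]
        masked[:, :, y0:y1, x0:x1] = 0.0          # text region-level image masking
    return masked, picked

class TinyDocEncoder(nn.Module):
    """Stand-in convolutional encoder; the real model uses a stronger visual backbone."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=8, stride=8), nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )
    def forward(self, x):
        return self.net(x)                        # (B, dim, H/8, W/8)

class DualHeads(nn.Module):
    """Two heads over pooled region features: pixel regression (MIM) and token prediction (MLM)."""
    def __init__(self, dim=128, patch=16, vocab=30522):  # vocab size is a BERT-like placeholder
        super().__init__()
        self.pixel_head = nn.Linear(dim, 3 * patch * patch)
        self.token_head = nn.Linear(dim, vocab)

def region_feature(feat_map, box, stride=8):
    """Average-pool the feature map over a word box (a crude stand-in for ROI pooling)."""
    x0, y0, x1, y1 = [v // stride for v in box]
    region = feat_map[:, :, y0:max(y1, y0 + 1), x0:max(x1, x0 + 1)]
    return region.mean(dim=(2, 3))                # (B, dim)

# --- toy forward/backward pass -------------------------------------------
image = torch.rand(1, 3, 256, 256)                # fake document image
boxes = [(16, 16, 80, 40), (96, 16, 200, 40)]     # word bounding boxes (assumed given during pre-training)
token_ids = torch.tensor([1037, 2338])            # tokens of those words (assumed given during pre-training)

masked_img, picked = mask_text_regions(image, boxes)
encoder, heads = TinyDocEncoder(), DualHeads()
feat = encoder(masked_img)

pix_loss, tok_loss = 0.0, 0.0
for i in picked:                                  # losses only over the masked regions
    f = region_feature(feat, boxes[i])
    x0, y0, x1, y1 = boxes[i]
    # pixel target: the original region resized to a fixed patch size
    target = F.interpolate(image[:, :, y0:y1, x0:x1], size=(16, 16)).flatten(1)
    pix_loss = pix_loss + F.mse_loss(heads.pixel_head(f), target)                    # masked image modeling
    tok_loss = tok_loss + F.cross_entropy(heads.token_head(f), token_ids[i:i + 1])   # masked language modeling

loss = pix_loss + tok_loss                        # the two objectives are optimized jointly
loss.backward()
```

In this sketch, both objectives are driven purely by the image input: the token targets are used only as supervision, which mirrors why the pre-trained encoder can be applied downstream without OCR pre-processing.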
Related papers
- DiffSTR: Controlled Diffusion Models for Scene Text Removal [5.790630195329777]
Scene Text Removal (STR) aims to prevent unauthorized use of text in images.
STR faces several challenges, including boundary artifacts, inconsistent texture and color, and preserving correct shadows.
We introduce a ControlNet diffusion model, treating STR as an inpainting task.
We develop a mask pretraining pipeline to condition our diffusion model.
arXiv Detail & Related papers (2024-10-29T04:20:21Z)
- MaskInversion: Localized Embeddings via Optimization of Explainability Maps [49.50785637749757]
MaskInversion generates a context-aware embedding for a query image region specified by a mask at test time.
It can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation.
arXiv Detail & Related papers (2024-07-29T14:21:07Z)
- TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z)
- LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding [7.7514466231699455]
This paper proposes a novel multi-modal pre-training model, LayoutMask.
It can enhance the interactions between text and layout modalities in a unified model.
It can achieve state-of-the-art results on a wide variety of VrDU problems.
arXiv Detail & Related papers (2023-05-30T03:56:07Z)
- MaskSketch: Unpaired Structure-guided Masked Image Generation [56.88038469743742]
MaskSketch is an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling.
We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image.
Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure.
arXiv Detail & Related papers (2023-02-10T20:27:02Z)
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z)
- Data Efficient Masked Language Modeling for Vision and Language [16.95631509102115]
Masked language modeling (MLM) is one of the key sub-tasks in vision-language training.
In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text.
We investigate a range of alternative masking strategies specific to the cross-modal setting that address the shortcomings of random masking.
arXiv Detail & Related papers (2021-09-05T11:27:53Z)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)