MarkupLM: Pre-training of Text and Markup Language for Visually-rich
Document Understanding
- URL: http://arxiv.org/abs/2110.08518v1
- Date: Sat, 16 Oct 2021 09:17:28 GMT
- Title: MarkupLM: Pre-training of Text and Markup Language for Visually-rich
Document Understanding
- Authors: Junlong Li, Yiheng Xu, Lei Cui, Furu Wei
- Abstract summary: Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU).
We propose MarkupLM for document understanding tasks with markup languages as the backbone.
Experiment results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks.
- Score: 35.35388421383703
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal pre-training with text, layout, and image has made significant
progress for Visually-rich Document Understanding (VrDU), especially the
fixed-layout documents such as scanned document images. However, there are still
a large number of digital documents where the layout information is not fixed
and needs to be interactively and dynamically rendered for visualization,
making existing layout-based pre-training approaches not easy to apply. In this
paper, we propose MarkupLM for document understanding tasks with markup
languages as the backbone such as HTML/XML-based documents, where text and
markup information is jointly pre-trained. Experiment results show that the
pre-trained MarkupLM significantly outperforms the existing strong baseline
models on several document understanding tasks. The pre-trained model and code
will be publicly available at https://aka.ms/markuplm.
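The abstract says text and markup information are pre-trained jointly but does not describe how the markup is represented. As a rough illustration only, the sketch below pairs each text node of an HTML string with an XPath-like tag path, which is one plausible input for such a model. The class `TextWithPath`, the helper `text_nodes_with_paths`, and the `VOID_TAGS` set are hypothetical names introduced here; this is not the authors' preprocessing code.

```python
from html.parser import HTMLParser

# Assumption: a small set of void tags that never contain text.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TextWithPath(HTMLParser):
    """Collect (text, tag_path) pairs from well-formed HTML (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, each with a sibling index
        self.counts = [{}]   # per-depth counters of child tags seen so far
        self.pairs = []      # collected (text, path) results

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        self.stack.append(f"{tag}[{siblings[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Only pop when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append((text, "/" + "/".join(self.stack)))


def text_nodes_with_paths(html):
    """Return every non-empty text node with the tag path leading to it."""
    parser = TextWithPath()
    parser.feed(html)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><ul><li>a</li><li>b</li></ul></body></html>"
    for text, path in text_nodes_with_paths(sample):
        print(f"{path}\t{text}")
```

The path depends only on the markup tree, so it stays the same however the page is rendered, which is the property the abstract contrasts with fixed-layout coordinates.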
Related papers
- Hierarchical Multimodal Pre-training for Visually Rich Webpage
Understanding [22.00873805952277]
WebLM is a multimodal pre-training network designed to address the limitations of solely modeling text and structure modalities of HTML in webpages.
We propose several pre-training tasks to model the interaction among text, structure, and image modalities effectively.
Empirical results demonstrate that the pre-trained WebLM significantly surpasses previous state-of-the-art pre-trained models across several webpage understanding tasks.
arXiv Detail & Related papers (2024-02-28T11:50:36Z)
- DocLLM: A layout-aware generative language model for multimodal document
understanding [12.093889265216205]
We present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents.
Our model focuses exclusively on bounding box information to incorporate the spatial layout structure.
We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
arXiv Detail & Related papers (2023-12-31T22:37:52Z)
- Towards Improving Document Understanding: An Exploration on
Text-Grounding via MLLMs [96.54224331778195]
We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images.
We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model.
Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating the effectiveness of our method.
arXiv Detail & Related papers (2023-11-22T06:46:37Z)
- DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents [18.080447065002392]
We propose DocumentCLIP to enforce vision-language pretraining models to comprehend the interaction between images and longer text within documents.
Our model is beneficial for real-world multimodal document understanding of sources such as news articles, magazines, and product descriptions, which contain linguistically and visually richer content.
arXiv Detail & Related papers (2023-06-09T23:51:11Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image
Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z)
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
- LayoutLM: Pre-training of Text and Layout for Document Image
Understanding [108.12766816023783]
We propose LayoutLM to jointly model interactions between text and layout information across scanned document images.
This is the first time that text and layout are jointly learned in a single framework for document-level pre-training.
It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42).
arXiv Detail & Related papers (2019-12-31T14:31:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.