mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document
Understanding
- URL: http://arxiv.org/abs/2307.02499v1
- Date: Tue, 4 Jul 2023 11:28:07 GMT
- Title: mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document
Understanding
- Authors: Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan,
Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei
Huang
- Abstract summary: Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
- Score: 55.4806974284156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document understanding refers to automatically extracting, analyzing, and
comprehending information from various types of digital documents, such as web
pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl,
have demonstrated promising zero-shot capabilities in shallow OCR-free text
recognition, indicating their potential for OCR-free document understanding.
Nevertheless, without in-domain training, these models tend to ignore
fine-grained OCR features, such as sophisticated tables or large blocks of
text, which are essential for OCR-free document understanding. In this paper,
we propose mPLUG-DocOwl, built on mPLUG-Owl, for OCR-free document understanding.
Specifically, we first construct an instruction-tuning dataset featuring a wide
range of visual-text understanding tasks. Then, we strengthen the OCR-free
document understanding ability by jointly training the model on language-only,
general vision-and-language, and document instruction-tuning datasets with our
unified instruction tuning strategy. We also build an OCR-free document
instruction understanding evaluation set, LLMDoc, to better compare models'
instruction compliance and document understanding capabilities. Experimental
results show that our model outperforms existing multimodal models,
demonstrating its strong document understanding ability. Besides, without
task-specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream
tasks. Our code, models, training data, and evaluation set are available at
https://github.com/X-PLUG/mPLUG-DocOwl.
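The unified instruction tuning strategy amounts to training one model on a mixture of heterogeneous instruction sources. Below is a minimal, hypothetical sketch of such joint sampling in Python; the dataset contents and mixing weights are illustrative assumptions, not the paper's published configuration.

```python
import random

# Toy stand-ins for the three dataset types named in the abstract; real
# examples would be (instruction, image, answer) triples from actual corpora.
language_only = [{"instruction": "Summarize this paragraph: ...", "image": None, "answer": "..."}]
vision_language = [{"instruction": "Describe the image.", "image": "photo.png", "answer": "..."}]
document = [{"instruction": "What is the invoice total?", "image": "doc.png", "answer": "..."}]

# Assumed mixing weights; the paper does not publish these exact ratios.
mixture = [(language_only, 0.3), (vision_language, 0.3), (document, 0.4)]

def sample_batch(batch_size: int) -> list[dict]:
    """Draw one mixed batch so every training step sees all task families."""
    datasets, weights = zip(*mixture)
    picks = random.choices(datasets, weights=weights, k=batch_size)
    return [random.choice(ds) for ds in picks]

if __name__ == "__main__":
    for example in sample_batch(4):
        print(example["instruction"], "| image:", example["image"])
```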
Related papers
- DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models [66.91204604417912]
This study aims to enhance the generalizability of small visual document understanding (VDU) models by distilling knowledge from LLMs.
We present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge (sketched after this entry).
Experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach.
arXiv Detail & Related papers (2024-10-04T00:53:32Z) - mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens (a minimal compression sketch follows this entry).
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
arXiv Detail & Related papers (2024-09-05T11:09:00Z) - DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding [40.38251904765156]
- DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding [40.38251904765156]
Text-rich document understanding (TDU) refers to analyzing and comprehending documents containing substantial textual content.
We introduce DocLayLLM, an efficient and effective multi-modal extension of large language models (LLMs) specifically designed for TDU.
arXiv Detail & Related papers (2024-08-27T13:13:38Z) - VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding [18.609441902943445]
VisFocus is an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt.
We pair the architecture enhancements with a novel pre-training task, using language masking on a snippet of the document text fed to the visual encoder.
Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance (see the sketch after this entry).
arXiv Detail & Related papers (2024-07-17T14:16:46Z) - Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model [25.459787361454353]
- Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model [25.459787361454353]
We present a novel framework named R2S that leverages CoD (Chain of Dialogue) logic to guide large language models (LLMs) in generating knowledge-intensive multi-turn dialogues for instruction tuning.
By integrating raw documents from both open-source datasets and domain-specific web-crawled documents into a benchmark K-BENCH, we cover diverse areas such as Wikipedia (English), Science (Chinese), and Artifacts (Chinese).
arXiv Detail & Related papers (2024-07-03T12:04:10Z) - LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding [0.0]
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents.
Existing methods have been developed to enhance document comprehension by incorporating awareness of images, text, and layout structure during pre-training.
Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
arXiv Detail & Related papers (2024-03-21T09:25:24Z) - mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding [100.17063271791528]
We propose Unified Structure Learning to boost the performance of MLLMs.
Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks.
arXiv Detail & Related papers (2024-03-19T16:48:40Z) - UReader: Universal OCR-free Visually-situated Language Understanding
with Multimodal Large Language Model [108.85584502396182]
We propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM).
By leveraging the shallow text recognition ability of the MLLM, we finetune only 1.2% of the parameters (a parameter-efficient sketch follows this entry).
Our single model achieves state-of-the-art OCR-free performance on 8 out of 10 visually-situated language understanding tasks.
arXiv Detail & Related papers (2023-10-08T11:33:09Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses (a minimal sketch follows this entry).
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.