InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
- URL: http://arxiv.org/abs/2401.13313v1
- Date: Wed, 24 Jan 2024 09:09:37 GMT
- Title: InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
- Authors: Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki
- Abstract summary: InstructDoc is the first large-scale collection of 30 publicly available visual document understanding datasets.
InstructDr connects document images, image encoders, and large language models (LLMs) through a trainable bridging module.
Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of completing various visual document understanding
(VDU) tasks, e.g., question answering and information extraction, on real-world
documents through human-written instructions. To this end, we propose
InstructDoc, the first large-scale collection of 30 publicly available VDU
datasets, each with diverse instructions in a unified format, which covers a
wide range of 12 tasks and includes open document types/formats. Furthermore,
to enhance the generalization performance on VDU tasks, we design a new
instruction-based document reading and understanding model, InstructDr, that
connects document images, image encoders, and large language models (LLMs)
through a trainable bridging module. Experiments demonstrate that InstructDr
can effectively adapt to new VDU datasets, tasks, and domains via given
instructions and outperforms existing multimodal LLMs and ChatGPT without
specific training.
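The abstract describes the architecture only at a high level: image encoders are connected to a large language model through a trainable bridging module. As a rough illustration of that kind of design (a minimal sketch, not the authors' implementation; the module name, dimensions, query-token scheme, and the use of PyTorch are all assumptions), the bridging module can be a small trainable projector that maps visual features into the LLM's embedding space:

```python
# Minimal sketch of an image-encoder -> bridging module -> LLM input pipeline.
# All names, shapes, and the choice of PyTorch are illustrative assumptions,
# not the InstructDr implementation.
import torch
import torch.nn as nn


class BridgingModule(nn.Module):
    """Trainable projector that maps image-encoder features into the
    LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096,
                 num_query_tokens: int = 32):
        super().__init__()
        # Learnable query tokens that attend over the visual features
        self.query_tokens = nn.Parameter(torch.randn(num_query_tokens, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)  # into the LLM embedding space

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim) from an image encoder
        queries = self.query_tokens.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, visual_feats, visual_feats)
        return self.proj(attended)  # (batch, num_query_tokens, llm_dim)


# Usage: prepend the projected visual tokens to the embedded instruction text,
# then feed the concatenated sequence to the LLM.
bridge = BridgingModule()
visual_feats = torch.randn(2, 256, 1024)       # placeholder encoder output
instruction_embeds = torch.randn(2, 48, 4096)  # placeholder embedded instruction
llm_inputs = torch.cat([bridge(visual_feats), instruction_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 80, 4096])
```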
Related papers
- DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models [66.91204604417912]
This study aims to enhance the generalizability of small VDU models by distilling knowledge from LLMs.
We present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge.
Experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach.
arXiv Detail & Related papers (2024-10-04T00:53:32Z)
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity [80.02202386597138]
We construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains.
Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at a fraction of the cost of manual construction.
arXiv Detail & Related papers (2024-07-22T17:55:22Z)
- HRVDA: High-Resolution Visual Document Assistant [32.51417315241559]
We propose a High-Resolution Visual Document Assistant (HRVDA) to bridge the gap between MLLMs and visual document understanding.
HRVDA employs a content filtering mechanism and an instruction filtering module to filter out content-agnostic and instruction-agnostic visual tokens.
Our model achieves state-of-the-art performance across multiple document understanding datasets.
arXiv Detail & Related papers (2024-04-10T11:10:50Z)
- Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z)
- LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding [0.0]
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents.
Existing methods have been developed to enhance document comprehension by incorporating awareness of images, text, and layout structure during pre-training.
Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
arXiv Detail & Related papers (2024-03-21T09:25:24Z)
- DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding [91.17151775296234]
This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding.
Unlike existing work, which either struggles with high-resolution documents or gives up the large language model and thus constrains vision or language ability, DocPedia directly processes visual input in the frequency domain rather than in pixel space.
arXiv Detail & Related papers (2023-11-20T14:42:25Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input (see the sketch after this list).
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
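The UDoc entry above mentions a Transformer that takes multimodal embeddings as input. As a rough sketch of what such an input construction could look like (the fusion scheme, names, and shapes here are assumptions, not UDoc's actual design), text-token embeddings and visual region features can be projected to a shared width and concatenated into one sequence before entering the encoder:

```python
# Illustrative-only sketch of building a joint text+image input sequence for a
# Transformer encoder, in the spirit of "multimodal embeddings as input".
# Names, dimensions, and fusion details are assumptions, not UDoc's design.
import torch
import torch.nn as nn


class MultimodalInput(nn.Module):
    def __init__(self, vocab_size: int = 30522, region_dim: int = 2048, hidden: int = 768):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, hidden)   # OCR word tokens
        self.region_proj = nn.Linear(region_dim, hidden)     # visual region features
        self.type_embed = nn.Embedding(2, hidden)            # 0 = text, 1 = image

    def forward(self, token_ids: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        text = self.word_embed(token_ids) + self.type_embed.weight[0]
        image = self.region_proj(region_feats) + self.type_embed.weight[1]
        return torch.cat([text, image], dim=1)  # one sequence for the Transformer


# Usage: 64 text tokens and 16 image regions become an 80-token multimodal sequence.
inputs = MultimodalInput()(torch.randint(0, 30522, (2, 64)), torch.randn(2, 16, 2048))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, nhead=12, batch_first=True), num_layers=2)
print(encoder(inputs).shape)  # torch.Size([2, 80, 768])
```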