LAPDoc: Layout-Aware Prompting for Documents
- URL: http://arxiv.org/abs/2402.09841v1
- Date: Thu, 15 Feb 2024 10:00:49 GMT
- Title: LAPDoc: Layout-Aware Prompting for Documents
- Authors: Marcel Lamott, Yves-Noel Weweler, Adrian Ulges, Faisal Shafait, Dirk
Krechel, Darko Obradovic
- Abstract summary: We investigate the possibility of using purely text-based LLMs for document-specific tasks via layout enrichment.
Our results indicate that layout enrichment can improve the performance of purely text-based LLMs for document understanding by up to 15%.
- Score: 3.523208537466128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in training large language models (LLMs) using massive
amounts of solely textual data lead to strong generalization across many
domains and tasks, including document-specific tasks. In contrast, there is
a trend toward training multi-modal transformer architectures tailored for document
understanding that are designed specifically to fuse textual inputs with the
corresponding document layout. This involves a separate fine-tuning step for
which additional training data is required. At present, no document
transformers with comparable generalization to LLMs are available. This raises
the question of which type of model should be preferred for document
understanding tasks. In this paper we investigate the possibility of using
purely text-based LLMs for document-specific tasks via layout enrichment. We explore drop-in
modifications and rule-based methods to enrich purely textual LLM prompts with
layout information. In our experiments we investigate the effects on the
commercial ChatGPT model and the open-source LLM Solar. We demonstrate that,
with our approach, both LLMs show improved performance on various standard
document benchmarks. In addition, we study the impact of noisy OCR and layout
errors, as well as the limitations of LLMs when it comes to utilizing document
layout. Our results indicate that layout enrichment can improve the performance
of purely text-based LLMs for document understanding by up to 15% compared to
just using plain document text. In conclusion, layout enrichment should be
taken into account when choosing between text-based LLMs and multi-modal
document transformers.
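To make the idea of rule-based layout enrichment concrete, the sketch below shows one way OCR output (words with bounding boxes) could be verbalized into a plain-text block whose whitespace roughly mirrors the page layout, ready to be placed into an LLM prompt. The class and function names, the character-grid resolution, and the spacing heuristic are illustrative assumptions for this sketch, not the exact rules evaluated in the paper.

```python
# Minimal sketch of a rule-based layout verbalizer (illustrative assumptions,
# not the paper's exact rules). Input: OCR words with pixel bounding boxes
# (x0, y0, x1, y1); output: a layout-preserving plain-text block for an LLM prompt.

from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def verbalize_layout(words, char_width=8.0, line_height=16.0):
    """Render OCR words onto a character grid so whitespace mimics the page layout.

    char_width and line_height are assumed average glyph dimensions in pixels;
    they control how pixel coordinates are quantized into text columns and rows.
    """
    rows = {}
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        row = int(w.y0 // line_height)   # quantize vertical position to a text row
        col = int(w.x0 // char_width)    # quantize horizontal position to a column
        rows.setdefault(row, []).append((col, w.text))

    rendered = []
    for row in sorted(rows):
        cursor, parts = 0, []
        for col, text in sorted(rows[row]):
            parts.append(" " * max(col - cursor, 1 if cursor else 0))  # pad to target column
            parts.append(text)
            cursor = col + len(text)
        rendered.append("".join(parts).rstrip())
    return "\n".join(rendered)


if __name__ == "__main__":
    words = [
        OcrWord("Invoice", 40, 10, 110, 26),
        OcrWord("No.", 400, 10, 430, 26),
        OcrWord("1234", 440, 10, 480, 26),
        OcrWord("Total", 40, 200, 90, 216),
        OcrWord("42.00", 400, 200, 450, 216),
    ]
    print(verbalize_layout(words))  # layout-enriched context to prepend to the task instruction
```

In the setting studied in the paper, such a layout-preserving rendering would replace the plain concatenated OCR text in the prompt; other rule-based variants might instead append bounding-box coordinates after each word or line.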
Related papers
- LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding [103.69014172427026]
Large multimodal models (LMMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents.
We present a novel framework named LoRA-Contextualizing Adaptation of Large multimodal models (LoCAL), which broadens the capabilities of any LMM to support long-document understanding.
arXiv Detail & Related papers (2024-11-02T02:09:01Z) - DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models [66.91204604417912]
This study aims to enhance the generalizability of small VDU models by distilling knowledge from LLMs.
We present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge.
Experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach.
arXiv Detail & Related papers (2024-10-04T00:53:32Z) - DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding [40.38251904765156]
Text-rich document understanding (TDU) refers to analyzing and comprehending documents containing substantial textual content.
We introduce DocLayLLM, an efficient and effective multi-modal extension of large language models (LLMs) specifically designed for TDU.
arXiv Detail & Related papers (2024-08-27T13:13:38Z) - DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems [99.17123445211115]
We introduce DocBench, a benchmark to evaluate large language model (LLM)-based document reading systems.
Our benchmark involves the recruitment of human annotators and the generation of synthetic questions.
It includes 229 real documents and 1,102 questions, spanning across five different domains and four major types of questions.
arXiv Detail & Related papers (2024-07-15T13:17:42Z) - TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models [9.232693392690702]
TextHawk is a document-oriented Multimodal Large Language Model (MLLM).
It explores efficient fine-grained perception through four dedicated components.
We conduct extensive experiments on both general and document-oriented MLLM benchmarks, and show that TextHawk outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-04-14T09:48:37Z) - LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding [21.916774808384893]
The proposed layout instruction tuning strategy consists of two components: layout-aware pre-training and layout-aware supervised fine-tuning.
Experiments on standard benchmarks show that the proposed LayoutLLM significantly outperforms existing methods that adopt open-source 7B LLMs/MLLMs for document understanding.
arXiv Detail & Related papers (2024-04-08T06:40:28Z) - Q-PEFT: Query-dependent Parameter Efficient Fine-tuning for Text Reranking with Large Language Models [28.105271954633682]
We introduce a query-dependent parameter-efficient fine-tuning (Q-PEFT) approach for text reranking to leak information to Large Language Models (LLMs).
We utilize the query to extract the top-$k$ tokens from input documents, serving as contextual clues.
We further augment Q-PEFT by substituting the retrieval mechanism with a multi-head attention layer to achieve end-to-end training and cover all the tokens in the documents.
arXiv Detail & Related papers (2024-04-06T06:44:41Z) - Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z) - Meta-Task Prompting Elicits Embeddings from Large Language Models [54.757445048329735]
We introduce a new unsupervised text embedding method, Meta-Task Prompting with Explicit One-Word Limitation.
We generate high-quality sentence embeddings from Large Language Models without the need for model fine-tuning.
Our findings suggest a new scaling law, offering a versatile and resource-efficient approach for embedding generation across diverse scenarios.
arXiv Detail & Related papers (2024-02-28T16:35:52Z) - DocLLM: A layout-aware generative language model for multimodal document
understanding [12.093889265216205]
We present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents.
Our model focuses exclusively on bounding box information to incorporate the spatial layout structure.
We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
arXiv Detail & Related papers (2023-12-31T22:37:52Z) - mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document
Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)