A Simple yet Effective Layout Token in Large Language Models for Document Understanding
- URL: http://arxiv.org/abs/2503.18434v1
- Date: Mon, 24 Mar 2025 08:32:54 GMT
- Title: A Simple yet Effective Layout Token in Large Language Models for Document Understanding
- Authors: Zhaoqing Zhu, Chuwei Luo, Zirui Shao, Feiyu Gao, Hangdi Xing, Qi Zheng, Ji Zhang,
- Abstract summary: LayTokenLLM represents layout information as a single token per text segment. It shares position IDs between text and layout tokens, eliminating the need for additional position IDs.
- Score: 13.40043973365106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent methods that integrate spatial layouts with text for document understanding in large language models (LLMs) have shown promising results. A commonly used method is to represent layout information as text tokens and interleave them with text content as inputs to the LLMs. However, such a method still demonstrates limitations, as it requires additional position IDs for the tokens used to represent layout information. Because the maximum number of position IDs is constrained, assigning them to layout information reduces those available for text content, which limits the model's capacity to learn from text during training; it also introduces a large number of potentially untrained position IDs during long-context inference, which can hinder performance on document understanding tasks. To address these issues, we propose LayTokenLLM, a simple yet effective method for document understanding. LayTokenLLM represents layout information as a single token per text segment and uses a specialized positional encoding scheme. It shares position IDs between text and layout tokens, eliminating the need for additional position IDs. This design maintains the model's capacity to learn from text while mitigating long-context issues during inference. Furthermore, a novel pre-training objective called Next Interleaved Text and Layout Token Prediction (NTLP) is devised to enhance cross-modality learning between text and layout tokens. Extensive experiments show that LayTokenLLM outperforms existing layout-integrated LLMs and MLLMs of similar scales on multi-page document understanding tasks, as well as most single-page tasks.
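To make the shared-position-ID idea concrete, here is a minimal sketch (not the authors' implementation) of how an interleaved sequence could reuse text position IDs for layout tokens. The rule of sharing the first text token's position ID, the placeholder LAYOUT_TOKEN_ID, and the Segment structure are illustrative assumptions; the abstract only states that each segment contributes one layout token and that position IDs are shared between text and layout tokens.

```python
# Minimal sketch (not the authors' code): build an interleaved token sequence in which
# each text segment contributes one layout token, and that layout token reuses a
# position ID already assigned to the segment's text instead of consuming a new one.
# Sharing the *first* text token's position ID is an assumption; the paper only states
# that text and layout tokens share position IDs.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    token_ids: List[int]                      # tokenized text of the segment
    bbox: Tuple[float, float, float, float]   # normalized (x0, y0, x1, y1)


LAYOUT_TOKEN_ID = -1  # placeholder id; in practice this would index a learned layout embedding


def interleave_with_shared_positions(segments: List[Segment]):
    """Return (input_ids, position_ids, bboxes) for an interleaved layout+text sequence."""
    input_ids, position_ids, bboxes = [], [], []
    next_pos = 0
    for seg in segments:
        # One layout token per segment; it shares the position ID of the segment's
        # first text token, so no extra position IDs are consumed.
        input_ids.append(LAYOUT_TOKEN_ID)
        position_ids.append(next_pos)
        bboxes.append(seg.bbox)

        for tok in seg.token_ids:
            input_ids.append(tok)
            position_ids.append(next_pos)
            next_pos += 1
    return input_ids, position_ids, bboxes


if __name__ == "__main__":
    segs = [Segment([101, 2054], (0.1, 0.1, 0.4, 0.15)),
            Segment([2003, 102], (0.5, 0.1, 0.9, 0.15))]
    ids, pos, boxes = interleave_with_shared_positions(segs)
    print(ids)  # [-1, 101, 2054, -1, 2003, 102]
    print(pos)  # [0, 0, 1, 2, 2, 3] -> layout tokens add no new position IDs
```

Under these assumptions, a text-only version of the same input would use positions 0-3, so the layout tokens consume no additional position IDs.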
Related papers
- Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding [42.15416804253783]
Multi-modal Large Language Models (MLLMs) endow large language models with visual comprehension capabilities.
How to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored.
We introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task.
arXiv Detail & Related papers (2025-03-18T11:07:14Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Discriminative-Generative Custom Tokens for Vision-Language Models [101.40245125955306]
This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries.
arXiv Detail & Related papers (2025-02-17T18:13:42Z) - GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts [53.568057283934714]
We propose a VLM-based framework that generates content-aware text logo layouts.
We introduce two model techniques to reduce the computation for processing multiple glyph images simultaneously.
To support instruction tuning of our model, we construct two extensive text logo datasets, which are 5x larger than the existing public dataset.
arXiv Detail & Related papers (2024-11-18T10:04:10Z) - A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding [30.754200683466788]
We introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding.
LayTextLLM projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long sequence issues.
It also shows enhanced performance in Key Information Extraction (KIE) and Visual Question Answering (VQA); a minimal sketch of this bounding-box-to-embedding idea appears after this list.
arXiv Detail & Related papers (2024-07-02T06:29:05Z) - LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding [21.916774808384893]
The proposed layout instruction tuning strategy consists of two components: layout-aware Pre-training and layout-aware Supervised Finetuning.
Experiments on standard benchmarks show that the proposed LayoutLLM significantly outperforms existing methods that adopt open-source 7B LLMs/MLLMs for document understanding.
arXiv Detail & Related papers (2024-04-08T06:40:28Z) - Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations. Our work deepens our understanding of the roles that different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z) - DocLLM: A layout-aware generative language model for multimodal document understanding [12.093889265216205]
We present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents.
Our model focuses exclusively on bounding box information to incorporate the spatial layout structure.
We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
arXiv Detail & Related papers (2023-12-31T22:37:52Z) - Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit contrastive learning-based vision-language pre-training approaches such as CLIP, representing images and texts with finite discrete tokens instead of patch and token embeddings.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z) - Knowing Where and What: Unified Word Block Pretraining for Document Understanding [11.46378901674016]
We propose UTel, a language model with Unified TExt and layout pre-training.
Specifically, we propose two pre-training tasks: Surrounding Word Prediction (SWP) for the layout learning, and Contrastive learning of Word Embeddings (CWE) for identifying different word blocks.
In this way, the joint training of Masked Layout-Language Modeling (MLLM) and two newly proposed tasks enables the interaction between semantic and spatial features in a unified way.
arXiv Detail & Related papers (2022-07-28T09:43:06Z) - Enabling Language Models to Fill in the Blanks [81.59381915581892]
We present a simple approach for text infilling, the task of predicting missing spans of text at any position in a document.
We train (or fine-tune) off-the-shelf language models on sequences containing the concatenation of artificially-masked text and the text which was masked.
We show that this approach, which we call infilling by language modeling, can enable LMs to infill entire sentences effectively on three different domains: short stories, scientific abstracts, and lyrics.
arXiv Detail & Related papers (2020-05-11T18:00:03Z) - LayoutLM: Pre-training of Text and Layout for Document Image Understanding [108.12766816023783]
We propose LayoutLM to jointly model interactions between text and layout information across scanned document images.
This is the first time that text and layout are jointly learned in a single framework for document-level pre-training.
It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42).
arXiv Detail & Related papers (2019-12-31T14:31:29Z)
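As a rough illustration of the bounding-box-as-one-token idea from the "A Bounding Box is Worth One Token" (LayTextLLM) entry above, here is a minimal sketch. The single linear projector, the toy hidden size, and the interleaving order are assumptions for illustration; the summary only says each bounding box is projected to a single embedding and interleaved with text.

```python
# Minimal sketch (assumptions: a single linear projection and a toy hidden size;
# LayTextLLM's actual projector and interleaving details are not specified here).
import torch
import torch.nn as nn


class BoxToTokenEmbedding(nn.Module):
    """Project a normalized bounding box (x0, y0, x1, y1) to one token-sized embedding."""

    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.proj = nn.Linear(4, hidden_size)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (num_segments, 4) -> (num_segments, hidden_size), one embedding per box
        return self.proj(boxes)


def interleave(box_embeds: torch.Tensor, text_embeds: list) -> torch.Tensor:
    """Place each box embedding directly before the text embeddings of its segment."""
    pieces = []
    for box_vec, seg_embeds in zip(box_embeds, text_embeds):
        pieces.append(box_vec.unsqueeze(0))  # one layout "token" per segment
        pieces.append(seg_embeds)            # (seg_len, hidden_size)
    return torch.cat(pieces, dim=0)


if __name__ == "__main__":
    torch.manual_seed(0)
    projector = BoxToTokenEmbedding(hidden_size=64)
    boxes = torch.tensor([[0.1, 0.1, 0.4, 0.15],
                          [0.5, 0.1, 0.9, 0.15]])
    text_embeds = [torch.randn(3, 64), torch.randn(2, 64)]  # toy text embeddings per segment
    seq = interleave(projector(boxes), text_embeds)
    print(seq.shape)  # torch.Size([7, 64]) -> 2 layout tokens + 5 text tokens
```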
This list is automatically generated from the titles and abstracts of the papers in this site.