Large Language Models Understand Layout
- URL: http://arxiv.org/abs/2407.05750v3
- Date: Wed, 28 Aug 2024 02:04:24 GMT
- Title: Large Language Models Understand Layout
- Authors: Weiming Li, Manni Duan, Dong An, Yan Shao,
- Abstract summary: Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks.
We show that, beyond text understanding capability, LLMs are capable of processing text layouts denoted by spatial markers.
We show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.
- Score: 6.732578061359833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks. In this paper, we show that, beyond text understanding capability, LLMs are capable of processing text layouts that are denoted by spatial markers. They are able to answer questions that require explicit spatial perceiving and reasoning, while a drastic performance drop is observed when the spatial markers from the original data are excluded. We perform a series of experiments with the GPT-3.5, Baichuan2, Llama2 and ChatGLM3 models on various types of layout-sensitive datasets for further analysis. The experimental results reveal that the layout understanding ability of LLMs is mainly introduced by the coding data for pretraining, which is further enhanced at the instruction-tuning stage. In addition, layout understanding can be enhanced by integrating low-cost, auto-generated data approached by a novel text game. Finally, we show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.
Related papers
- LLaSA: Large Language and Structured Data Assistant [27.452536284165273]
Graph Neutral Networks (GNNs) have been introduced as an additional modality into the input of Large Language Models (LLMs)
We propose textbfLarge textbfLanguage and textbfStructured Data textbfAssistant (LLaSA) to enhance LLMs' ability to handle structured data.
arXiv Detail & Related papers (2024-11-16T12:27:14Z) - ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs)
We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map.
Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble and point.
arXiv Detail & Related papers (2024-07-31T11:40:29Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Enhancing Visual Document Understanding with Contrastive Learning in
Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo)
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs)
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z) - INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning [59.07490387145391]
Large language models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks.
Their application to information retrieval (IR) tasks is still challenging due to the infrequent occurrence of many IR-specific concepts in natural language.
We introduce a novel instruction tuning dataset, INTERS, encompassing 20 tasks across three fundamental IR categories.
arXiv Detail & Related papers (2024-01-12T12:10:28Z) - Graph Neural Prompting with Large Language Models [32.97391910476073]
Graph Neural Prompting (GNP) is a novel plug-and-play method to assist pre-trained language models in learning beneficial knowledge from knowledge graphs.
Extensive experiments on multiple datasets demonstrate the superiority of GNP on both commonsense and biomedical reasoning tasks.
arXiv Detail & Related papers (2023-09-27T06:33:29Z) - Harnessing Explanations: LLM-to-LM Interpreter for Enhanced
Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z) - Using Large Language Models to Generate Engaging Captions for Data
Visualizations [51.98253121636079]
Large language models (LLM) use sophisticated deep learning technology to produce human-like prose.
Key challenge lies in designing the most effective prompt for the LLM, a task called prompt engineering.
We report on first experiments using the popular LLM GPT-3 and deliver some promising results.
arXiv Detail & Related papers (2022-12-27T23:56:57Z) - SpartQA: : A Textual Question Answering Benchmark for Spatial Reasoning [10.810615375345511]
This paper proposes a benchmark for spatial reasoning on natural language text.
We design grammar and reasoning rules to automatically generate a spatial description of visual scenes and corresponding QA pairs.
Experiments show that further pretraining LMs on these automatically generated data significantly improves LMs' capability on spatial understanding.
arXiv Detail & Related papers (2021-04-12T21:37:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.