LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language
Models
- URL: http://arxiv.org/abs/2309.09506v2
- Date: Tue, 19 Sep 2023 07:47:52 GMT
- Title: LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language
Models
- Authors: Zecheng Tang, Chenfei Wu, Juntao Li, Nan Duan
- Abstract summary: We propose a model that treats layout generation as a code generation task to enhance semantic information.
We develop a Code Instruct Tuning (CIT) approach comprising three interconnected modules.
We attain significant state-of-the-art performance on multiple datasets.
- Score: 84.16541551923221
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graphic layout generation, a growing research field, plays a significant role
in user engagement and information perception. Existing methods primarily treat
layout generation as a numerical optimization task, focusing on quantitative
aspects while overlooking the semantic information of layout, such as the
relationship between each layout element. In this paper, we propose LayoutNUWA,
the first model that treats layout generation as a code generation task to
enhance semantic information and harness the hidden layout expertise of large
language models (LLMs). More concretely, we develop a Code Instruct Tuning
(CIT) approach comprising three interconnected modules: 1) the Code
Initialization (CI) module quantifies the numerical conditions and initializes
them as HTML code with strategically placed masks; 2) the Code Completion (CC)
module employs the formatting knowledge of LLMs to fill in the masked portions
within the HTML code; 3) the Code Rendering (CR) module transforms the
completed code into the final layout output, ensuring a highly interpretable
and transparent layout generation procedure that directly maps code to a
visualized layout. We attain significant state-of-the-art performance (even
over 50% improvements) on multiple datasets, showcasing the strong
capabilities of LayoutNUWA. Our code is available at
https://github.com/ProjectNUWA/LayoutNUWA.
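To make the three CIT modules concrete, the sketch below shows a minimal, hypothetical version of the pipeline in Python. The HTML template, the "<M>" mask token, and the `llm` callable are illustrative assumptions made for this summary, not LayoutNUWA's exact code format or API (see the repository above for the real implementation).

```python
import re

# Minimal sketch of layout generation as code generation. Assumptions: the
# HTML template, the "<M>" mask token, and the llm callable are illustrative,
# not LayoutNUWA's actual format or API.
MASK = "<M>"

def code_initialization(elements, canvas=(120, 160)):
    """CI: quantify the given conditions and emit HTML with masks for unknowns."""
    w, h = canvas
    lines = [f'<div class="canvas" style="width:{w}px; height:{h}px">']
    for el in elements:
        # Any attribute not given as a condition becomes a mask token.
        attrs = {k: el.get(k, MASK) for k in ("left", "top", "width", "height")}
        style = "; ".join(f"{k}:{v}px" for k, v in attrs.items())
        lines.append(f'  <div class="{el["type"]}" style="{style}"></div>')
    lines.append("</div>")
    return "\n".join(lines)

def code_completion(masked_html, llm):
    """CC: ask an instruction-tuned LLM to fill in every mask token."""
    prompt = "Fill in the masked attribute values in this layout code:\n" + masked_html
    return llm(prompt)  # hypothetical callable returning the completed HTML

def code_rendering(completed_html):
    """CR: parse the completed code back into numeric boxes for visualization."""
    pattern = (r'class="(\w+)" style="left:(\d+)px; top:(\d+)px; '
               r'width:(\d+)px; height:(\d+)px"')
    return [
        {"type": m.group(1),
         "box": tuple(int(v) for v in m.groups()[1:])}  # (x, y, w, h)
        for m in re.finditer(pattern, completed_html)
    ]

# Example: condition only on element types; the geometry is masked,
# so the LLM's completion determines where each element goes.
masked = code_initialization([{"type": "title"}, {"type": "image"}, {"type": "text"}])
```

Under this framing, different conditional generation settings (e.g., known sizes but unknown positions) simply correspond to which attribute values are written out in the CI step and which are masked.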
Related papers
- GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts [53.568057283934714]
We propose a VLM-based framework that generates content-aware text logo layouts.
We introduce two modeling techniques to reduce the computation needed to process multiple glyph images simultaneously.
To support instruction tuning of our model, we construct two extensive text logo datasets, which are 5x larger than the existing public dataset.
arXiv Detail & Related papers (2024-11-18T10:04:10Z)
- ESALE: Enhancing Code-Summary Alignment Learning for Source Code Summarization [21.886950861445122]
Code summarization aims to automatically generate succinct natural language summaries for given code snippets.
This paper proposes a novel approach to improve code summarization based on summary-focused tasks.
arXiv Detail & Related papers (2024-07-01T03:06:51Z)
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation.
Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts.
We conduct extensive experiments and achieve state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks.
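As a rough illustration of the "structured text (JSON format)" output mentioned above, the snippet below shows a hypothetical layout record; the field names and values are assumptions made for readability, not PosterLLaVa's actual schema.

```python
import json

# Hypothetical JSON-format layout record; field names and values are
# illustrative only, not PosterLLaVa's actual output schema.
layout = {
    "canvas": {"width": 1024, "height": 768},
    "elements": [
        {"type": "title", "bbox": [64, 40, 896, 120]},   # [x, y, w, h]
        {"type": "image", "bbox": [64, 200, 512, 384]},
        {"type": "text",  "bbox": [608, 200, 352, 384]},
    ],
}
print(json.dumps(layout, indent=2))
```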
arXiv Detail & Related papers (2024-06-05T03:05:52Z)
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding [21.916774808384893]
The proposed layout instruction tuning strategy consists of two components: layout-aware Pre-training and layout-aware Supervised Finetuning.
Experiments on standard benchmarks show that the proposed LayoutLLM significantly outperforms existing methods that adopt open-source 7B LLMs/MLLMs for document understanding.
arXiv Detail & Related papers (2024-04-08T06:40:28Z)
- KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction [59.039355258637315]
We propose KnowCoder, a Large Language Model (LLM) to conduct Universal Information Extraction (UIE) via code generation.
KnowCoder introduces a code-style schema representation method to uniformly transform different schemas into Python classes.
KnowCoder contains a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning.
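To illustrate the code-style schema representation, the sketch below shows how extraction schemas could be rendered as Python classes so that extracted instances become constructor calls. The class and field names are illustrative assumptions, not KnowCoder's actual schema definitions.

```python
from dataclasses import dataclass

# Hypothetical code-style schemas: each extraction type is a Python class,
# so an LLM can extract by generating constructor calls.
# Names below are illustrative, not KnowCoder's actual schema library.
@dataclass
class Entity:
    mention: str

@dataclass
class Person(Entity):
    pass

@dataclass
class Organization(Entity):
    pass

@dataclass
class WorkFor:
    """Relation schema: a Person works for an Organization."""
    employee: Person
    employer: Organization

# For the sentence "Alice joined Acme Corp in 2021", the model's code output
# could be parsed into a structured record like this:
record = WorkFor(employee=Person("Alice"), employer=Organization("Acme Corp"))
```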
arXiv Detail & Related papers (2024-03-12T14:56:34Z)
- DocLLM: A layout-aware generative language model for multimodal document understanding [12.093889265216205]
We present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents.
Our model focuses exclusively on bounding box information to incorporate the spatial layout structure.
We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
arXiv Detail & Related papers (2023-12-31T22:37:52Z)
- Enhancing Visually-Rich Document Understanding via Layout Structure Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z)
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding [52.3895498789521]
We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents.
Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art results on key information extraction and document question answering.
arXiv Detail & Related papers (2022-10-12T12:59:24Z)
- LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding [17.179384053140236]
Document layout comprises both structural and visual (e.g., font sizes) information that is vital but often ignored by machine learning models.
We propose a novel layout-aware multimodal hierarchical framework, LAMPreT, to model the blocks and the whole document.
We evaluate the proposed model on two layout-aware tasks -- text block filling and image suggestion.
arXiv Detail & Related papers (2021-04-16T23:27:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.