LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language
Models
- URL: http://arxiv.org/abs/2309.09506v2
- Date: Tue, 19 Sep 2023 07:47:52 GMT
- Title: LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language
Models
- Authors: Zecheng Tang, Chenfei Wu, Juntao Li, Nan Duan
- Abstract summary: We propose a model that treats layout generation as a code generation task to enhance semantic information.
We develop a Code Instruct Tuning (CIT) approach comprising three interconnected modules.
We attain significant state-of-the-art performance on multiple datasets.
- Score: 84.16541551923221
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graphic layout generation, a growing research field, plays a significant role
in user engagement and information perception. Existing methods primarily treat
layout generation as a numerical optimization task, focusing on quantitative
aspects while overlooking the semantic information of layout, such as the
relationship between each layout element. In this paper, we propose LayoutNUWA,
the first model that treats layout generation as a code generation task to
enhance semantic information and harness the hidden layout expertise of large
language models (LLMs). More concretely, we develop a Code Instruct Tuning
(CIT) approach comprising three interconnected modules: 1) the Code
Initialization (CI) module quantifies the numerical conditions and initializes
them as HTML code with strategically placed masks; 2) the Code Completion (CC)
module employs the formatting knowledge of LLMs to fill in the masked portions
within the HTML code; 3) the Code Rendering (CR) module transforms the
completed code into the final layout output, ensuring a highly interpretable
and transparent layout generation procedure that directly maps code to a
visualized layout. We attain significant state-of-the-art performance (even
over 50% improvements) on multiple datasets, showcasing the strong
capabilities of LayoutNUWA. Our code is available at
https://github.com/ProjectNUWA/LayoutNUWA.
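To make the three CIT modules concrete, the sketch below shows a minimal, hypothetical version of the pipeline in Python. The HTML template, the "<M>" mask token, and the `llm` callable are illustrative assumptions made for this summary, not LayoutNUWA's exact code format or API (see the repository above for the real implementation).

```python
import re

# Minimal sketch of layout generation as code generation. Assumptions: the
# HTML template, the "<M>" mask token, and the llm callable are illustrative,
# not LayoutNUWA's actual format or API.
MASK = "<M>"

def code_initialization(elements, canvas=(120, 160)):
    """CI: quantify the given conditions and emit HTML with masks for unknowns."""
    w, h = canvas
    lines = [f'<div class="canvas" style="width:{w}px; height:{h}px">']
    for el in elements:
        # Any attribute not given as a condition becomes a mask token.
        attrs = {k: el.get(k, MASK) for k in ("left", "top", "width", "height")}
        style = "; ".join(f"{k}:{v}px" for k, v in attrs.items())
        lines.append(f'  <div class="{el["type"]}" style="{style}"></div>')
    lines.append("</div>")
    return "\n".join(lines)

def code_completion(masked_html, llm):
    """CC: ask an instruction-tuned LLM to fill in every mask token."""
    prompt = "Fill in the masked attribute values in this layout code:\n" + masked_html
    return llm(prompt)  # hypothetical callable returning the completed HTML

def code_rendering(completed_html):
    """CR: parse the completed code back into numeric boxes for visualization."""
    pattern = (r'class="(\w+)" style="left:(\d+)px; top:(\d+)px; '
               r'width:(\d+)px; height:(\d+)px"')
    return [
        {"type": m.group(1),
         "box": tuple(int(v) for v in m.groups()[1:])}  # (x, y, w, h)
        for m in re.finditer(pattern, completed_html)
    ]

# Example: condition only on element types; the geometry is masked,
# so the LLM's completion determines where each element goes.
masked = code_initialization([{"type": "title"}, {"type": "image"}, {"type": "text"}])
```

Under this framing, different conditional generation settings (e.g., known sizes but unknown positions) simply correspond to which attribute values are written out in the CI step and which are masked.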
Related papers
- GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts [53.568057283934714]
We propose a VLM-based framework that generates content-aware text logo layouts.
We introduce two modeling techniques to reduce the computation needed to process multiple glyph images simultaneously.
To support instruction tuning of our model, we construct two extensive text logo datasets, which are 5x larger than the existing public dataset.
arXiv Detail & Related papers (2024-11-18T10:04:10Z)
- ESALE: Enhancing Code-Summary Alignment Learning for Source Code Summarization [21.886950861445122]
Code summarization aims to automatically generate succinct natural language summaries for given code snippets.
This paper proposes a novel approach to improve code summarization based on summary-focused tasks.
arXiv Detail & Related papers (2024-07-01T03:06:51Z)
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation.
Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts.
We conduct extensive experiments and achieve state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks.
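As a rough illustration of the "structured text (JSON format)" output mentioned above, the snippet below shows a hypothetical layout record; the field names and values are assumptions made for readability, not PosterLLaVa's actual schema.

```python
import json

# Hypothetical JSON-format layout record; field names and values are
# illustrative only, not PosterLLaVa's actual output schema.
layout = {
    "canvas": {"width": 1024, "height": 768},
    "elements": [
        {"type": "title", "bbox": [64, 40, 896, 120]},   # [x, y, w, h]
        {"type": "image", "bbox": [64, 200, 512, 384]},
        {"type": "text",  "bbox": [608, 200, 352, 384]},
    ],
}
print(json.dumps(layout, indent=2))
```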
arXiv Detail & Related papers (2024-06-05T03:05:52Z)
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding [21.916774808384893]
The proposed layout instruction tuning strategy consists of two components: layout-aware Pre-training and layout-aware Supervised Finetuning.
Experiments on standard benchmarks show that the proposed LayoutLLM significantly outperforms existing methods that adopt open-source 7B LLMs/MLLMs for document understanding.
arXiv Detail & Related papers (2024-04-08T06:40:28Z)
- KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction [59.039355258637315]
We propose KnowCoder, a Large Language Model (LLM) to conduct Universal Information Extraction (UIE) via code generation.
KnowCoder introduces a code-style schema representation method to uniformly transform different schemas into Python classes.
KnowCoder contains a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning.
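To illustrate the code-style schema representation, the sketch below shows how extraction schemas could be rendered as Python classes so that extracted instances become constructor calls. The class and field names are illustrative assumptions, not KnowCoder's actual schema definitions.

```python
from dataclasses import dataclass

# Hypothetical code-style schemas: each extraction type is a Python class,
# so an LLM can extract by generating constructor calls.
# Names below are illustrative, not KnowCoder's actual schema library.
@dataclass
class Entity:
    mention: str

@dataclass
class Person(Entity):
    pass

@dataclass
class Organization(Entity):
    pass

@dataclass
class WorkFor:
    """Relation schema: a Person works for an Organization."""
    employee: Person
    employer: Organization

# For the sentence "Alice joined Acme Corp in 2021", the model's code output
# could be parsed into a structured record like this:
record = WorkFor(employee=Person("Alice"), employer=Organization("Acme Corp"))
```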
arXiv Detail & Related papers (2024-03-12T14:56:34Z)
- DocLLM: A layout-aware generative language model for multimodal document understanding [12.093889265216205]
We present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents.
Our model focuses exclusively on bounding box information to incorporate the spatial layout structure.
We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
arXiv Detail & Related papers (2023-12-31T22:37:52Z)
- Enhancing Visually-Rich Document Understanding via Layout Structure Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z)
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding [52.3895498789521]
We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents.
Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art results on key information extraction and document question answering.
arXiv Detail & Related papers (2022-10-12T12:59:24Z)
- LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding [17.179384053140236]
Document layout comprises both structural and visual (e.g., font sizes) information that is vital but often ignored by machine learning models.
We propose a novel layout-aware multimodal hierarchical framework, LAMPreT, to model the blocks and the whole document.
We evaluate the proposed model on two layout-aware tasks -- text block filling and image suggestion.
arXiv Detail & Related papers (2021-04-16T23:27:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.