LayoutGPT: Compositional Visual Planning and Generation with Large
Language Models
- URL: http://arxiv.org/abs/2305.15393v2
- Date: Sat, 28 Oct 2023 06:56:32 GMT
- Title: LayoutGPT: Compositional Visual Planning and Generation with Large
Language Models
- Authors: Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula,
Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang
- Abstract summary: Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions.
We propose LayoutGPT, a method to compose in-context visual demonstrations in style sheet language.
- Score: 98.81962282674151
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attaining a high degree of user controllability in visual generation often
requires intricate, fine-grained inputs like layouts. However, such inputs
impose a substantial burden on users when compared to simple text inputs. To
address the issue, we study how Large Language Models (LLMs) can serve as
visual planners by generating layouts from text conditions, and thus
collaborate with visual generative models. We propose LayoutGPT, a method to
compose in-context visual demonstrations in style sheet language to enhance the
visual planning skills of LLMs. LayoutGPT can generate plausible layouts in
multiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also
shows superior performance in converting challenging language concepts like
numerical and spatial relations to layout arrangements for faithful
text-to-image generation. When combined with a downstream image generation
model, LayoutGPT outperforms text-to-image models/systems by 20-40% and
achieves performance comparable to human users in designing visual layouts for
numerical and spatial correctness. Lastly, LayoutGPT achieves comparable
performance to supervised methods in 3D indoor scene synthesis, demonstrating
its effectiveness and potential in multiple visual domains.
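As a rough illustration of the in-context format described in the abstract, below is a minimal sketch (not the authors' released code) of how layout exemplars might be serialized in a CSS-like style sheet and assembled into a few-shot prompt for an LLM. The property names, canvas size, and prompt wording are assumptions made for illustration only.

```python
# A minimal sketch of the idea in the abstract: serialize (caption, layout)
# exemplars in a CSS-like "style sheet" format and prepend them as in-context
# demonstrations before asking an LLM to complete a layout for a new caption.
# The exact property names, canvas size, and prompt wording are assumptions,
# not the authors' released implementation.

from typing import List, Tuple

Box = Tuple[str, int, int, int, int]  # (category, left, top, width, height)


def to_css(caption: str, boxes: List[Box], canvas: int = 256) -> str:
    """Render one (caption, layout) pair as a CSS-like demonstration."""
    lines = [f"/* {caption} */",
             f"canvas {{ width: {canvas}px; height: {canvas}px; }}"]
    for name, left, top, width, height in boxes:
        lines.append(
            f"{name} {{ left: {left}px; top: {top}px; "
            f"width: {width}px; height: {height}px; }}"
        )
    return "\n".join(lines)


def build_prompt(demos: List[Tuple[str, List[Box]]], query: str) -> str:
    """Concatenate in-context demonstrations followed by the unseen query caption."""
    parts = [to_css(caption, boxes) for caption, boxes in demos]
    # Leave the query layout empty so the LLM completes it.
    parts.append(f"/* {query} */\ncanvas {{ width: 256px; height: 256px; }}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [
        ("two apples on a table",
         [("apple", 40, 120, 60, 60), ("apple", 140, 120, 60, 60)]),
        ("a cat to the left of a dog",
         [("cat", 20, 100, 90, 110), ("dog", 140, 90, 100, 130)]),
    ]
    print(build_prompt(demos, "three books stacked on a desk"))
    # The resulting string would be sent to an LLM; its completion (further
    # CSS-like rules) is then parsed back into bounding boxes and passed to a
    # layout-conditioned image or scene generator.
```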
Related papers
- GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts [53.568057283934714]
We propose a VLM-based framework that generates content-aware text logo layouts.
We introduce two modeling techniques that reduce the computation needed to process multiple glyph images simultaneously.
To support instruction tuning of our model, we construct two extensive text logo datasets, which are 5x larger than the existing public dataset.
arXiv Detail & Related papers (2024-11-18T10:04:10Z) - TextLap: Customizing Language Models for Text-to-Layout Planning [65.02105936609021]
We call our method TextLap (text-based layout planning).
It uses a curated instruction-based layout planning dataset (InsLap) to customize Large Language Models (LLMs) as a graphic designer.
We demonstrate the effectiveness of TextLap and show that it outperforms strong baselines, including GPT-4 based methods, on image generation and graphic design benchmarks.
arXiv Detail & Related papers (2024-10-09T19:51:38Z) - Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models [38.52953013858373]
We introduce Playground v3 (PGv3), our latest text-to-image model.
It achieves state-of-the-art (SoTA) performance across multiple testing benchmarks.
It excels in text prompt adherence, complex reasoning, and accurate text rendering.
arXiv Detail & Related papers (2024-09-16T19:52:24Z) - PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation.
Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts.
We conduct extensive experiments and achieve state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks.
arXiv Detail & Related papers (2024-06-05T03:05:52Z) - Reason out Your Layout: Evoking the Layout Master from Large Language
Models for Text-to-Image Synthesis [47.27044390204868]
We introduce a novel approach to improving T2I diffusion models using Large Language Models (LLMs) as layout generators.
Our experiments demonstrate significant improvements in image quality and layout accuracy.
arXiv Detail & Related papers (2023-11-28T14:51:13Z) - AutoStory: Generating Diverse Storytelling Images with Minimal Human
Effort [55.83007338095763]
We propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images.
We utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images.
arXiv Detail & Related papers (2023-11-19T06:07:37Z) - Towards Language-guided Interactive 3D Generation: LLMs as Layout
Interpreter with Generative Feedback [20.151147653552155]
Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities.
We propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter.
Our system also incorporates LLaVA, a large language and vision assistant, to provide generative feedback from the visual aspect for improving the visual quality of generated content.
arXiv Detail & Related papers (2023-05-25T07:43:39Z) - Composition-aware Graphic Layout GAN for Visual-textual Presentation
Designs [24.29890251913182]
We study the graphic layout generation problem of producing high-quality visual-textual presentation designs for given images.
We propose a deep generative model, dubbed composition-aware graphic layout GAN (CGL-GAN), to synthesize layouts based on the global and spatial visual contents of input images.
arXiv Detail & Related papers (2022-04-30T16:42:13Z) - TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)