LayoutGPT: Compositional Visual Planning and Generation with Large
Language Models
- URL: http://arxiv.org/abs/2305.15393v2
- Date: Sat, 28 Oct 2023 06:56:32 GMT
- Authors: Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula,
Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang
- Abstract summary: Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions.
We propose LayoutGPT, a method to compose in-context visual demonstrations in style sheet language.
- Score: 98.81962282674151
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attaining a high degree of user controllability in visual generation often
requires intricate, fine-grained inputs like layouts. However, such inputs
impose a substantial burden on users when compared to simple text inputs. To
address the issue, we study how Large Language Models (LLMs) can serve as
visual planners by generating layouts from text conditions, and thus
collaborate with visual generative models. We propose LayoutGPT, a method to
compose in-context visual demonstrations in style sheet language to enhance the
visual planning skills of LLMs. LayoutGPT can generate plausible layouts in
multiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also
shows superior performance in converting challenging language concepts like
numerical and spatial relations to layout arrangements for faithful
text-to-image generation. When combined with a downstream image generation
model, LayoutGPT outperforms text-to-image models/systems by 20-40% and
achieves performance comparable to human users in designing visual layouts for
numerical and spatial correctness. Lastly, LayoutGPT achieves comparable
performance to supervised methods in 3D indoor scene synthesis, demonstrating
its effectiveness and potential in multiple visual domains.
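The abstract's key idea is to serialize layouts as style-sheet-like structured text so that an LLM can complete them from in-context demonstrations. A minimal Python sketch of that serialization follows; the CSS-like property names (`left`, `top`, `width`, `height`) and the round-trip parser are illustrative assumptions, not LayoutGPT's exact prompt format.

```python
# Sketch: represent a 2D bounding-box layout as CSS-like structured text
# for LLM prompting, in the spirit of LayoutGPT's style-sheet demos.
# Property names and syntax are assumptions, not the paper's exact format.
import re

def layout_to_css(objects):
    """Serialize (name, x, y, w, h) boxes into CSS-like rules."""
    rules = []
    for name, x, y, w, h in objects:
        rules.append(
            f"{name} {{ left: {x}px; top: {y}px; "
            f"width: {w}px; height: {h}px; }}"
        )
    return "\n".join(rules)

def css_to_layout(text):
    """Parse the CSS-like rules back into (name, x, y, w, h) tuples."""
    boxes = []
    pattern = (
        r"(\S+)\s*\{\s*left:\s*(\d+)px;\s*top:\s*(\d+)px;\s*"
        r"width:\s*(\d+)px;\s*height:\s*(\d+)px;\s*\}"
    )
    for m in re.finditer(pattern, text):
        name, x, y, w, h = m.group(1), *map(int, m.groups()[1:])
        boxes.append((name, x, y, w, h))
    return boxes

# In-context demonstration for the LLM prompt, then a round-trip check.
demo = [("dog", 32, 160, 128, 96), ("frisbee", 180, 64, 48, 48)]
prompt = layout_to_css(demo)
assert css_to_layout(prompt) == demo
```

The round trip matters in practice: the same parser that reads demonstrations can validate the LLM's generated layout before it is handed to the image generator.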
Related papers
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation.
Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts.
We conducted extensive experiments and achieved state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks.
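The entry above describes emitting layouts as structured text in JSON. A minimal sketch of such a representation, with a validity check, is below; the schema (`canvas`, `elements`, `box` as `[x, y, w, h]`) is an assumption for illustration, not PosterLLaVa's published format.

```python
# Sketch: a JSON structured-text layout that a tuned multimodal LLM
# could emit. The schema below is assumed, not PosterLLaVa's exact one.
import json

layout_json = """
{
  "canvas": {"width": 512, "height": 768},
  "elements": [
    {"type": "title",  "box": [64, 40, 384, 80]},
    {"type": "image",  "box": [32, 140, 448, 420]},
    {"type": "button", "box": [176, 600, 160, 64]}
  ]
}
"""

layout = json.loads(layout_json)

def validate(layout):
    """Check every element box [x, y, w, h] stays inside the canvas."""
    cw = layout["canvas"]["width"]
    ch = layout["canvas"]["height"]
    for el in layout["elements"]:
        x, y, w, h = el["box"]
        assert 0 <= x and 0 <= y and x + w <= cw and y + h <= ch, el

validate(layout)
```

One appeal of JSON over free-form text is exactly this: a generated layout either parses and validates, or fails loudly before rendering.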
arXiv Detail & Related papers (2024-06-05T03:05:52Z)
- PosterLlama: Bridging Design Ability of Language Model to Contents-Aware Layout Generation [6.855409699832414]
PosterLlama is a network designed for generating visually and textually coherent layouts.
Our evaluations demonstrate that PosterLlama outperforms existing methods in producing authentic and content-aware layouts.
It supports a wide range of conditions, including unconditional layout generation, element-conditional layout generation, and layout completion, serving as a highly versatile user-manipulation tool.
arXiv Detail & Related papers (2024-04-01T08:46:35Z)
- Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis [47.27044390204868]
We introduce a novel approach to improving T2I diffusion models using Large Language Models (LLMs) as layout generators.
Our experiments demonstrate significant improvements in image quality and layout accuracy.
arXiv Detail & Related papers (2023-11-28T14:51:13Z)
- AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort [55.83007338095763]
We propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images.
We utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images.
arXiv Detail & Related papers (2023-11-19T06:07:37Z)
- A Parse-Then-Place Approach for Generating Graphic Layouts from Textual Descriptions [50.469491454128246]
We use text as the guidance to create graphic layouts, i.e., Text-to-Layout, aiming to lower design barriers.
Text-to-Layout is a challenging task because it must handle the implicit, combined, and incomplete constraints expressed in text.
We present a two-stage approach, named parse-then-place, to address this problem.
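The two-stage idea can be illustrated with a toy sketch: stage one parses the description into explicit element constraints, stage two places elements to satisfy them. The constraint format and greedy placer below are assumptions for illustration, not the paper's actual models.

```python
# Toy sketch of parse-then-place: parse text into element constraints,
# then solve for positions. Both stages here are illustrative stand-ins
# for the paper's learned parser and placement model.
import re

def parse(description):
    """Stage 1: extract (element, count) constraints like 'two buttons'."""
    words = {"one": 1, "two": 2, "three": 3}
    constraints = []
    for count, noun in re.findall(r"(one|two|three)\s+(\w+)", description):
        constraints.append((noun.rstrip("s"), words[count]))
    return constraints

def place(constraints, width=400, row_height=60):
    """Stage 2: greedily place each element type on its own row."""
    layout, y = [], 0
    for name, count in constraints:
        w = width // count  # split the row evenly among count elements
        for i in range(count):
            layout.append({"type": name, "box": [i * w, y, w, row_height]})
        y += row_height
    return layout

text = "a poster with one title and two buttons"
boxes = place(parse(text))
assert [b["type"] for b in boxes] == ["title", "button", "button"]
```

Separating the stages makes the implicit constraints explicit and inspectable: the output of `parse` is exactly the intermediate representation the placer must satisfy.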
arXiv Detail & Related papers (2023-08-24T10:37:00Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Towards Language-guided Interactive 3D Generation: LLMs as Layout Interpreter with Generative Feedback [20.151147653552155]
Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities.
We propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter.
Our system also incorporates LLaVA, a large language and vision assistant, to provide generative feedback from the visual aspect for improving the visual quality of generated content.
arXiv Detail & Related papers (2023-05-25T07:43:39Z)
- Composition-aware Graphic Layout GAN for Visual-textual Presentation Designs [24.29890251913182]
We study the graphic layout generation problem of producing high-quality visual-textual presentation designs for given images.
We propose a deep generative model, dubbed as composition-aware graphic layout GAN (CGL-GAN), to synthesize layouts based on the global and spatial visual contents of input images.
arXiv Detail & Related papers (2022-04-30T16:42:13Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping images and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.