GLIGEN: Open-Set Grounded Text-to-Image Generation
- URL: http://arxiv.org/abs/2301.07093v2
- Date: Mon, 17 Apr 2023 01:54:37 GMT
- Title: GLIGEN: Open-Set Grounded Text-to-Image Generation
- Authors: Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang,
Jianfeng Gao, Chunyuan Li, Yong Jae Lee
- Abstract summary: Grounded-Language-to-Image Generation is a novel approach that builds upon and extends the functionality of existing text-to-image diffusion models.
Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs.
GLIGEN's zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin.
- Score: 97.72536364118024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale text-to-image diffusion models have made amazing advances.
However, the status quo is to use text input alone, which can impede
controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image
Generation, a novel approach that builds upon and extends the functionality of
existing pre-trained text-to-image diffusion models by enabling them to also be
conditioned on grounding inputs. To preserve the vast concept knowledge of the
pre-trained model, we freeze all of its weights and inject the grounding
information into new trainable layers via a gated mechanism. Our model achieves
open-world grounded text2img generation with caption and bounding box condition
inputs, and the grounding ability generalizes well to novel spatial
configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS
outperforms that of existing supervised layout-to-image baselines by a large
margin.
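The abstract's central mechanism is injecting grounding information through new trainable layers that are gated so the frozen backbone is undisturbed at initialization. Below is a minimal PyTorch sketch of that idea, not the paper's exact implementation: a trainable attention layer over visual tokens concatenated with grounding tokens (e.g., embeddings of a phrase plus its bounding box), whose residual contribution is scaled by a zero-initialized tanh gate. The class name, shapes, and use of nn.MultiheadAttention are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GatedGroundingLayer(nn.Module):
    """Trainable injection layer added alongside the frozen blocks of a
    pre-trained diffusion backbone (sketch; names and shapes are assumptions)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts at zero, so tanh(gate) = 0 and the frozen model's
        # original behavior is exactly preserved at the start of training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual_tokens: torch.Tensor,
                grounding_tokens: torch.Tensor) -> torch.Tensor:
        # Attend over visual + grounding tokens, keep only the visual positions.
        x = self.norm(torch.cat([visual_tokens, grounding_tokens], dim=1))
        attn_out, _ = self.attn(x, x, x)
        attn_out = attn_out[:, : visual_tokens.size(1)]
        # Gated residual: grounding information is blended in gradually as the
        # gate is learned, while the pre-trained weights stay frozen.
        return visual_tokens + torch.tanh(self.gate) * attn_out
```

In such a setup, the grounding tokens for a caption-plus-boxes input could come from a small MLP over a phrase embedding concatenated with normalized box coordinates; only the new layers (and that MLP) would receive gradients, matching the abstract's freeze-and-inject design.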
Related papers
- LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? [10.72249123249003]
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding.
We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions.
LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z)
- Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
- VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders [31.371338262371122]
VGDiffZero is a zero-shot visual grounding framework based on text-to-image diffusion models.
We show that VGDiffZero achieves strong performance on zero-shot visual grounding.
arXiv Detail & Related papers (2023-09-03T11:32:28Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Freestyle Layout-to-Image Synthesis [42.64485133926378]
In this work, we explore the freestyle capability of the model, i.e., how far can it generate unseen semantics onto a given layout.
Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics.
The proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs.
arXiv Detail & Related papers (2023-03-25T09:37:41Z)
- Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models [43.32978092618245]
We present a novel neural pipeline for generating a coherent storybook from the plain text of a story.
We leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images.
arXiv Detail & Related papers (2023-02-08T06:24:06Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis (a routing sketch follows this list).
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial to a wide range of settings, including the few-shot, semi-supervised and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
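The eDiffi entry above describes an ensemble of denoisers specialized for different stages of synthesis. As a rough illustration of that routing idea (not eDiffi's actual code), the sketch below picks the expert whose timestep interval contains the current step; the expert call signature (x_t, t, text_emb), the boundary scheme, and the class name are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class StagedDenoiserEnsemble(nn.Module):
    """Sketch: route each reverse-diffusion step to the expert denoiser
    trained for that stage (noise range) of the trajectory."""

    def __init__(self, experts: list[nn.Module], boundaries: list[int]):
        super().__init__()
        self.experts = nn.ModuleList(experts)  # one denoiser per stage
        # Ascending timestep cut points, e.g. [300, 700] splits [0, T) into 3 stages.
        self.register_buffer("boundaries", torch.tensor(boundaries))

    def forward(self, x_t: torch.Tensor, t: int,
                text_emb: torch.Tensor) -> torch.Tensor:
        # torch.bucketize finds which interval t falls into; since exactly one
        # expert runs per step, switching experts adds no inference cost.
        stage = int(torch.bucketize(torch.tensor(t), self.boundaries))
        return self.experts[stage](x_t, t, text_emb)
```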