Re-Imagen: Retrieval-Augmented Text-to-Image Generator
- URL: http://arxiv.org/abs/2209.14491v2
- Date: Sat, 1 Oct 2022 15:14:14 GMT
- Title: Re-Imagen: Retrieval-Augmented Text-to-Image Generator
- Authors: Wenhu Chen, Hexiang Hu, Chitwan Saharia, William W. Cohen
- Abstract summary: Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
- Score: 58.60472701831404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research on text-to-image generation has witnessed significant progress in
generating diverse and photo-realistic images, driven by diffusion and
auto-regressive models trained on large-scale image-text data. Though
state-of-the-art models can generate high-quality images of common entities,
they often have difficulty generating images of uncommon entities, such as
'Chortai (dog)' or 'Picarones (food)'. To tackle this issue, we present the
Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model
that uses retrieved information to produce high-fidelity and faithful images,
even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an
external multi-modal knowledge base to retrieve relevant (image, text) pairs,
and uses them as references to generate the image. With this retrieval step,
Re-Imagen is augmented with the knowledge of high-level semantics and low-level
visual details of the mentioned entities, and thus improves its accuracy in
generating the entities' visual appearances. We train Re-Imagen on a
constructed dataset containing (image, text, retrieval) triples to teach the
model to ground on both the text prompt and the retrieval. Furthermore, we develop a
new sampling strategy that interleaves classifier-free guidance for the text and
retrieval conditions to balance text and retrieval alignment. Re-Imagen
achieves new SoTA FID results on two image generation benchmarks, COCO
(FID = 5.25) and WikiImage (FID = 5.82), without fine-tuning. To further
evaluate the capabilities of the model, we introduce EntityDrawBench, a new
benchmark that evaluates image generation for diverse entities, from frequent
to rare, across multiple visual domains. Human evaluation on EntityDrawBench
shows that Re-Imagen performs on par with the best prior models in
photo-realism, but with significantly better faithfulness, especially on less
frequent entities.
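The interleaved guidance mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact schedule, so everything below is an assumption for illustration rather than the authors' implementation: the names `guided_eps`, `interleaved_guidance_step`, and `toy_denoiser`, the weights `w_text` and `w_retr`, the mixing parameter `text_ratio`, and the choice to pick the guided condition at random per sampling step. The only standard ingredient is the classifier-free guidance extrapolation itself.

```python
import numpy as np


def guided_eps(eps_full, eps_partial, weight):
    """Classifier-free guidance: extrapolate from a partially conditioned
    noise prediction toward the fully conditioned one."""
    return weight * eps_full - (weight - 1.0) * eps_partial


def interleaved_guidance_step(denoiser, x_t, text, retrieval, *,
                              w_text, w_retr, text_ratio, rng):
    """Return a guided noise estimate for one sampling step.

    With probability `text_ratio` the step strengthens text alignment;
    otherwise it strengthens retrieval alignment (assumed schedule).
    """
    eps_full = denoiser(x_t, text, retrieval)
    if rng.random() < text_ratio:
        # Text-guided step: the baseline drops the text condition.
        eps_base = denoiser(x_t, None, retrieval)
        return guided_eps(eps_full, eps_base, w_text)
    # Retrieval-guided step: the baseline drops the retrieval condition.
    eps_base = denoiser(x_t, text, None)
    return guided_eps(eps_full, eps_base, w_retr)


def toy_denoiser(x_t, text, retrieval):
    """Hypothetical stand-in for a trained denoising network."""
    out = 0.1 * x_t
    if text is not None:
        out = out + text
    if retrieval is not None:
        out = out + retrieval
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_t = np.zeros(3)
    text_emb = np.ones(3)
    retr_emb = 2.0 * np.ones(3)
    for step in range(4):
        eps = interleaved_guidance_step(
            toy_denoiser, x_t, text_emb, retr_emb,
            w_text=2.0, w_retr=3.0, text_ratio=0.5, rng=rng,
        )
        print(step, eps)
```

In this sketch, raising `text_ratio` would spend more sampling steps on text alignment, and lowering it would spend more on alignment with the retrieved references.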
Related papers
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z)
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation [5.55027585813848]
The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications.
We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text.
We demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores.
arXiv Detail & Related papers (2024-03-25T04:54:49Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval [85.03655458677295]
Image-text retrieval has gradually become a major research direction in the field of information retrieval.
We first examine the related concerns and why the focus is on image-text retrieval tasks.
We analyze various aspects of the reproduction of pretrained and nonpretrained retrieval models.
arXiv Detail & Related papers (2022-03-08T05:01:43Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.