Related papers: Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation

Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation

URL: http://arxiv.org/abs/2310.08541v1
Date: Thu, 12 Oct 2023 17:34:20 GMT
Title: Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation
Authors: Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang
Abstract summary: We introduce Idea to Image'', a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation. We investigate if systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities.
Score: 121.42924593374127
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce ``Idea to Image,'' a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation. Humans can quickly identify the characteristics of different text-to-image (T2I) models via iterative explorations. This enables them to efficiently convert their high-level generation ideas into effective T2I prompts that can produce good images. We investigate if systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities that enable exploring unknown models or environments via self-refining tries. Idea2Img cyclically generates revised T2I prompts to synthesize draft images, and provides directional feedback for prompt revision, both conditioned on its memory of the probed T2I model's characteristics. The iterative self-refinement brings Idea2Img various advantages over vanilla T2I models. Notably, Idea2Img can process input ideas with interleaved image-text sequences, follow ideas with design instructions, and generate images of better semantic and visual qualities. The user preference study validates the efficacy of multimodal iterative self-refinement on automatic image design and generation.

Related papers

IA-T2I: Internet-Augmented Text-to-Image Generation [13.765327654914199]
Current text-to-image (T2I) generation models achieve promising results, but they fail on the scenarios where the knowledge implied in the text prompt is uncertain.<n>We propose an Internet-Augmented text-to-image generation (IA-T2I) framework to compel T2I models clear about such uncertain knowledge by providing them with reference images.
arXiv Detail & Related papers (2025-05-21T17:31:49Z)
Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image [53.09546752700792]
We propose a strategy to instruct this replacing process, which is called as Explicit Logical Narrative Prompt (ELNP)<n>We design a metric to calculate how many required concepts in the prompt can be covered averagely in the synthesized images.<n>The extensive experiments and qualitative comparisons demonstrate that our strategy can boost the concept alignment in counterfactual T2I.
arXiv Detail & Related papers (2025-05-20T13:27:52Z)
MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation [38.517814177255765]
We introduce MINT, an innovative unified generative model, empowered with native multimodal chain of thought (MCoT) for enhanced image generation. We propose an innovative MCoT training paradigm, a step-by-step approach to multimodal thinking, reasoning, and reflection specifically designed to enhance image generation. MINT has been validated to exhibit superior performance across multiple benchmarks for text-to-image (T2I) and image-to-text (I2T) tasks.
arXiv Detail & Related papers (2025-03-03T08:36:16Z)
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting [18.002126814513417]
ChatGen-Evo is a multi-stage evolution strategy that progressively equips models with essential automation skills. ChatGen-Evo significantly enhances performance over various baselines.
arXiv Detail & Related papers (2024-11-26T07:31:12Z)
Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model. We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative Framework [3.7953598825170753]
Kandinsky 3 is a novel T2I model based on latent diffusion, achieving a high level of quality and photorealism. We extend the base T2I model for various applications and create a multifunctional generation system. Human evaluations show that Kandinsky 3 demonstrates one of the highest quality scores among open source generation systems.
arXiv Detail & Related papers (2024-10-28T14:22:08Z)
ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising. We present textbfID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z)
Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation [150.57983348059528]
PRISM is an algorithm that automatically identifies human-interpretable and transferable prompts. It can effectively generate desired concepts given only black-box access to T2I models. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles and images.
arXiv Detail & Related papers (2024-03-28T02:35:53Z)
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets. We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks. We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z)
GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation [143.81719619351335]
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions. The tight coupling between the current text encoder and image decoder in T2I models makes it challenging to replace or upgrade. We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model.
arXiv Detail & Related papers (2023-03-17T15:37:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.