Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
- URL: http://arxiv.org/abs/2411.15466v1
- Date: Sat, 23 Nov 2024 06:17:43 GMT
- Title: Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
- Authors: Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon
- Abstract summary: Diptych Prompting is a novel zero-shot approach that reinterprets subject-driven text-to-image generation as an inpainting task with precise subject alignment.
Our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing.
- Score: 44.620847608977776
- License:
- Abstract: Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/
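The core mechanics described in the abstract can be sketched with off-the-shelf tools: build a two-panel canvas with the reference image on the left, mask the right panel, and run text-conditioned inpainting. The sketch below is only an illustration of that layout, not the paper's implementation; it uses a generic Stable Diffusion inpainting checkpoint (`stabilityai/stable-diffusion-2-inpainting`) as a stand-in for the large-scale model used in the paper, the function name `diptych_inpaint` is hypothetical, and the paper's reference background removal and cross-panel attention enhancement are not reproduced here.

```python
from PIL import Image
import torch
from diffusers import StableDiffusionInpaintPipeline


def diptych_inpaint(reference: Image.Image, prompt: str, panel: int = 512) -> Image.Image:
    """Fill the right panel of an incomplete diptych whose left panel is the reference."""
    # Left panel: the reference subject (the paper additionally removes its
    # background to prevent content leakage; omitted in this sketch).
    ref = reference.convert("RGB").resize((panel, panel))
    diptych = Image.new("RGB", (2 * panel, panel))  # right half starts empty (black)
    diptych.paste(ref, (0, 0))

    # Inpainting mask: white marks the right panel to be generated.
    mask = Image.new("L", (2 * panel, panel), 0)
    mask.paste(255, (panel, 0, 2 * panel, panel))

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    # Text-conditioned inpainting of the right panel; a diptych-style prompt
    # describing both panels (e.g. "a diptych; left: a photo of the toy robot;
    # right: the same toy robot surfing on a wave") encourages subject transfer.
    out = pipe(
        prompt=prompt,
        image=diptych,
        mask_image=mask,
        width=2 * panel,
        height=panel,
    ).images[0]

    # Keep only the newly generated right panel.
    return out.crop((panel, 0, 2 * panel, panel))
```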
Related papers
- Bringing Characters to New Stories: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting [71.29100512700064]
We present T-Prompter, a training-free method for theme-specific image generation.
T-Prompter integrates reference images into generative models, allowing users to seamlessly specify the target theme.
Our approach enables consistent story generation, character design, realistic character generation, and style-guided image generation.
arXiv Detail & Related papers (2025-01-26T19:01:19Z)
- SceneBooth: Diffusion-based Framework for Subject-preserved Text-to-Image Generation [46.43776651071455]
Existing methods often learn subject representation and incorporate it into the prompt embedding to guide image generation.
This paper proposes a novel framework named SceneBooth for subject-preserved text-to-image generation.
Our SceneBooth fixes the given subject image and generates its background image guided by the text prompts.
arXiv Detail & Related papers (2025-01-07T03:18:15Z)
- Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis [63.757624792753205]
We present Zero-Painter, a framework for layout-conditional text-to-image synthesis.
Our method utilizes object masks and individual descriptions, coupled with a global text prompt, to generate images with high fidelity.
arXiv Detail & Related papers (2024-06-06T13:02:00Z)
- Locate, Assign, Refine: Taming Customized Promptable Image Inpainting [22.163855501668206]
We introduce multimodal promptable image inpainting: a new task, model, and data for taming customized image inpainting.
We propose LAR-Gen, a novel approach to image inpainting that enables seamless inpainting of the specific regions in images that correspond to the mask prompt.
Our LAR-Gen adopts a coarse-to-fine manner to ensure context consistency with the source image, subject identity consistency, local semantic consistency with the text description, and smoothness consistency.
arXiv Detail & Related papers (2024-03-28T16:07:55Z)
- Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment [10.82748329166797]
We propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a frozen pre-trained diffusion model.
Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt.
arXiv Detail & Related papers (2023-08-30T08:40:15Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD).
In our experiments, we apply PhD to both subject-driven image editing tasks and explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- Context-Aware Image Inpainting with Learned Semantic Priors [100.99543516733341]
We introduce pretext tasks that are semantically meaningful for estimating the missing contents.
We propose a context-aware image inpainting model, which adaptively integrates global semantics and local features.
arXiv Detail & Related papers (2021-06-14T08:09:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.