DreamCom: Finetuning Text-guided Inpainting Model for Image Composition
- URL: http://arxiv.org/abs/2309.15508v2
- Date: Wed, 24 Jan 2024 11:52:15 GMT
- Title: DreamCom: Finetuning Text-guided Inpainting Model for Image Composition
- Authors: Lingxiao Lu, Jiangtong Li, Bo Zhang, Li Niu
- Abstract summary: We propose DreamCom, which treats image composition as text-guided image inpainting customized for a certain object.
Specifically, we finetune a pretrained text-guided image inpainting model on a few reference images containing the same object.
In practice, the inserted object may be adversely affected by the background, so we propose a masked attention mechanism to avoid negative background interference.
- Score: 24.411003826961686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of image composition is to merge a foreground object into a
background image to obtain a realistic composite image. Recently, generative
composition methods have been built on large pretrained diffusion models, owing
to their unprecedented image generation ability. However, they are weak at
preserving the foreground object details. Inspired by recent text-to-image
generation customized for a certain object, we propose DreamCom, which treats
image composition as text-guided image inpainting customized for that object.
Specifically, we finetune a pretrained text-guided image inpainting model on
a few reference images containing the same object, during which the text
prompt contains a special token associated with this object. Then, given a new
background, we can insert this object into the background with a text prompt
containing the special token. In practice, the inserted object may be adversely
affected by the background, so we propose a masked attention mechanism to avoid
negative background interference. Experimental results on DreamEditBench and
our contributed MureCom dataset show the outstanding performance of our
DreamCom.
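The masked attention idea lends itself to a compact illustration. Below is a minimal, self-contained PyTorch sketch of a cross-attention layer in the spirit of DreamCom's masked attention: background queries are blocked from attending to the special object token, so the object-specific signal is confined to the foreground region. The class name, argument names, and the exact masking rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of masked cross-attention (illustrative; not the authors' code).
import torch
import torch.nn as nn


class MaskedCrossAttention(nn.Module):
    """Cross-attention in which background queries cannot attend to the special
    object token, confining the object-specific signal to the foreground region."""

    def __init__(self, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(text_dim, dim, bias=False)
        self.to_v = nn.Linear(text_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, text, fg_mask, object_token_mask):
        # x:                 (B, N, dim)      flattened image features
        # text:              (B, T, text_dim) text-encoder hidden states
        # fg_mask:           (B, N) bool, True inside the foreground (object) region
        # object_token_mask: (T,)   bool, True for the special object token(s)
        B, N, _ = x.shape
        q, k, v = self.to_q(x), self.to_k(text), self.to_v(text)
        q, k, v = (t.view(B, -1, self.num_heads, t.shape[-1] // self.num_heads)
                    .transpose(1, 2) for t in (q, k, v))

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, T)

        # Block background queries (~fg_mask) from the special object token(s),
        # so the background is not pulled toward the inserted object.
        block = (~fg_mask)[:, None, :, None] & object_token_mask[None, None, None, :]
        attn = attn.masked_fill(block, float("-inf"))

        out = attn.softmax(dim=-1) @ v                 # (B, H, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, -1)
        return self.to_out(out)


# Toy usage: a 16x16 latent grid, 77 text tokens, one special token at position 5.
layer = MaskedCrossAttention(dim=320, text_dim=768)
x = torch.randn(1, 16 * 16, 320)
text = torch.randn(1, 77, 768)
fg_mask = torch.zeros(1, 16 * 16, dtype=torch.bool)
fg_mask[:, :64] = True                      # pretend the object occupies these positions
object_token_mask = torch.zeros(77, dtype=torch.bool)
object_token_mask[5] = True
out = layer(x, text, fg_mask, object_token_mask)  # (1, 256, 320)
```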
Related papers
- GroundingBooth: Grounding Text-to-Image Customization [17.185571339157075]
We introduce GroundingBooth, a framework that achieves zero-shot instance-level spatial grounding on both foreground subjects and background objects.
Our proposed text-image grounding module and masked cross-attention layer allow us to generate personalized images with both accurate layout alignment and identity preservation.
arXiv Detail & Related papers (2024-09-13T03:40:58Z) - Improving Text-guided Object Inpainting with Semantic Pre-inpainting [95.17396565347936]
We decompose the typical single-stage object inpainting into two cascaded processes: semantic pre-inpainting and high-fieldity object generation.
To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion framework.
arXiv Detail & Related papers (2024-09-12T17:55:37Z) - Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model [81.96954332787655]
We introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control.
In experiments, Diffree adds new objects with a high success rate while maintaining background consistency, spatial appropriateness, and object relevance and quality.
arXiv Detail & Related papers (2024-07-24T03:58:58Z) - Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543]
Customized text-to-image generation aims to learn user-specified concepts with a few images.
Existing methods usually suffer from overfitting issues and entangle the subject-unrelated information with the learned concept.
We propose DETEX, a novel approach that learns disentangled concept embeddings for flexible customized text-to-image generation.
arXiv Detail & Related papers (2023-12-19T03:32:10Z) - LLM Blueprint: Enabling Text-to-Image Generation with Complex and
Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z) - SIEDOB: Semantic Image Editing by Disentangling Object and Background [5.149242555705579]
We propose a novel paradigm for semantic image editing.
textbfSIEDOB, the core idea of which is to explicitly leverage several heterogeneousworks for objects and backgrounds.
We conduct extensive experiments on Cityscapes and ADE20K-Room datasets and exhibit that our method remarkably outperforms the baselines.
arXiv Detail & Related papers (2023-03-23T06:17:23Z) - SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model [27.91089554671927]
Generic image inpainting aims to complete a corrupted image by borrowing surrounding information.
By contrast, multi-modal inpainting provides more flexible and useful controls on the inpainted content.
We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape guidance.
arXiv Detail & Related papers (2022-12-09T18:36:13Z) - Shape-guided Object Inpainting [84.18768707298105]
This work studies a new image inpainting task, i.e. shape-guided object inpainting.
We propose a new data preparation method and a novel Contextual Object Generator (CogNet) for the object inpainting task.
Experiments demonstrate that the proposed method can generate realistic objects that fit the context in terms of both visual appearance and semantic meanings.
arXiv Detail & Related papers (2022-04-16T17:19:11Z) - BachGAN: High-Resolution Image Synthesis from Salient Object Layout [78.51640906030244]
We propose a new task towards a more practical application of image generation: high-quality image synthesis from salient object layout.
Two main challenges spring from this new task: (i) how to generate fine-grained details and realistic textures without segmentation map input; and (ii) how to create a background and weave it seamlessly into standalone objects.
By generating the hallucinated background representation dynamically, our model can synthesize high-resolution images with both photo-realistic foreground and integral background.
arXiv Detail & Related papers (2020-03-26T00:54:44Z)