SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing
- URL: http://arxiv.org/abs/2310.08094v1
- Date: Thu, 12 Oct 2023 07:40:39 GMT
- Title: SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing
- Authors: Zijie Wu, Chaohui Yu, Zhen Zhu, Fan Wang, Xiang Bai
- Abstract summary: SingleInsert is an image-to-text (I2T) inversion method that learns a concept from a single source image.
In this work, we propose a simple and effective baseline for single-image I2T inversion, named SingleInsert.
With the proposed techniques, SingleInsert excels in single concept generation with high visual fidelity while allowing flexible editing.
- Score: 59.3017821001455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in text-to-image (T2I) models enables high-quality image
generation with flexible textual control. To utilize the abundant visual priors
in the off-the-shelf T2I models, a series of methods try to invert an image to
proper embedding that aligns with the semantic space of the T2I model. However,
these image-to-text (I2T) inversion methods typically need multiple source
images containing the same concept or struggle with the imbalance between
editing flexibility and visual fidelity. In this work, we point out that the
critical problem lies in the foreground-background entanglement when learning
an intended concept, and propose a simple and effective baseline for
single-image I2T inversion, named SingleInsert. SingleInsert adopts a two-stage
scheme. In the first stage, we regulate the learned embedding to concentrate on
the foreground area without being associated with the irrelevant background. In
the second stage, we finetune the T2I model for better visual resemblance and
devise a semantic loss to prevent the language drift problem. With the proposed
techniques, SingleInsert excels in single concept generation with high visual
fidelity while allowing flexible editing. Additionally, SingleInsert can
perform single-image novel view synthesis and multiple concepts composition
without requiring joint training. To facilitate evaluation, we design an
editing prompt list and introduce a metric named Editing Success Rate (ESR) for
quantitative assessment of editing flexibility. Our project page is:
https://jarrentwu1031.github.io/SingleInsert-web/
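Below is a minimal sketch of the two-stage scheme described in the abstract above, using toy stand-ins (a small embedding table for the text encoder and a tiny conditional denoiser for the T2I model). The foreground-mask source, loss weights, and step counts are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDenoiser(nn.Module):
    """Toy conditional denoiser: predicts noise from a noisy latent and a text embedding."""

    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(4, 4, 3, padding=1)
        self.cond = nn.Linear(embed_dim, 4)

    def forward(self, noisy_latent, cond_embed):
        bias = self.cond(cond_embed).view(1, 4, 1, 1)
        return self.conv(noisy_latent) + bias


text_encoder = nn.Embedding(1000, 64)               # stand-in for the frozen text encoder
denoiser = ToyDenoiser()                            # stand-in for the T2I denoiser
concept_token = nn.Parameter(torch.randn(64))       # the learned I2T embedding
source_latent = torch.randn(1, 4, 32, 32)           # latent of the single source image
fg_mask = (torch.rand(1, 1, 32, 32) > 0.5).float()  # hypothetical foreground mask


def masked_denoise_loss(latent, cond_embed, mask=None):
    """Noise-prediction loss; the mask restricts supervision to the foreground area."""
    noise = torch.randn_like(latent)
    pred = denoiser(latent + noise, cond_embed)
    err = (pred - noise) ** 2
    return (err * mask).mean() if mask is not None else err.mean()


# Stage 1: optimize only the concept embedding on the foreground region, so the
# token is not entangled with the irrelevant background.
denoiser.requires_grad_(False)
opt1 = torch.optim.Adam([concept_token], lr=1e-3)
for _ in range(100):
    loss = masked_denoise_loss(source_latent, concept_token, fg_mask)
    opt1.zero_grad()
    loss.backward()
    opt1.step()

# Stage 2: also finetune the denoiser for better visual resemblance, and add a
# semantic loss that keeps the learned token close to a coarse class embedding
# (e.g. the word "dog") to counter language drift. The 0.1 weight is assumed.
denoiser.requires_grad_(True)
class_embed = text_encoder(torch.tensor(42)).detach()   # hypothetical class-word embedding
opt2 = torch.optim.Adam(list(denoiser.parameters()) + [concept_token], lr=1e-4)
for _ in range(100):
    loss = masked_denoise_loss(source_latent, concept_token, fg_mask)
    loss = loss + 0.1 * (1.0 - F.cosine_similarity(concept_token, class_embed, dim=0))
    opt2.zero_grad()
    loss.backward()
    opt2.step()
```

In the actual method the stand-ins would presumably correspond to the frozen text encoder and latent-diffusion denoiser of the T2I model, with the foreground mask derived from the single source image.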
Related papers
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA), which controls the visual attention maps using syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
- Direct Consistency Optimization for Compositional Text-to-Image Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model (see the sketch after this entry).
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
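The objective summarized in the entry above (maximize consistency to the reference images while penalizing deviation from the pretrained model) can be pictured as a two-term training loss. The sketch below is only a guess at that general shape; the squared-error form of the drift term, its 0.5 weight, and the dummy tensors are assumptions, not the paper's formulation.

```python
import torch


def dco_style_loss(pred_finetuned: torch.Tensor,
                   pred_pretrained: torch.Tensor,
                   target_noise: torch.Tensor,
                   drift_weight: float = 0.5) -> torch.Tensor:
    # Fit the reference image (consistency) while staying near the frozen pretrained model (drift penalty).
    consistency = ((pred_finetuned - target_noise) ** 2).mean()
    drift = ((pred_finetuned - pred_pretrained.detach()) ** 2).mean()
    return consistency + drift_weight * drift


# Usage with dummy tensors standing in for noise predictions on a reference image.
pred_ft = torch.randn(1, 4, 32, 32, requires_grad=True)
pred_pt = torch.randn(1, 4, 32, 32)
noise = torch.randn(1, 4, 32, 32)
loss = dco_style_loss(pred_ft, pred_pt, noise)
loss.backward()
```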
- Forgedit: Text Guided Image Editing via Learning and Forgetting [17.26772361532044]
We design a novel text-guided image editing method, named Forgedit.
First, we propose a vision-language joint optimization framework capable of reconstructing the original image in 30 seconds.
Then, we propose a novel vector projection mechanism in text embedding space of Diffusion Models.
arXiv Detail & Related papers (2023-09-19T12:05:26Z)
- Continuous Layout Editing of Single Images with Diffusion Models [24.581184791106562]
We propose the first framework for layout editing of a single image while preserving its visual properties.
Our approach is achieved through two key modules.
Our code will be freely available for public use upon acceptance.
arXiv Detail & Related papers (2023-06-22T17:51:05Z)
- Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models [94.25020178662392]
Text-to-image (T2I) research has grown explosively in the past year.
One pain point persists: text prompt engineering, where searching for high-quality prompts that yield customized results is more art than science.
In this paper, we take "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users.
arXiv Detail & Related papers (2023-05-25T16:30:07Z)
- StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [86.92711729969488]
We exploit the capabilities of pretrained diffusion models for image editing.
Existing approaches either finetune the model or invert the image into the latent space of the pretrained model.
They suffer from two problems: unsatisfactory results for selected regions, and unexpected changes in nonselected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process (see the sketch after this entry).
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
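The cross-attention guidance mentioned in the Zero-shot Image-to-Image Translation entry above can be pictured as a gradient step that pulls the attention maps of the editing pass toward those recorded from the input image. The sketch below is an assumption about the general shape of such a step, with random tensors standing in for real attention maps; it is not the paper's implementation.

```python
import torch


def attention_guidance_step(latent: torch.Tensor,
                            attn_edit: torch.Tensor,
                            attn_ref: torch.Tensor,
                            step_size: float = 0.1) -> torch.Tensor:
    """One guidance step: descend the gradient of ||A_edit - A_ref||^2 w.r.t. the latent."""
    loss = ((attn_edit - attn_ref) ** 2).mean()
    grad, = torch.autograd.grad(loss, latent)
    return latent - step_size * grad


# Dummy example: attn_edit must depend on the latent for the gradient to exist.
latent = torch.randn(1, 4, 32, 32, requires_grad=True)
attn_ref = torch.rand(8, 64, 77)                          # maps recorded from the input image
attn_edit = torch.rand(8, 64, 77) + 0.01 * latent.mean()  # toy dependence on the latent
latent_next = attention_guidance_step(latent, attn_edit, attn_ref)
```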
- UniTune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image [2.999198565272416]
We make the observation that image-generation models can be converted to image-editing models simply by fine-tuning them on a single image.
We propose UniTune, a novel image editing method. UniTune takes an arbitrary image and a textual edit description as input, and carries out the edit while maintaining high fidelity to the input image.
We demonstrate that it is broadly applicable and can perform a surprisingly wide range of expressive editing operations, including those requiring significant visual changes that were previously impossible.
arXiv Detail & Related papers (2022-10-17T23:46:05Z)