LeftRefill: Filling Right Canvas based on Left Reference through
Generalized Text-to-Image Diffusion Model
- URL: http://arxiv.org/abs/2305.11577v3
- Date: Sat, 2 Mar 2024 12:03:56 GMT
- Title: LeftRefill: Filling Right Canvas based on Left Reference through
Generalized Text-to-Image Diffusion Model
- Authors: Chenjie Cao, Yunuo Cai, Qiaole Dong, Yikai Wang, Yanwei Fu
- Abstract summary: LeftRefill is an innovative approach to harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis.
- Score: 55.20469538848806
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces LeftRefill, an innovative approach to efficiently
harness large Text-to-Image (T2I) diffusion models for reference-guided image
synthesis. As the name implies, LeftRefill horizontally stitches reference and
target views together as a whole input. The reference image occupies the left
side, while the target canvas is positioned on the right. Then, LeftRefill
paints the right-side target canvas based on the left-side reference and
specific task instructions. Such a task formulation shares some similarities
with contextual inpainting, akin to the actions of a human painter. This novel
formulation efficiently learns both structural and textured correspondence
between reference and target without other image encoders or adapters. We
inject task and view information through cross-attention modules in T2I models,
and further exhibit multi-view reference ability via the re-arranged
self-attention modules. These enable LeftRefill to perform consistent
generation as a generalized model without requiring test-time fine-tuning or
model modifications. Thus, LeftRefill can be seen as a simple yet unified
framework to address reference-guided synthesis. As an exemplar, we leverage
LeftRefill to address two different challenges: reference-guided inpainting and
novel view synthesis, based on the pre-trained StableDiffusion. Codes and
models are released at https://github.com/ewrfcas/LeftRefill.
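To make the "left reference, right canvas" formulation concrete, below is a minimal sketch of the input layout described in the abstract, built on a stock Stable Diffusion inpainting pipeline from the diffusers library. It only reproduces the horizontal stitching and right-half mask; the actual LeftRefill model additionally injects learned task and view tokens through cross-attention and re-arranges self-attention for multi-view references. The checkpoint name, prompt text, and resolution are illustrative assumptions, not the released code.

```python
# Minimal sketch (assumed, not the released LeftRefill code): stitch the
# reference view onto the left half of a double-width canvas and ask an
# off-the-shelf inpainting model to fill the blank right half.
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

W = H = 512  # per-view resolution (illustrative)

reference = Image.open("reference_view.png").convert("RGB").resize((W, H))

# Left side: reference image; right side: empty target canvas.
stitched = Image.new("RGB", (2 * W, H))
stitched.paste(reference, (0, 0))

# Mask is white (255) where the model should paint, i.e. only the right half.
mask = Image.new("L", (2 * W, H), 0)
mask.paste(Image.new("L", (W, H), 255), (W, 0))

# Assumed base checkpoint; LeftRefill fine-tunes StableDiffusion with learned
# task/view prompt tokens rather than a free-text prompt like the one below.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
)
result = pipe(
    prompt="the right-hand view of the same scene as the left image",
    image=stitched,
    mask_image=mask,
    height=H,
    width=2 * W,
).images[0]
result.save("filled_right_canvas.png")
```

Because the mask covers only the right half, every denoising step is conditioned on the untouched left-side reference, which is the contextual-inpainting behaviour the paper builds on.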
Related papers
- AnyRefill: A Unified, Data-Efficient Framework for Left-Prompt-Guided Vision Tasks [116.8706375364465]
We present a novel Left-Prompt-Guided (LPG) paradigm to address a diverse range of reference-based vision tasks.
We propose AnyRefill, which effectively adapts Text-to-Image (T2I) models to various vision tasks.
arXiv Detail & Related papers (2025-02-16T15:12:40Z) - Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z) - Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z) - DEADiff: An Efficient Stylization Diffusion Model with Disentangled
Representations [64.43387739794531]
Current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles.
We introduce DEADiff to address this issue using the following two strategies.
DEADiff attains the best visual stylization results and an optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image.
arXiv Detail & Related papers (2024-03-11T17:35:23Z) - SingleInsert: Inserting New Concepts from a Single Image into
Text-to-Image Models for Flexible Editing [59.3017821001455]
In this work, we propose SingleInsert, a simple and effective baseline for image-to-text (I2T) inversion from a single source image containing the target concept.
With the proposed techniques, SingleInsert excels in single concept generation with high visual fidelity while allowing flexible editing.
arXiv Detail & Related papers (2023-10-12T07:40:39Z) - Bi-directional Training for Composed Image Retrieval via Text Prompt
Learning [46.60334745348141]
Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text.
We propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures.
Experiments on two standard datasets show that our novel approach achieves improved performance over a baseline BLIP-based model.
arXiv Detail & Related papers (2023-03-29T11:37:41Z) - High-Fidelity Guided Image Synthesis with Latent Diffusion Models [50.39294302741698]
Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
arXiv Detail & Related papers (2022-11-30T15:43:20Z) - Compact Bidirectional Transformer for Image Captioning [15.773455578749118]
We introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly.
We conduct extensive ablation studies on the MSCOCO benchmark and find that the compact architecture serves as a regularization for implicitly exploiting bidirectional context.
We achieve new state-of-the-art results in comparison with non-vision-language-pretraining models.
arXiv Detail & Related papers (2022-01-06T09:23:18Z)