Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models
- URL: http://arxiv.org/abs/2207.13038v1
- Date: Tue, 26 Jul 2022 16:56:51 GMT
- Title: Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models
- Authors: Robin Rombach, Andreas Blattmann, and Björn Ommer
- Abstract summary: We present an alternative approach based on retrieval-augmented diffusion models (RDMs).
We replace the retrieval database with a more specialized database that contains only images of a particular visual style.
This provides a novel way to prompt a generally trained model after training and thereby specify a particular visual style.
- Score: 12.676356746752894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Novel architectures have recently improved generative image synthesis leading
to excellent visual quality in various tasks. Of particular note is the field
of "AI-Art", which has seen unprecedented growth with the emergence of
powerful multimodal models such as CLIP. By combining text and image
synthesis models, so-called "prompt-engineering" has become established, in
which carefully selected and composed sentences are used to achieve a certain
visual style in the synthesized image. In this note, we present an alternative
approach based on retrieval-augmented diffusion models (RDMs). In RDMs, a set
of nearest neighbors is retrieved from an external database during training for
each training instance, and the diffusion model is conditioned on these
informative samples. During inference (sampling), we replace the retrieval
database with a more specialized database that contains, for example, only
images of a particular visual style. This provides a novel way to prompt a
generally trained model after training and thereby specify a particular visual
style. As shown by our experiments, this approach is superior to specifying the
visual style within the text prompt. We open-source code and model weights at
https://github.com/CompVis/latent-diffusion .
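As a rough illustration of the retrieval step described in the abstract, the sketch below embeds a text prompt with CLIP, retrieves its nearest neighbors from a style-specific image database, and hands the retrieved embeddings to a neighbor-conditioned sampler. This is a minimal sketch, not the released CompVis implementation: it uses the Hugging Face `transformers` CLIP interface, and the image paths, the `sample_with_neighbors` call, and the `rdm_model` object are hypothetical placeholders.

```python
# Minimal sketch of RDM-style inference with a swapped, style-specific
# retrieval database. Illustrative only; the neighbor-conditioned sampler
# is a hypothetical stand-in for the actual RDM sampling code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(paths):
    """L2-normalized CLIP image embeddings for a list of image files."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = proc(images=images, return_tensors="pt").to(device)
    feats = clip.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_text(prompt):
    """L2-normalized CLIP text embedding for a single prompt."""
    inputs = proc(text=[prompt], return_tensors="pt", padding=True).to(device)
    feats = clip.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def retrieve_neighbors(prompt, db_embeddings, k=4):
    """Return the k database embeddings closest to the prompt in CLIP space."""
    sims = embed_text(prompt) @ db_embeddings.T   # cosine similarity on unit vectors
    idx = sims.squeeze(0).topk(k).indices
    return db_embeddings[idx]

# At inference time, the general-purpose database is replaced by a small,
# style-specific one (e.g. only expressionist paintings); the diffusion model
# is then conditioned on the retrieved CLIP embeddings rather than on a
# style phrase appended to the text prompt.
style_db = embed_images(["expressionism/img_001.jpg", "expressionism/img_002.jpg"])  # hypothetical files
neighbors = retrieve_neighbors("a harbor at sunset", style_db, k=2)
# image = sample_with_neighbors(rdm_model, neighbors)  # hypothetical RDM sampler
```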
Related papers
- ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer [40.32254040909614]
We propose ACE, an All-round Creator and Editor, for visual generation tasks.
We first introduce a unified condition format termed Long-context Condition Unit (LCU).
We then propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks.
arXiv Detail & Related papers (2024-09-30T17:56:27Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Measuring Style Similarity in Diffusion Models [118.22433042873136]
We present a framework for understanding and extracting style descriptors from images.
Our framework comprises a new dataset curated using the insight that style is a subjective property of an image.
We also propose a method to extract style attribute descriptors that can be used to attribute the style of a generated image to the images used in the training dataset of a text-to-image model.
arXiv Detail & Related papers (2024-04-01T17:58:30Z)
- Diversified in-domain synthesis with efficient fine-tuning for few-shot classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z)
- SGDiff: A Style Guided Diffusion Model for Fashion Synthesis [2.4578723416255754]
The proposed SGDiff combines image modality with a pretrained text-to-image diffusion model to facilitate creative fashion image synthesis.
It addresses the limitations of text-to-image diffusion models by incorporating supplementary style guidance.
This paper also introduces a new dataset -- SG-Fashion, specifically designed for fashion image synthesis applications.
arXiv Detail & Related papers (2023-08-15T07:20:22Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation [26.748667878221568]
We present a new approach for "personalization" of text-to-image models.
We fine-tune a pretrained text-to-image model to bind a unique identifier with a specific subject.
The unique identifier can then be used to synthesize fully novel photorealistic images of the subject contextualized in different scenes.
arXiv Detail & Related papers (2022-08-25T17:45:49Z)
- Retrieval-Augmented Diffusion Models [11.278903078792917]
We propose to complement the diffusion model with a retrieval-based approach and to introduce an explicit memory in the form of an external database.
By leveraging CLIP's joint image-text embedding space, our model achieves highly competitive performance on tasks for which it has not been explicitly trained.
Our approach incurs low computational and memory overheads and is easy to implement; a sketch of such an external memory appears after this list.
arXiv Detail & Related papers (2022-04-25T17:55:26Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning maps the image and text into a common embedding space to learn text-image matching.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
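The "Retrieval-Augmented Diffusion Models" entry above describes the external database as an explicit memory queried in CLIP's joint image-text embedding space. Below is a minimal sketch of how such a memory could be built, persisted, and queried with FAISS; it assumes the CLIP image embeddings have already been computed (for example with a helper like `embed_images` from the earlier sketch) and is an illustrative approximation, not the authors' implementation.

```python
# Minimal sketch of an explicit external memory: a FAISS index over
# L2-normalized CLIP image embeddings, queried by cosine similarity
# (inner product on unit vectors). Not the authors' implementation.
import numpy as np
import faiss

def build_memory(image_embeddings: np.ndarray, path: str = "style_memory.faiss") -> faiss.Index:
    """Store CLIP image embeddings (shape [n, d]) in a flat inner-product index."""
    emb = np.ascontiguousarray(image_embeddings.astype("float32"))
    faiss.normalize_L2(emb)                 # cosine similarity == inner product
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    faiss.write_index(index, path)          # persist so databases can be swapped at inference
    return index

def query_memory(index: faiss.Index, query_embedding: np.ndarray, k: int = 4):
    """Return (similarities, ids) of the k nearest memory entries for a CLIP query vector."""
    q = np.ascontiguousarray(query_embedding.astype("float32").reshape(1, -1))
    faiss.normalize_L2(q)
    sims, ids = index.search(q, k)
    return sims[0], ids[0]

# Swapping the database at sampling time amounts to loading a different index,
# e.g. one built only from images of the desired visual style:
# style_index = faiss.read_index("expressionism_memory.faiss")  # hypothetical file
```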
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.