Retrieval-Augmented Diffusion Models
- URL: http://arxiv.org/abs/2204.11824v2
- Date: Tue, 26 Apr 2022 13:37:02 GMT
- Title: Retrieval-Augmented Diffusion Models
- Authors: Andreas Blattmann, Robin Rombach, Kaan Oktay, Björn Ommer
- Abstract summary: We propose to complement the diffusion model with a retrieval-based approach and to introduce an explicit memory in the form of an external database.
By leveraging CLIP's joint image-text embedding space, our model achieves highly competitive performance on tasks for which it has not been explicitly trained.
Our approach incurs low computational and memory overheads and is easy to implement.
- Score: 11.278903078792917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative image synthesis with diffusion models has recently achieved
excellent visual quality in several tasks such as text-based or
class-conditional image synthesis. Much of this success is due to a dramatic
increase in the computational capacity invested in training these models. This
work presents an alternative approach: inspired by its successful application
in natural language processing, we propose to complement the diffusion model
with a retrieval-based approach and to introduce an explicit memory in the form
of an external database. During training, our diffusion model is trained with
similar visual features retrieved via CLIP and from the neighborhood of each
training instance. By leveraging CLIP's joint image-text embedding space, our
model achieves highly competitive performance on tasks for which it has not
been explicitly trained, such as class-conditional or text-image synthesis, and
can be conditioned on both text and image embeddings. Moreover, we can apply
our approach to unconditional generation, where it achieves state-of-the-art
performance. Our approach incurs low computational and memory overheads and is
easy to implement. We discuss its relationship to concurrent work and will
publish code and pretrained models soon.
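The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the retrieval step it outlines: CLIP-embed each training image, look up its nearest neighbors in an external database of CLIP image embeddings, and pass those neighbors to the denoiser as conditioning. It assumes the Hugging Face `transformers` CLIP interface; `denoiser`, `add_noise`, and the loss lines are hypothetical placeholders, not the authors' released code.
```python
# Minimal sketch of the retrieval step described in the abstract, assuming a
# CLIP image encoder via Hugging Face `transformers`. The conditional
# `denoiser(x_t, t, context=...)` interface below is an assumption for
# illustration only; the actual RDM training code is not reproduced here.
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def embed_images(pil_images):
    """CLIP image embeddings, L2-normalised so cosine similarity = dot product."""
    inputs = processor(images=pil_images, return_tensors="pt").to(device)
    feats = clip.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)


def retrieve_neighbors(query_emb, database_emb, k=4):
    """Return the k nearest database embeddings for each query embedding."""
    sims = query_emb @ database_emb.T          # (B, N) cosine similarities
    idx = sims.topk(k, dim=-1).indices         # (B, k) neighbor indices
    return database_emb[idx]                   # (B, k, d) neighbor embeddings


# Training-time usage (schematic): condition the denoiser on the retrieved
# neighbors of each training image. `add_noise`, `denoiser`, `x0`, and `t`
# are placeholders for whatever diffusion backbone is used.
# x0_emb     = embed_images(batch_pil)                    # query embeddings
# neighbors  = retrieve_neighbors(x0_emb, database_emb)   # external memory
# x_t, noise = add_noise(x0, t)
# loss = ((denoiser(x_t, t, context=neighbors) - noise) ** 2).mean()
```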
Related papers
- Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings [16.28853186016663]
We create synthetic image-text pairs for efficient and effective Visual-Language Model (VLM) training.
Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM.
Our VLM, finetuned on synthetic data, achieves performance comparable to models trained solely on human-annotated data.
arXiv Detail & Related papers (2024-03-12T15:36:42Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with
Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach makes text-to-image diffusion models easier to use and improves the user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different synthesis stages.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z) - Text-Guided Synthesis of Artistic Images with Retrieval-Augmented
Diffusion Models [12.676356746752894]
We present an alternative approach based on retrieval-augmented diffusion models (RDMs).
We replace the retrieval database with a more specialized database that contains only images of a particular visual style.
This provides a novel way to prompt a generally trained model after training and thereby specify a particular visual style (see the sketch after this list).
arXiv Detail & Related papers (2022-07-26T16:56:51Z) - DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder [73.1010640692609]
We propose a VQ-VAE architecture model with a diffusion decoder (DiVAE) to work as the reconstructing component in image synthesis.
Our model achieves state-of-the-art results and, in particular, generates more photorealistic images.
arXiv Detail & Related papers (2022-06-01T10:39:12Z) - LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first method for training text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z) - Meta Internal Learning [88.68276505511922]
Internal learning for single-image generation is a framework in which a generator is trained to produce novel images based on a single image.
We propose a meta-learning approach that enables training over a collection of images, in order to model the internal statistics of the sample image more effectively.
Our results show that the models obtained are as suitable as single-image GANs for many common image applications.
arXiv Detail & Related papers (2021-10-06T16:27:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.