A Novel Sampling Scheme for Text- and Image-Conditional Image Synthesis
in Quantized Latent Spaces
- URL: http://arxiv.org/abs/2211.07292v2
- Date: Tue, 23 May 2023 16:33:55 GMT
- Title: A Novel Sampling Scheme for Text- and Image-Conditional Image Synthesis
in Quantized Latent Spaces
- Authors: Dominic Rampas, Pablo Pernias, and Marc Aubreville
- Abstract summary: We propose a streamlined approach for text-to-image generation, which encompasses both the training paradigm and the sampling process.
Despite its remarkable simplicity, our method yields aesthetically pleasing images with few sampling iterations.
To demonstrate the efficacy of this approach in achieving outcomes comparable to existing works, we have trained a one-billion parameter text-conditional model.
- Score: 0.7340845393655052
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advancements in the domain of text-to-image synthesis have culminated
in a multitude of enhancements pertaining to quality, fidelity, and diversity.
Contemporary techniques enable the generation of highly intricate visuals which
rapidly approach near-photorealistic quality. Nevertheless, as progress is
achieved, the complexity of these methodologies increases, consequently
intensifying the comprehension barrier between individuals within the field and
those external to it.
In an endeavor to mitigate this disparity, we propose a streamlined approach
for text-to-image generation, which encompasses both the training paradigm and
the sampling process. Despite its remarkable simplicity, our method yields
aesthetically pleasing images with few sampling iterations, allows for
intriguing ways for conditioning the model, and imparts advantages absent in
state-of-the-art techniques. To demonstrate the efficacy of this approach in
achieving outcomes comparable to existing works, we have trained a one-billion
parameter text-conditional model, which we refer to as "Paella". In the
interest of fostering future exploration in this field, we have made our source
code and models publicly accessible for the research community.
Related papers
- Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (intextbfPainting vtextbfIa textbfLatent textbfOptextbfTimization) is an optimization approach grounded on a novel textitsemantic centralization and textitbackground preservation loss.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z) - Rethinking Image Skip Connections in StyleGAN2 [5.929956715430167]
StyleGAN models have gained significant traction in the field of image synthesis.
The adoption of image skip connection is favored over the traditional residual connection.
We introduce the image squeeze connection, which significantly improves the quality of image synthesis.
arXiv Detail & Related papers (2024-07-08T00:21:17Z) - Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization [62.157627519792946]
We introduce a novel framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability.
We propose dataset style inversion strategy to improve the stylistic alignment between synthetic and real images.
Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements.
arXiv Detail & Related papers (2024-03-28T22:25:05Z) - ViewFusion: Towards Multi-View Consistency via Interpolated Denoising [48.02829400913904]
We introduce ViewFusion, a training-free algorithm that can be seamlessly integrated into existing pre-trained diffusion models.
Our approach adopts an auto-regressive method that implicitly leverages previously generated views as context for the next view generation.
Our framework successfully extends single-view conditioned models to work in multiple-view conditional settings without any additional fine-tuning.
arXiv Detail & Related papers (2024-02-29T04:21:38Z) - Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis [60.260724486834164]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries.
We present two key innovations: Vision Guidance and the Layered Rendering Diffusion framework.
We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.
arXiv Detail & Related papers (2023-11-30T10:36:19Z) - IRGen: Generative Modeling for Image Retrieval [82.62022344988993]
In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling.
We develop our model, dubbed IRGen, to address the technical challenge of converting an image into a concise sequence of semantic units.
Our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks and two million-scale datasets.
arXiv Detail & Related papers (2023-03-17T17:07:36Z) - HORIZON: High-Resolution Semantically Controlled Panorama Synthesis [105.55531244750019]
Panorama synthesis endeavors to craft captivating 360-degree visual landscapes, immersing users in the heart of virtual worlds.
Recent breakthroughs in visual synthesis have unlocked the potential for semantic control in 2D flat images, but a direct application of these methods to panorama synthesis yields distorted content.
We unveil an innovative framework for generating high-resolution panoramas, adeptly addressing the issues of spherical distortion and edge discontinuity through sophisticated spherical modeling.
arXiv Detail & Related papers (2022-10-10T09:43:26Z) - Retrieval-Augmented Diffusion Models [11.278903078792917]
We propose to complement the diffusion model with a retrieval-based approach and to introduce an explicit memory in the form of an external database.
By leveraging CLIP's joint image-text embedding space, our model achieves highly competitive performance on tasks for which it has not been explicitly trained.
Our approach incurs low computational and memory overheads and is easy to implement.
arXiv Detail & Related papers (2022-04-25T17:55:26Z) - Adversarial Text-to-Image Synthesis: A Review [7.593633267653624]
We contextualize the state of the art of adversarial text-to-image synthesis models, their development since their inception five years ago, and propose a taxonomy based on the level of supervision.
We critically examine current strategies to evaluate text-to-image synthesis models, highlight shortcomings, and identify new areas of research, ranging from the development of better datasets and evaluation metrics to possible improvements in architectural design and model training.
This review complements previous surveys on generative adversarial networks with a focus on text-to-image synthesis which we believe will help researchers to further advance the field.
arXiv Detail & Related papers (2021-01-25T09:58:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.