Controllable Image Generation With Composed Parallel Token Prediction
- URL: http://arxiv.org/abs/2405.06535v1
- Date: Fri, 10 May 2024 15:27:35 GMT
- Title: Controllable Image Generation With Composed Parallel Token Prediction
- Authors: Jamie Stirling, Noura Al-Moubayed,
- Abstract summary: compositional image generation requires models to generalise well in situations where two or more input concepts do not necessarily appear together in training.
We propose a formulation for controllable conditional generation of images via composing the log-probability outputs of discrete generative models of the latent space.
- Score: 5.107886283951882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compositional image generation requires models to generalise well in situations where two or more input concepts do not necessarily appear together in training (compositional generalisation). Despite recent progress in compositional image generation via composing continuous sampling processes such as diffusion and energy-based models, composing discrete generative processes has remained an open challenge, with the promise of providing improvements in efficiency, interpretability and simplicity. To this end, we propose a formulation for controllable conditional generation of images via composing the log-probability outputs of discrete generative models of the latent space. Our approach, when applied alongside VQ-VAE and VQ-GAN, achieves state-of-the-art generation accuracy in three distinct settings (FFHQ, Positional CLEVR and Relational CLEVR) while attaining competitive Fr\'echet Inception Distance (FID) scores. Our method attains an average generation accuracy of $80.71\%$ across the studied settings. Our method also outperforms the next-best approach (ranked by accuracy) in terms of FID in seven out of nine experiments, with an average FID of $24.23$ (an average improvement of $-9.58$). Furthermore, our method offers a $2.3\times$ to $12\times$ speedup over comparable continuous compositional methods on our hardware. We find that our method can generalise to combinations of input conditions that lie outside the training data (e.g. more objects per image) in addition to offering an interpretable dimension of controllability via concept weighting. We further demonstrate that our approach can be readily applied to an open pre-trained discrete text-to-image model without any fine-tuning, allowing for fine-grained control of text-to-image generation.
Related papers
- E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling [17.62612090885471]
ECAR (Efficient Continuous Auto-Regressive Image Generation via Multistage Modeling) is presented.
It operates by generating tokens at increasing resolutions while simultaneously denoising the image at each stage.
ECAR achieves comparable image quality to DiT Peebles & Xie [2023] while requiring 10$times$ FLOPs reduction and 5$times$ speedup to generate a 256$times $256 image.
arXiv Detail & Related papers (2024-12-18T18:59:53Z) - Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models.
We propose an algorithm for fast-constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z) - A Simple Approach to Unifying Diffusion-based Conditional Generation [63.389616350290595]
We introduce a simple, unified framework to handle diverse conditional generation tasks.
Our approach enables versatile capabilities via different inference-time sampling schemes.
Our model supports additional capabilities like non-spatially aligned and coarse conditioning.
arXiv Detail & Related papers (2024-10-15T09:41:43Z) - Referee Can Play: An Alternative Approach to Conditional Generation via
Model Inversion [35.21106030549071]
Diffusion Probabilistic Models (DPMs) are dominant force in text-to-image generation tasks.
We propose an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs)
By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment.
arXiv Detail & Related papers (2024-02-26T05:08:40Z) - AdaDiff: Adaptive Step Selection for Fast Diffusion Models [82.78899138400435]
We introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies.
AdaDiff is optimized using a policy method to maximize a carefully designed reward function.
We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves similar visual quality compared to the baseline.
arXiv Detail & Related papers (2023-11-24T11:20:38Z) - CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster
Image Generation [49.3016007471979]
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks.
However, their widespread adoption is hindered by the high computational cost, which limits their real-time application.
We introduce a novel method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs.
arXiv Detail & Related papers (2023-10-02T17:59:18Z) - Flow Matching in Latent Space [2.9330609943398525]
Flow matching is a framework to train generative models that exhibits impressive empirical performance.
We propose to apply flow matching in the latent spaces of pretrained autoencoders, which offers improved computational efficiency.
Our work stands as a pioneering contribution in the integration of various conditions into flow matching for conditional generation tasks.
arXiv Detail & Related papers (2023-07-17T17:57:56Z) - Controllable and Compositional Generation with Latent-Space Energy-Based
Models [60.87740144816278]
Controllable generation is one of the key requirements for successful adoption of deep generative models in real-world applications.
In this work, we use energy-based models (EBMs) to handle compositional generation over a set of attributes.
By composing energy functions with logical operators, this work is the first to achieve such compositionality in generating photo-realistic images of resolution 1024x1024.
arXiv Detail & Related papers (2021-10-21T03:31:45Z) - IMAGINE: Image Synthesis by Image-Guided Model Inversion [79.4691654458141]
We introduce an inversion based method, denoted as IMAge-Guided model INvErsion (IMAGINE), to generate high-quality and diverse images.
We leverage the knowledge of image semantics from a pre-trained classifier to achieve plausible generations.
IMAGINE enables the synthesis procedure to simultaneously 1) enforce semantic specificity constraints during the synthesis, 2) produce realistic images without generator training, and 3) give users intuitive control over the generation process.
arXiv Detail & Related papers (2021-04-13T02:00:24Z) - Training End-to-end Single Image Generators without GANs [27.393821783237186]
AugurOne is a novel approach for training single image generative models.
Our approach trains an upscaling neural network using non-affine augmentations of the (single) input image.
A compact latent space is jointly learned allowing for controlled image synthesis.
arXiv Detail & Related papers (2020-04-07T17:58:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.