Related papers: Controllable Image Generation With Composed Parallel Token Prediction

URL: http://arxiv.org/abs/2405.06535v1
Date: Fri, 10 May 2024 15:27:35 GMT
Title: Controllable Image Generation With Composed Parallel Token Prediction
Authors: Jamie Stirling, Noura Al-Moubayed
Abstract summary: Compositional image generation requires models to generalise well in situations where two or more input concepts do not necessarily appear together in training. We propose a formulation for controllable conditional generation of images via composing the log-probability outputs of discrete generative models of the latent space.
Score: 5.107886283951882
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Compositional image generation requires models to generalise well in situations where two or more input concepts do not necessarily appear together in training (compositional generalisation). Despite recent progress in compositional image generation via composing continuous sampling processes such as diffusion and energy-based models, composing discrete generative processes has remained an open challenge, with the promise of providing improvements in efficiency, interpretability and simplicity. To this end, we propose a formulation for controllable conditional generation of images via composing the log-probability outputs of discrete generative models of the latent space. Our approach, when applied alongside VQ-VAE and VQ-GAN, achieves state-of-the-art generation accuracy in three distinct settings (FFHQ, Positional CLEVR and Relational CLEVR) while attaining competitive Fréchet Inception Distance (FID) scores. Our method attains an average generation accuracy of $80.71\%$ across the studied settings. Our method also outperforms the next-best approach (ranked by accuracy) in terms of FID in seven out of nine experiments, with an average FID of $24.23$ (an average improvement of $-9.58$). Furthermore, our method offers a $2.3\times$ to $12\times$ speedup over comparable continuous compositional methods on our hardware. We find that our method can generalise to combinations of input conditions that lie outside the training data (e.g. more objects per image) in addition to offering an interpretable dimension of controllability via concept weighting. We further demonstrate that our approach can be readily applied to an open pre-trained discrete text-to-image model without any fine-tuning, allowing for fine-grained control of text-to-image generation.
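
Illustrative sketch (not taken from the paper): the abstract mentions composing log-probability outputs of discrete models and weighting concepts. One way this idea could look is a product-of-experts style combination over per-token logits, shown below with NumPy. The function names, the array shapes, and the specific rule (unconditional log-probabilities plus weighted conditional-minus-unconditional terms) are assumptions made for illustration and may differ from the authors' exact formulation.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last (codebook) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def compose_log_probs(uncond_logits, cond_logits, weights):
    # Hypothetical composition rule (an assumption, not the paper's code):
    #   log p(x | c_1..c_n) = log p(x) + sum_i w_i * (log p(x | c_i) - log p(x)) + const
    # uncond_logits: array of shape (num_tokens, codebook_size)
    # cond_logits:   list of arrays with the same shape, one per condition
    # weights:       list of floats, one concept weight per condition
    log_p_uncond = log_softmax(uncond_logits)
    composed = log_p_uncond.copy()
    for logits, w in zip(cond_logits, weights):
        composed = composed + w * (log_softmax(logits) - log_p_uncond)
    # Renormalise so each token position is a valid distribution again.
    return log_softmax(composed)

# Toy usage: 3 latent token positions, codebook of 4 entries, 2 conditions.
rng = np.random.default_rng(0)
uncond = rng.normal(size=(3, 4))
conds = [rng.normal(size=(3, 4)), rng.normal(size=(3, 4))]
composed = compose_log_probs(uncond, conds, weights=[1.0, 2.0])
tokens = composed.argmax(axis=-1)  # greedy pick per position, in parallel
```

In this sketch a weight of 0 ignores a condition and a larger weight emphasises it, which is one possible reading of the "concept weighting" control described in the abstract.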

Related papers

Cost-Aware Routing for Efficient Text-To-Image Generation [19.848723289971208]
We propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone.
arXiv Detail & Related papers (2025-06-17T17:48:50Z)
D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens [80.75893450536577]
We propose D2C, a novel two-stage method to enhance model generation capacity. In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by employing a small discrete-valued generator. In the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence.
arXiv Detail & Related papers (2025-03-21T13:58:49Z)
Context-guided Responsible Data Augmentation with Diffusion Models [29.41191005466334]
We propose a text-to-image (T2I) data augmentation method, named DiffCoRe-Mix, that computes a set of generative counterparts for a training sample. To preserve key semantic axes, we also filter out undesired generative samples in our augmentation process. We extensively evaluate our technique on ImageNet-1K, Tiny ImageNet-200, CIFAR-100, Flowers102, CUB-Birds, Stanford Cars, and Caltech datasets.
arXiv Detail & Related papers (2025-03-12T00:12:27Z)
E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling [17.62612090885471]
ECAR (Efficient Continuous Auto-Regressive Image Generation via Multistage Modeling) is presented. It operates by generating tokens at increasing resolutions while simultaneously denoising the image at each stage. ECAR achieves image quality comparable to DiT (Peebles & Xie, 2023) while requiring $10\times$ fewer FLOPs and achieving a $5\times$ speedup to generate a $256\times256$ image.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)
Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models. We propose an algorithm for fast constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
A Simple Approach to Unifying Diffusion-based Conditional Generation [63.389616350290595]
We introduce a simple, unified framework to handle diverse conditional generation tasks. Our approach enables versatile capabilities via different inference-time sampling schemes. Our model supports additional capabilities like non-spatially aligned and coarse conditioning.
arXiv Detail & Related papers (2024-10-15T09:41:43Z)
OmniControlNet: Dual-stage Integration for Conditional Image Generation [61.1432268643639]
We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method. Our proposed OmniControlNet consolidates 1) the condition generation by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance.
arXiv Detail & Related papers (2024-06-09T18:03:47Z)
Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion [35.21106030549071]
Diffusion Probabilistic Models (DPMs) are a dominant force in text-to-image generation tasks. We propose an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment.
arXiv Detail & Related papers (2024-02-26T05:08:40Z)
Image Inpainting via Tractable Steering of Diffusion Models [54.13818673257381]
This paper proposes to exploit the ability of Tractable Probabilistic Models (TPMs) to exactly and efficiently compute the constrained posterior. Specifically, this paper adopts a class of expressive TPMs termed Probabilistic Circuits (PCs). We show that our approach can consistently improve the overall quality and semantic coherence of inpainted images with only 10% additional computational overhead.
arXiv Detail & Related papers (2023-11-28T21:14:02Z)
AdaDiff: Adaptive Step Selection for Fast Diffusion [88.8198344514677]
We introduce AdaDiff, a framework designed to learn instance-specific step usage policies. AdaDiff is optimized using a policy gradient method to maximize a carefully designed reward function. Our approach achieves similar results in terms of visual quality compared to the baseline using a fixed 50 denoising steps.
arXiv Detail & Related papers (2023-11-24T11:20:38Z)
CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation [49.3016007471979]
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks. However, their widespread adoption is hindered by the high computational cost, which limits their real-time application. We introduce a novel method, dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs.
arXiv Detail & Related papers (2023-10-02T17:59:18Z)
Flow Matching in Latent Space [2.9330609943398525]
Flow matching is a framework to train generative models that exhibits impressive empirical performance. We propose to apply flow matching in the latent spaces of pretrained autoencoders, which offers improved computational efficiency. Our work stands as a pioneering contribution in the integration of various conditions into flow matching for conditional generation tasks.
arXiv Detail & Related papers (2023-07-17T17:57:56Z)
Optimized latent-code selection for explainable conditional text-to-image GANs [8.26410341981427]
We present a variety of techniques to take a deep look into the latent space and semantic space of conditional text-to-image GAN models. We propose a framework for finding good latent codes by utilizing a linear SVM.
arXiv Detail & Related papers (2022-04-27T03:12:55Z)
Controllable and Compositional Generation with Latent-Space Energy-Based Models [60.87740144816278]
Controllable generation is one of the key requirements for successful adoption of deep generative models in real-world applications. In this work, we use energy-based models (EBMs) to handle compositional generation over a set of attributes. By composing energy functions with logical operators, this work is the first to achieve such compositionality in generating photo-realistic images of resolution 1024x1024.
arXiv Detail & Related papers (2021-10-21T03:31:45Z)
IMAGINE: Image Synthesis by Image-Guided Model Inversion [79.4691654458141]
We introduce an inversion based method, denoted as IMAge-Guided model INvErsion (IMAGINE), to generate high-quality and diverse images. We leverage the knowledge of image semantics from a pre-trained classifier to achieve plausible generations. IMAGINE enables the synthesis procedure to simultaneously 1) enforce semantic specificity constraints during the synthesis, 2) produce realistic images without generator training, and 3) give users intuitive control over the generation process.
arXiv Detail & Related papers (2021-04-13T02:00:24Z)
Training End-to-end Single Image Generators without GANs [27.393821783237186]
AugurOne is a novel approach for training single image generative models. Our approach trains an upscaling neural network using non-affine augmentations of the (single) input image. A compact latent space is jointly learned allowing for controlled image synthesis.
arXiv Detail & Related papers (2020-04-07T17:58:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.