Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer
- URL: http://arxiv.org/abs/2206.04452v1
- Date: Thu, 9 Jun 2022 12:25:24 GMT
- Title: Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer
- Authors: Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, Wook-Shin Han
- Abstract summary: We propose Draft-and-Revise with Contextual RQ-Transformer, an effective image generation framework that considers global contexts during the generation process.
In experiments, our method achieves state-of-the-art results on conditional image generation.
- Score: 40.04085054791994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although autoregressive models have achieved promising results on image
generation, their unidirectional generation process prevents the resultant
images from fully reflecting global contexts. To address this issue, we propose
Draft-and-Revise with Contextual RQ-Transformer, an effective image generation
framework that considers global contexts during the generation process. As a
generalized VQ-VAE, RQ-VAE first represents a high-resolution image as a
sequence of discrete code stacks. After code stacks in the sequence are
randomly masked, Contextual RQ-Transformer is trained to infill the masked code
stacks based on the unmasked contexts of the image. Then, Contextual
RQ-Transformer uses our two-phase decoding, Draft-and-Revise, and generates an
image, while exploiting the global contexts of the image during the generation
process. Specifically, in the draft phase, our model first focuses on
generating diverse images, albeit of relatively low quality. Then, in the revise
phase, the model iteratively improves the quality of images, while preserving
the global contexts of generated images. In experiments, our method achieves
state-of-the-art results on conditional image generation. We also validate that
the Draft-and-Revise decoding can achieve high performance by effectively
controlling the quality-diversity trade-off in image generation.
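
To make the two-phase decoding concrete, below is a minimal PyTorch sketch of Draft-and-Revise over a sequence of depth-D code stacks. It is an illustration under assumed interfaces, not the paper's actual API: `model`, `rq_vae.decode`, `mask_id`, and all hyperparameter values are hypothetical.

    import torch

    def draft_and_revise(model, rq_vae, cond, seq_len=256, depth=4,
                         mask_id=0, draft_steps=8, revise_epochs=2,
                         num_groups=4, draft_temp=1.0, revise_temp=0.7):
        # Start from a fully masked sequence: one depth-D code stack per
        # spatial position (the RQ-VAE's discrete image representation).
        codes = torch.full((1, seq_len, depth), mask_id, dtype=torch.long)

        def sample_codes(context, temp):
            # model(context, cond) is assumed to return per-position,
            # per-depth logits of shape (1, seq_len, depth, vocab_size).
            logits = model(context, cond)
            probs = torch.softmax(logits / temp, dim=-1)
            return torch.distributions.Categorical(probs=probs).sample()

        # Draft phase: infill all positions in a few coarse steps,
        # committing one random group of code stacks per step; a higher
        # temperature here favors diversity over quality.
        for group in torch.randperm(seq_len).chunk(draft_steps):
            codes[:, group] = sample_codes(codes, draft_temp)[:, group]

        # Revise phase: repeatedly re-mask one group at a time and
        # re-predict it from all other, now-complete positions, improving
        # quality while preserving the drafted global context.
        for _ in range(revise_epochs):
            for group in torch.randperm(seq_len).chunk(num_groups):
                context = codes.clone()
                context[:, group] = mask_id
                codes[:, group] = sample_codes(context, revise_temp)[:, group]

        # Map the finished code stacks back to pixels (assumed decoder API).
        return rq_vae.decode(codes)

In this sketch, the sampling temperature serves as the quality-diversity knob: a higher temperature in the draft phase yields diverse global layouts, while a lower temperature during revision sharpens detail without discarding the drafted global context.
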
Related papers
- Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis [7.099258248662009]
Text-to-image (T2I) models have significantly advanced the development of artificial intelligence.
However, existing T2I-based methods often struggle to accurately reproduce the appearance of individuals from a reference image.
We leverage the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process.
arXiv Detail & Related papers (2024-09-27T19:31:04Z)
- Masked Generative Story Transformer with Character Guidance and Caption Augmentation [2.1392064955842023]
Story visualization is a challenging generative vision task that requires both visual quality and consistency between different frames in generated image sequences.
Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately.
We propose a completely parallel transformer-based approach, relying on Cross-Attention with past and future captions to achieve consistency.
arXiv Detail & Related papers (2024-03-13T13:10:20Z)
- SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation [39.84456803546365]
SSR-Encoder is a novel architecture designed for selectively capturing any subject from single or multiple reference images.
It responds to various query modalities including text and masks, without necessitating test-time fine-tuning.
Characterized by its model generalizability and efficiency, the SSR-Encoder adapts to a range of custom models and control modules.
arXiv Detail & Related papers (2023-12-26T14:39:11Z)
- Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models [62.603753097900466]
We present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors.
Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder.
Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts.
arXiv Detail & Related papers (2023-06-16T14:30:41Z)
- Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
- Masked and Adaptive Transformer for Exemplar Based Image Translation [16.93344592811513]
Cross-domain semantic matching is challenging.
We propose a masked and adaptive transformer (MAT) for learning accurate cross-domain correspondence.
We devise a novel contrastive style learning method to acquire quality-discriminative style representations.
arXiv Detail & Related papers (2023-03-30T03:21:14Z)
- Progressive Text-to-Image Generation [40.09326229583334]
We present a progressive model for high-fidelity text-to-image generation.
The proposed method takes effect by creating new image tokens from coarse to fine based on the existing context.
The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
arXiv Detail & Related papers (2022-10-05T14:27:20Z)
- High-Quality Pluralistic Image Completion via Code Shared VQGAN [51.7805154545948]
We present a novel framework for pluralistic image completion that can achieve both high quality and diversity at much faster inference speed.
Our framework is able to learn semantically-rich discrete codes efficiently and robustly, resulting in much better image reconstruction quality.
arXiv Detail & Related papers (2022-04-05T01:47:35Z)
- MaskGIT: Masked Generative Image Transformer [49.074967597485475]
MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
arXiv Detail & Related papers (2022-02-08T23:54:06Z)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.