Draft-and-Revise: Effective Image Generation with Contextual
RQ-Transformer
- URL: http://arxiv.org/abs/2206.04452v1
- Date: Thu, 9 Jun 2022 12:25:24 GMT
- Title: Draft-and-Revise: Effective Image Generation with Contextual
RQ-Transformer
- Authors: Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, Wook-Shin Han
- Abstract summary: We propose an effective image generation framework of Draft-and-Revise with Contextual RQ-Transformer to consider global contexts during the generation process.
In experiments, our method achieves state-of-the-art results on conditional image generation.
- Score: 40.04085054791994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although autoregressive models have achieved promising results on image
generation, their unidirectional generation process prevents the resultant
images from fully reflecting global contexts. To address the issue, we propose
an effective image generation framework of Draft-and-Revise with Contextual
RQ-Transformer to consider global contexts during the generation process. As a
generalized VQ-VAE, RQ-VAE first represents a high-resolution image as a
sequence of discrete code stacks. After code stacks in the sequence are
randomly masked, Contextual RQ-Transformer is trained to infill the masked code
stacks based on the unmasked contexts of the image. Then, Contextual
RQ-Transformer uses our two-phase decoding, Draft-and-Revise, and generates an
image, while exploiting the global contexts of the image during the generation
process. Specifically, in the draft phase, our model first focuses on generating diverse images, albeit of rather low quality. Then, in the revise
phase, the model iteratively improves the quality of images, while preserving
the global contexts of generated images. In experiments, our method achieves
state-of-the-art results on conditional image generation. We also validate that
the Draft-and-Revise decoding can achieve high performance by effectively
controlling the quality-diversity trade-off in image generation.
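To make the first stage concrete, the sketch below illustrates residual quantization, the operation by which RQ-VAE turns one feature vector into a depth-D code stack: each level codes whatever the previous levels left unexplained, so the reconstruction error shrinks with depth. This is an illustrative sketch only, not the paper's trained model; the shared codebook, the dimensions, and the depth are placeholder assumptions.

```python
import numpy as np

def residual_quantize(feature, codebook, depth):
    """Quantize one feature vector into a depth-`depth` code stack by
    repeatedly coding the residual left over by the previous levels."""
    codes, recon = [], np.zeros_like(feature)
    for _ in range(depth):
        residual = feature - recon
        # Nearest codebook entry to the current residual.
        idx = int(np.argmin(((codebook - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        recon = recon + codebook[idx]  # running reconstruction
    return codes, recon

# Toy usage: 256 shared codes of dimension 8, one depth-4 stack.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 8))
feature = rng.normal(size=8)
codes, recon = residual_quantize(feature, codebook, depth=4)
print(codes, float(np.linalg.norm(feature - recon)))
```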
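The two-phase decoding can likewise be sketched end to end. Since training amounts to infilling randomly masked code stacks from the unmasked context, decoding reduces to scheduled calls of the infiller: the draft phase reveals all positions over a few high-temperature iterations for diversity, and the revise phase re-masks and re-predicts subsets at a lower temperature while the surrounding codes pin down the global context. Everything below is assumed for illustration: `infill_logits` is a random-logit stub standing in for the trained Contextual RQ-Transformer, and the schedule, block size, and temperatures (`draft_iters`, `revise_rounds`, `block`, `t_draft`, `t_revise`) are hypothetical, not the paper's settings.

```python
import numpy as np

T, D, V = 64, 4, 1024  # positions, stack depth, codebook size (assumed)
MASK = -1              # sentinel index for a masked code stack

def infill_logits(codes: np.ndarray) -> np.ndarray:
    """Stand-in for the Contextual RQ-Transformer: given a (T, D) grid of
    code indices (MASK where unknown), return (T, D, V) logits for every
    position. Random logits keep the sketch runnable end to end."""
    return np.random.randn(T, D, V)

def sample(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Temperature sampling over the last (vocabulary) axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p = p / p.sum(axis=-1, keepdims=True)
    flat = p.reshape(-1, V)
    picks = np.array([np.random.choice(V, p=row) for row in flat])
    return picks.reshape(logits.shape[:-1])

def draft_and_revise(draft_iters=4, revise_rounds=2, block=16,
                     t_draft=1.2, t_revise=0.7) -> np.ndarray:
    codes = np.full((T, D), MASK)
    # Draft phase: reveal all positions over a few iterations at a high
    # temperature, favoring diversity over per-sample quality.
    for i in range(draft_iters):
        masked = np.where(codes[:, 0] == MASK)[0]  # whole stacks are masked
        n = int(np.ceil(len(masked) / (draft_iters - i)))
        chosen = np.random.choice(masked, size=n, replace=False)
        codes[chosen] = sample(infill_logits(codes)[chosen], t_draft)
    # Revise phase: re-mask one block at a time and re-predict it from the
    # rest, improving quality while the unmasked majority preserves the
    # draft's global context.
    for _ in range(revise_rounds):
        for start in range(0, T, block):
            span = np.arange(start, min(start + block, T))
            codes[span] = MASK
            codes[span] = sample(infill_logits(codes)[span], t_revise)
    return codes

print(draft_and_revise().shape)  # (64, 4)
```

Raising `t_draft` or lowering `t_revise` moves the output along the quality-diversity trade-off that the paper's decoding is designed to control.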
Related papers
- QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation [101.28446308930367]
Quantized Language-Image Pretraining (QLIP) combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding.
QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives.
We demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
arXiv Detail & Related papers (2025-02-07T18:59:57Z)
- UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation [64.8341372591993]
We propose a new approach to unify controllable generation within a single framework.
Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture.
Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions.
arXiv Detail & Related papers (2024-12-25T15:19:02Z)
- Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis [7.099258248662009]
Text-to-image (T2I) models have significantly advanced the development of artificial intelligence.
However, existing T2I-based methods often struggle to accurately reproduce the appearance of individuals from a reference image.
We leverage the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process.
arXiv Detail & Related papers (2024-09-27T19:31:04Z)
- Masked Generative Story Transformer with Character Guidance and Caption Augmentation [2.1392064955842023]
Story visualization is a challenging generative vision task that requires both visual quality and consistency between different frames in generated image sequences.
Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately.
We propose a completely parallel transformer-based approach, relying on Cross-Attention with past and future captions to achieve consistency.
arXiv Detail & Related papers (2024-03-13T13:10:20Z)
- SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation [39.84456803546365]
SSR-Encoder is a novel architecture designed for selectively capturing any subject from single or multiple reference images.
It responds to various query modalities including text and masks, without necessitating test-time fine-tuning.
Characterized by its model generalizability and efficiency, the SSR-Encoder adapts to a range of custom models and control modules.
arXiv Detail & Related papers (2023-12-26T14:39:11Z)
- Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
- Masked and Adaptive Transformer for Exemplar Based Image Translation [16.93344592811513]
Cross-domain semantic matching is challenging.
We propose a masked and adaptive transformer (MAT) for learning accurate cross-domain correspondence.
We devise a novel contrastive style learning method to acquire quality-discriminative style representations.
arXiv Detail & Related papers (2023-03-30T03:21:14Z)
- Progressive Text-to-Image Generation [40.09326229583334]
We present a progressive model for high-fidelity text-to-image generation.
The proposed method works by creating new image tokens from coarse to fine based on the existing context.
The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
arXiv Detail & Related papers (2022-10-05T14:27:20Z)
- High-Quality Pluralistic Image Completion via Code Shared VQGAN [51.7805154545948]
We present a novel framework for pluralistic image completion that can achieve both high quality and diversity at much faster inference speed.
Our framework is able to learn semantically-rich discrete codes efficiently and robustly, resulting in much better image reconstruction quality.
arXiv Detail & Related papers (2022-04-05T01:47:35Z)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.