CogView2: Faster and Better Text-to-Image Generation via Hierarchical
Transformers
- URL: http://arxiv.org/abs/2204.14217v1
- Date: Thu, 28 Apr 2022 15:51:11 GMT
- Title: CogView2: Faster and Better Text-to-Image Generation via Hierarchical
Transformers
- Authors: Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang
- Abstract summary: A new text-to-image system, CogView2, shows very competitive generation compared to concurrent state-of-the-art DALL-E-2.
- Score: 17.757983821569994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The development of transformer-based text-to-image models is impeded by
slow generation and the complexity of high-resolution images. In this work, we
put forward a solution based on hierarchical transformers and local parallel
auto-regressive generation. We pretrain a 6B-parameter transformer with a
simple and flexible self-supervised task, the Cross-modal general language model
(CogLM), and finetune it for fast super-resolution. The new text-to-image
system, CogView2, shows very competitive generation compared with the concurrent
state-of-the-art DALL-E-2, and naturally supports interactive text-guided
editing of images.
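The speed-up claimed in the abstract comes from replacing token-by-token decoding with parallel prediction inside local windows during the super-resolution stage. The following minimal sketch is not the authors' implementation (which finetunes the 6B CogLM); it assumes a toy 16x16 token grid, 4x4 windows, and a random stand-in model, purely to illustrate why the number of sequential forward passes scales with the number of refinement steps rather than the number of tokens.

```python
import numpy as np

# Illustrative sketch of local parallel auto-regressive decoding (hypothetical
# sizes and a stand-in model; not the CogView2 code). Each step runs ONE model
# forward pass and fills the same subset of positions in EVERY local window in
# parallel, conditioned on the tokens revealed so far.

GRID = 16     # hypothetical 16x16 grid of high-resolution image tokens
WINDOW = 4    # hypothetical 4x4 local window
STEPS = 4     # sequential refinement steps per window
MASK = -1     # placeholder id for a not-yet-generated token
VOCAB = 8192  # hypothetical image-token vocabulary size

def fake_transformer(tokens):
    """Stand-in for the super-resolution transformer: returns a random token for
    every position. A real model would condition on the text prompt and the
    low-resolution tokens produced by the first-stage generator."""
    rng = np.random.default_rng(0)
    return rng.integers(0, VOCAB, size=tokens.shape)

def local_parallel_decode(tokens):
    # Fixed fill-in order within a window, shared by all windows
    # (raster order here; any fixed order would do for the sketch).
    order = np.arange(WINDOW * WINDOW)
    per_step = len(order) // STEPS
    for step in range(STEPS):
        proposals = fake_transformer(tokens)          # one forward pass per step
        sel = order[step * per_step:(step + 1) * per_step]
        for r0 in range(0, GRID, WINDOW):
            for c0 in range(0, GRID, WINDOW):         # every window filled in parallel
                for flat in sel:
                    r, c = divmod(flat, WINDOW)
                    tokens[r0 + r, c0 + c] = proposals[r0 + r, c0 + c]
    return tokens

grid = np.full((GRID, GRID), MASK)
out = local_parallel_decode(grid)
assert (out != MASK).all()
print("sequential model calls:", STEPS, "vs", GRID * GRID, "for token-by-token decoding")
```

Under these toy assumptions, 4 sequential model calls fill all 256 token positions, whereas plain autoregressive decoding would need 256 calls; the windowed structure is what keeps each parallel prediction locally conditioned.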
Related papers
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z) - Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding [111.16221796950126]
We propose Lformer, a semi-autoregressive text-to-image generation model.
By leveraging the 2D structure of image tokens, Lformer achieves faster speed than the existing transformer-based methods.
Lformer can also edit images without requiring finetuning.
arXiv Detail & Related papers (2023-03-07T11:10:22Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - Draft-and-Revise: Effective Image Generation with Contextual
RQ-Transformer [40.04085054791994]
We propose an effective image generation framework of Draft-and-Revise with a Contextual RQ-Transformer to consider global contexts during the generation process.
In experiments, our method achieves state-of-the-art results on conditional image generation.
arXiv Detail & Related papers (2022-06-09T12:25:24Z) - ERNIE-ViLG: Unified Generative Pre-training for Bidirectional
Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z) - Unifying Multimodal Transformer for Bi-directional Image and Text
Generation [8.547205551848462]
We study the joint learning of image-to-text and text-to-image generation, which are naturally bi-directional tasks.
We propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks.
arXiv Detail & Related papers (2021-10-19T06:01:24Z) - Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z) - Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named the Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z) - CogView: Mastering Text-to-Image Generation via Transformers [51.91562870331348]
We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem.
We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design.
CogView achieves a new state-of-the-art FID on blurred MS COCO, outperforming previous GAN-based models and a recent similar work, DALL-E.
arXiv Detail & Related papers (2021-05-26T16:52:53Z) - Diverse Image Inpainting with Bidirectional and Autoregressive
Transformers [55.21000775547243]
We propose BAT-Fill, an image inpainting framework with a novel bidirectional autoregressive transformer (BAT).
BAT-Fill inherits the merits of transformers and CNNs in a two-stage manner, which allows it to generate high-resolution content without being constrained by the quadratic complexity of attention in transformers.
arXiv Detail & Related papers (2021-04-26T03:52:27Z)