CogView: Mastering Text-to-Image Generation via Transformers
- URL: http://arxiv.org/abs/2105.13290v1
- Date: Wed, 26 May 2021 16:52:53 GMT
- Title: CogView: Mastering Text-to-Image Generation via Transformers
- Authors: Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin,
Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang
- Abstract summary: We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem.
We also demonstrate finetuning strategies for various downstream tasks, e.g., style learning, super-resolution, text-image ranking, and fashion design.
CogView achieves a new state-of-the-art FID on blurred MS COCO, outperforming previous GAN-based models and the recent similar work DALL-E.
- Score: 51.91562870331348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g., style learning, super-resolution, text-image ranking, and fashion design, as well as methods to stabilize pretraining, e.g., eliminating NaN losses. CogView (zero-shot) achieves a new state-of-the-art FID on blurred MS COCO, outperforming previous GAN-based models and the recent similar work DALL-E.
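As a concrete illustration, below is a minimal PyTorch sketch of the framing the abstract describes: a single autoregressive Transformer trained over the concatenation of text tokens and discrete VQ-VAE image codes. The vocabulary sizes, model dimensions, shared embedding table, and class names are illustrative assumptions, not CogView's actual configuration, and the VQ-VAE tokenizer itself is stubbed out with random codes.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 50_000   # illustrative text vocabulary size
IMAGE_VOCAB = 8_192   # illustrative VQ-VAE codebook size
VOCAB = TEXT_VOCAB + IMAGE_VOCAB
D_MODEL = 512

class TinyCogViewLM(nn.Module):
    """Decoder-only Transformer over the joint [text; image] token stream."""
    def __init__(self):
        super().__init__()
        # One shared table; image codes are offset past the text vocabulary.
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        n = tokens.size(1)
        # Causal mask: position i may only attend to positions <= i.
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        return self.head(self.blocks(self.embed(tokens), mask=causal))

# One training step: next-token prediction over the whole joint sequence.
text = torch.randint(0, TEXT_VOCAB, (2, 16))                 # tokenized captions
image = torch.randint(0, IMAGE_VOCAB, (2, 64)) + TEXT_VOCAB  # offset VQ-VAE codes
tokens = torch.cat([text, image], dim=1)
logits = TinyCogViewLM()(tokens)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
loss.backward()
```

At sampling time, the model would be conditioned on the text tokens alone, image codes decoded one at a time, and the result mapped back to pixels by the VQ-VAE decoder.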
Related papers
- Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z)
- Transformer-based Image Generation from Scene Graphs [11.443097632746763]
Graph-structured scene descriptions can be efficiently used in generative models to control the composition of the generated image.
Previous approaches are based on the combination of graph convolutional networks and adversarial methods for layout prediction and image generation.
We show how employing multi-head attention to encode the graph information can improve the quality of the sampled data.
arXiv Detail & Related papers (2023-03-08T14:54:51Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem; a minimal sketch of this framing follows the entry.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
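In contrast to CogView's single-stream formulation above, Parti's framing is encoder-decoder. Here is a minimal, hypothetical PyTorch sketch of that sequence-to-sequence setup; the ViT-VQGAN tokenizer is stubbed out as an embedding over pre-computed discrete codes, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000   # illustrative text vocabulary size
IMAGE_VOCAB = 8_192   # illustrative ViT-VQGAN codebook size
D = 256

class TinySeq2SeqT2I(nn.Module):
    """Text encoder feeding an autoregressive decoder over image codes."""
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(TEXT_VOCAB, D)
        self.image_embed = nn.Embedding(IMAGE_VOCAB, D)
        self.seq2seq = nn.Transformer(
            d_model=D, nhead=8, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.head = nn.Linear(D, IMAGE_VOCAB)

    def forward(self, text_ids, image_ids):
        n = image_ids.size(1)
        # Causal mask on the decoder side only; the text encoder is bidirectional.
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.seq2seq(self.text_embed(text_ids),
                         self.image_embed(image_ids), tgt_mask=causal)
        return self.head(h)

# Teacher-forced training step: predict each next image code from the text.
text = torch.randint(0, TEXT_VOCAB, (2, 12))
image = torch.randint(0, IMAGE_VOCAB, (2, 64))   # pre-computed discrete codes
logits = TinySeq2SeqT2I()(text, image)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, IMAGE_VOCAB), image[:, 1:].reshape(-1))
loss.backward()
```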
- Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer [40.04085054791994]
We propose Draft-and-Revise, an effective image generation framework with a Contextual RQ-Transformer that considers global contexts during the generation process.
In experiments, our method achieves state-of-the-art results on conditional image generation.
arXiv Detail & Related papers (2022-06-09T12:25:24Z)
- CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers [16.255516347736535]
Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation.
CogVideo is trained by inheriting a pretrained text-to-image model, CogView2.
CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
arXiv Detail & Related papers (2022-05-29T19:02:15Z)
- Overparameterization Improves StyleGAN Inversion [66.8300251627992]
Existing inversion approaches obtain promising yet imperfect results.
We show that overparameterization allows us to obtain near-perfect image reconstruction without the need for encoders.
Our approach also retains editability, which we demonstrate by realistically interpolating between images.
arXiv Detail & Related papers (2022-05-12T18:42:43Z)
- CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers [17.757983821569994]
A new text-to-image system, CogView2, shows very competitive generation compared to the concurrent state-of-the-art DALL-E-2.
arXiv Detail & Related papers (2022-04-28T15:51:11Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches; a minimal sketch of this patch-token idea follows the entry.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
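A minimal PyTorch sketch of the patch-token idea in the VST entry above; the patch size, widths, and saliency head are illustrative, and VST's actual backbone and multi-task decoders are not modeled.

```python
import torch
import torch.nn as nn

class TinyPatchTransformer(nn.Module):
    """Embed image patches as tokens and mix global context among them."""
    def __init__(self, patch=16, dim=256, depth=2):
        super().__init__()
        # A conv with stride == kernel size is a standard patch embedder.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_saliency = nn.Linear(dim, 1)  # per-patch saliency logit

    def forward(self, images):                                     # (B, 3, H, W)
        tokens = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens)      # every patch attends to every patch
        return self.to_saliency(tokens)    # (B, N, 1): coarse per-patch saliency

out = TinyPatchTransformer()(torch.randn(1, 3, 224, 224))  # -> (1, 196, 1)
```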
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)
- VCE: Variational Convertor-Encoder for One-Shot Generalization [3.86981854389977]
Variational Convertor-Encoder (VCE) converts an image to various styles.
We present this novel architecture for the problem of one-shot generalization.
We also improve the performance of the variational auto-encoder (VAE) to filter out blurred points.
arXiv Detail & Related papers (2020-11-12T07:58:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.