Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
- URL: http://arxiv.org/abs/2206.10789v1
- Date: Wed, 22 Jun 2022 01:11:29 GMT
- Title: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
- Authors: Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui
Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben
Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge,
Yonghui Wu
- Abstract summary: Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
- Score: 95.02406834386814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the Pathways Autoregressive Text-to-Image (Parti) model, which
generates high-fidelity photorealistic images and supports content-rich
synthesis involving complex compositions and world knowledge. Parti treats
text-to-image generation as a sequence-to-sequence modeling problem, akin to
machine translation, with sequences of image tokens as the target outputs
rather than text tokens in another language. This strategy can naturally tap
into the rich body of prior work on large language models, which have seen
continued advances in capabilities and performance through scaling data and
model sizes. Our approach is simple: First, Parti uses a Transformer-based
image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
Second, we achieve consistent quality improvements by scaling the
encoder-decoder Transformer model up to 20B parameters, with a new
state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on
MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts
(P2), a new holistic benchmark of over 1600 English prompts, demonstrates the
effectiveness of Parti across a wide variety of categories and difficulty
aspects. We also explore and highlight limitations of our models in order to
define and exemplify key areas of focus for further improvements. See
https://parti.research.google/ for high-resolution images.
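The abstract's two-step recipe (a ViT-VQGAN image tokenizer followed by an encoder-decoder Transformer over text and image tokens) can be illustrated with a minimal sketch. The code below is not the released Parti implementation: the toy tokenizer, vocabulary sizes, model width, layer counts, and training step are all illustrative assumptions standing in for ViT-VQGAN and the 20B-parameter model.
```python
# A minimal sketch (not the released Parti code) of text-to-image generation framed
# as sequence-to-sequence modeling: text tokens in, discrete image tokens out.
# The toy tokenizer stands in for ViT-VQGAN; sizes and settings are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 32_000, 8_192   # assumed vocabulary sizes
D_MODEL = 512                             # assumed model width

class ToyImageTokenizer(nn.Module):
    """Stand-in for a ViT-VQGAN: encodes an image into a grid of discrete codes."""
    def __init__(self):
        super().__init__()
        # Patchify a 256x256 RGB image into a 16x16 grid of features, then quantize.
        self.encoder = nn.Conv2d(3, D_MODEL, kernel_size=16, stride=16)
        self.codebook = nn.Embedding(IMAGE_VOCAB, D_MODEL)

    @torch.no_grad()
    def encode(self, images):                        # images: (B, 3, 256, 256)
        feats = self.encoder(images)                 # (B, D, 16, 16)
        feats = feats.flatten(2).transpose(1, 2)     # (B, 256, D)
        codes = self.codebook.weight.unsqueeze(0).expand(feats.size(0), -1, -1)
        return torch.cdist(feats, codes).argmin(-1)  # (B, 256) nearest-code ids

class Seq2SeqTextToImage(nn.Module):
    """Encoder-decoder Transformer: text tokens -> autoregressive image tokens."""
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)
        self.image_emb = nn.Embedding(IMAGE_VOCAB, D_MODEL)
        self.transformer = nn.Transformer(d_model=D_MODEL, nhead=8,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2, batch_first=True)
        self.to_logits = nn.Linear(D_MODEL, IMAGE_VOCAB)

    def forward(self, text_ids, image_ids):
        # Teacher forcing: predict image token t from the text and image tokens < t.
        tgt_in = image_ids[:, :-1]
        mask = self.transformer.generate_square_subsequent_mask(tgt_in.size(1))
        h = self.transformer(self.text_emb(text_ids), self.image_emb(tgt_in),
                             tgt_mask=mask)
        return self.to_logits(h)                     # (B, T-1, IMAGE_VOCAB)

# One illustrative training step on random data.
tokenizer, model = ToyImageTokenizer(), Seq2SeqTextToImage()
text_ids = torch.randint(0, TEXT_VOCAB, (2, 32))
image_ids = tokenizer.encode(torch.randn(2, 3, 256, 256))
logits = model(text_ids, image_ids)
loss = F.cross_entropy(logits.reshape(-1, IMAGE_VOCAB), image_ids[:, 1:].reshape(-1))
loss.backward()
```
Framing generation this way is what lets the approach reuse standard autoregressive language-model training and decoding machinery, which is the property the paper exploits when scaling the encoder-decoder model to 20B parameters.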
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the corresponding translation in a target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves performance competitive with cascaded models while using only 70.9% of their parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation [44.740794326596664]
TheaterGen is a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models.
Within this framework, LLMs, acting as "Screenwriter", engage in multi-turn interaction, generating and managing a standardized prompt book.
With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images.
arXiv Detail & Related papers (2024-04-29T17:58:14Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
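As a rough illustration of the quantization idea described above, the sketch below (not the authors' implementation) snaps image features to the nearest entries of a frozen language-model token-embedding table, so each image is represented as a sequence of text-token ids. A randomly initialized table stands in for the pretrained LM embeddings, and all sizes are assumptions.
```python
# A rough sketch of the LQAE idea (not the authors' code): image features are
# vector-quantized against the frozen token-embedding table of a pretrained LM,
# so every image becomes a sequence of text-token ids. A randomly initialized
# table stands in for the frozen LM embeddings; all sizes here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 30_522, 256    # BERT-sized vocabulary (assumed), feature width (assumed)

class LQAESketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, DIM, kernel_size=16, stride=16)   # image -> feature grid
        self.lm_embeddings = nn.Embedding(VOCAB, DIM)                 # stand-in, kept frozen
        self.lm_embeddings.weight.requires_grad_(False)
        self.decoder = nn.ConvTranspose2d(DIM, 3, kernel_size=16, stride=16)

    def forward(self, images):                                        # (B, 3, 64, 64)
        feats = self.encoder(images)                                  # (B, DIM, 4, 4)
        b, d, h, w = feats.shape
        flat = feats.permute(0, 2, 3, 1).reshape(-1, d)               # (B*16, DIM)
        # Quantize: snap each feature to its nearest frozen LM token embedding.
        ids = torch.cdist(flat, self.lm_embeddings.weight).argmin(-1)
        quant = self.lm_embeddings(ids)
        # Straight-through estimator so reconstruction gradients reach the encoder.
        quant = flat + (quant - flat).detach()
        recon = self.decoder(quant.view(b, h, w, d).permute(0, 3, 1, 2))
        return recon, ids.view(b, h * w)            # reconstruction + the image's "text tokens"

model = LQAESketch()
images = torch.randn(2, 3, 64, 64)
recon, token_ids = model(images)                    # token_ids index the LM vocabulary
loss = F.mse_loss(recon, images)                    # reconstruction objective
loss.backward()
```
Because the resulting token ids live in the LM's own vocabulary, a frozen language model can then be prompted or probed with them directly, which is what enables the few-shot classification and linear-probing uses mentioned above.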
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance on TextCaps for the first time (138.2 vs. 125.5 in CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z)
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)