Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding
- URL: http://arxiv.org/abs/2303.03800v1
- Date: Tue, 7 Mar 2023 11:10:22 GMT
- Title: Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding
- Authors: Jiacheng Li, Longhui Wei, ZongYuan Zhan, Xin He, Siliang Tang, Qi
Tian, Yueting Zhuang
- Abstract summary: We propose Lformer, a semi-autoregressive text-to-image generation model.
By leveraging the 2D structure of image tokens, Lformer achieves faster speed than the existing transformer-based methods.
Lformer can edit images without requiring finetuning.
- Score: 111.16221796950126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative transformers have shown their superiority in synthesizing
high-fidelity and high-resolution images, with advantages such as good diversity
and training stability. However, they suffer from slow generation because they
must produce a long token sequence autoregressively. To accelerate generative
transformers while preserving generation quality, we propose Lformer, a
semi-autoregressive text-to-image generation model. Lformer first encodes an
image into $h{\times}h$ discrete tokens, divides these tokens into $h$ mirrored
L-shape blocks from the top left to the bottom right, and decodes the tokens
within a block in parallel at each step. Like autoregressive models, Lformer
predicts the area adjacent to the previously decoded context, which keeps
generation stable while accelerating it. By leveraging the 2D structure of image
tokens, Lformer generates faster than existing transformer-based methods while
maintaining good generation quality. Moreover, the pretrained Lformer can edit
images without finetuning: it can roll back to earlier decoding steps for
regeneration, or edit an image given a bounding box and a text prompt.
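To make the block layout concrete, below is a minimal Python sketch of how an $h{\times}h$ token grid can be partitioned into $h$ nested L-shape blocks and decoded one block per step. `predict_block` is a hypothetical stand-in for the transformer forward pass, and the exact token ordering inside each block is an assumption rather than the paper's implementation.

```python
import numpy as np

def l_shape_blocks(h):
    """Partition an h x h token grid into h nested L-shape blocks.

    Block i covers row i at columns 0..i plus column i at rows 0..i-1,
    i.e. the "L" that grows the decoded (i x i) top-left square into an
    (i+1) x (i+1) square, moving from the top left to the bottom right.
    """
    blocks = []
    for i in range(h):
        block = [(i, c) for c in range(i + 1)]   # horizontal arm of the L
        block += [(r, i) for r in range(i)]      # vertical arm of the L
        blocks.append(block)
    return blocks

def semi_autoregressive_decode(h, predict_block):
    """Decode an h x h grid in h steps, one L-shape block per step.

    `predict_block(grid, block)` is a hypothetical callable returning
    token ids for every position in `block` in parallel, conditioned on
    the already decoded context in `grid` (and, in Lformer, the text).
    """
    grid = np.full((h, h), -1, dtype=np.int64)   # -1 marks undecoded cells
    for block in l_shape_blocks(h):
        tokens = predict_block(grid, block)      # parallel within the block
        for (r, c), t in zip(block, tokens):
            grid[r, c] = t
    return grid

# Sanity check: the h blocks tile the grid exactly once.
h = 4
positions = sorted(p for b in l_shape_blocks(h) for p in b)
assert positions == [(r, c) for r in range(h) for c in range(h)]
```

For $h{=}4$ the blocks contain 1, 3, 5, and 7 tokens, so the full grid is produced in 4 block-parallel steps rather than 16 purely autoregressive ones; rolling back to an earlier step then corresponds to keeping only a smaller top-left square and regenerating the rest.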
Related papers
- A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation [45.24970921978198]
This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer.
The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction.
It can generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities.
arXiv Detail & Related papers (2024-10-02T18:10:05Z)
- Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [60.188309982690335]
We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation.
By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-10-02T16:05:27Z)
- Emage: Non-Autoregressive Text-to-Image Generation [63.347052548210236]
Non-autoregressive text-to-image models efficiently generate hundreds of image tokens in parallel.
Our model with 346M parameters generates a 256$\times$256 image in about one second on one V100 GPU.
arXiv Detail & Related papers (2023-12-22T10:01:54Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
- Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for accurate representation.
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
- Improved Masked Image Generation with Token-Critic [16.749458173904934]
We introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer.
With Token-Critic, a state-of-the-art generative transformer significantly improves its performance and outperforms recent diffusion models and GANs in the trade-off between generated image quality and diversity.
arXiv Detail & Related papers (2022-09-09T17:57:21Z)
- CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers [17.757983821569994]
A new text-to-image system, CogView2, shows very competitive generation compared with the concurrent state-of-the-art DALL-E-2.
arXiv Detail & Related papers (2022-04-28T15:51:11Z)
- MaskGIT: Masked Generative Image Transformer [49.074967597485475]
MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
arXiv Detail & Related papers (2022-02-08T23:54:06Z)
- Semi-Autoregressive Transformer for Image Captioning [17.533503295862808]
We introduce a semi-autoregressive model for image captioning (dubbed SATIC).
It keeps the autoregressive property globally but generates words in parallel locally.
Experiments on the MSCOCO image captioning benchmark show that SATIC can achieve a better trade-off without bells and whistles.
arXiv Detail & Related papers (2021-06-17T12:36:33Z)
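The Speculative Jacobi Decoding entry above relies on a probabilistic convergence criterion to accept several drafted tokens per forward pass. The sketch below illustrates the general speculative-sampling-style acceptance rule that such training-free parallel decoders build on; it is a rough illustration under assumed interfaces (`score_fn`, `draft_tokens`, and `draft_probs` are hypothetical names), not the paper's exact algorithm.

```python
import torch

def speculative_parallel_step(score_fn, prefix, draft_tokens, draft_probs):
    """One sketch step of probabilistic parallel (Jacobi-style) decoding.

    score_fn(tokens) -> (len(tokens), vocab) next-token logits, where
                        logits[j] scores the token at position j + 1
    prefix           -> (n,) already-accepted token ids (int64)
    draft_tokens     -> (k,) drafted ids for the next k positions (int64)
    draft_probs      -> (k,) probabilities the draft assigned to those ids
    """
    n, k = prefix.shape[0], draft_tokens.shape[0]
    # A single parallel forward pass scores every drafted position at once.
    logits = score_fn(torch.cat([prefix, draft_tokens]))
    target = torch.softmax(logits[n - 1 : n - 1 + k], dim=-1)      # (k, vocab)
    p_new = target.gather(1, draft_tokens[:, None]).squeeze(1)     # (k,)

    accepted = []
    for i in range(k):
        # Accept drafted token i with probability min(1, p_new / p_draft);
        # stop at the first rejection so the committed tokens still follow
        # the target model's sampling distribution.
        if torch.rand(()) < torch.clamp(p_new[i] / draft_probs[i], max=1.0):
            accepted.append(int(draft_tokens[i]))
        else:
            break
    return accepted  # tokens to append to the prefix this step
```

Accepted tokens are committed to the prefix and the rejected tail is re-drafted in the next iteration, so several image tokens can be produced per forward pass while the randomness of sampling-based decoding is preserved.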
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences of its use.