Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding
- URL: http://arxiv.org/abs/2303.03800v1
- Date: Tue, 7 Mar 2023 11:10:22 GMT
- Title: Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding
- Authors: Jiacheng Li, Longhui Wei, ZongYuan Zhan, Xin He, Siliang Tang, Qi
  Tian, Yueting Zhuang
- Abstract summary: We propose Lformer, a semi-autoregressive text-to-image generation model.
By leveraging the 2D structure of image tokens, Lformer achieves faster speed than the existing transformer-based methods.
Lformer can edit images without the requirement for finetuning.
- Score: 111.16221796950126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Generative transformers have shown their superiority in synthesizing
high-fidelity and high-resolution images, such as good diversity and training
stability. However, they suffer from the problem of slow generation since they
need to generate a long token sequence autoregressively. To better accelerate
the generative transformers while keeping good generation quality, we propose
Lformer, a semi-autoregressive text-to-image generation model. Lformer first
encodes an image into $h{\times}h$ discrete tokens, then divides these tokens
into $h$ mirrored L-shape blocks from the top left to the bottom right and
decodes the tokens of one block in parallel at each step. Lformer predicts the
area adjacent to the previous context, as autoregressive models do, so it is
more stable while accelerating. By leveraging the 2D structure of image tokens,
Lformer achieves faster speed than the existing transformer-based methods while
keeping good generation quality. Moreover, the pretrained Lformer can edit
images without the requirement for finetuning. We can roll back to the early
steps for regeneration or edit the image with a bounding box and a text prompt.
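
The block layout described in the abstract can be illustrated with a small sketch. It assumes, as one plausible reading of "mirrored L-shape blocks from the top left to the bottom right", that the token at (row, col) belongs to block max(row, col); the paper's exact indexing may differ. The function names `l_shape_blocks` and `block_index_grid` are hypothetical and are not taken from the Lformer code.

```python
def l_shape_blocks(h):
    """Group the coordinates of an h x h token grid into h blocks.

    Assumption (not confirmed by the abstract): block k holds every
    token whose larger coordinate equals k, which gives a mirrored
    L shape that grows from the top-left corner outward.
    """
    blocks = [[] for _ in range(h)]
    for row in range(h):
        for col in range(h):
            blocks[max(row, col)].append((row, col))
    return blocks


def block_index_grid(h):
    """Return an h x h grid whose entries are the block (decoding step) index."""
    return [[max(row, col) for col in range(h)] for row in range(h)]


if __name__ == "__main__":
    h = 4
    # Each printed number is the step at which that token would be decoded.
    for grid_row in block_index_grid(h):
        print(" ".join(str(k) for k in grid_row))

    sizes = [len(block) for block in l_shape_blocks(h)]
    # Block k holds 2k + 1 tokens, so h steps cover all h * h tokens,
    # compared with h * h steps for token-by-token decoding.
    print("tokens per step:", sizes, "total:", sum(sizes))
```

Under this assumed layout, every token in a block has a neighbor in the previous block or on the grid border, which matches the abstract's remark that each step predicts the area adjacent to the previous context.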
 
      
Related papers
- Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation [10.421912048948634]
 We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation.
We reduce the generation steps from 256 to 20 (256$\times$256 res.) and 1024 to 48 without compromising quality on the ImageNet class-conditional generation.
 arXiv  Detail & Related papers  (2025-07-02T17:59:23Z)
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
 Token-Shuffle is a novel method that reduces the number of image tokens in Transformer.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
 arXiv  Detail & Related papers  (2025-04-24T17:59:56Z)
- A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation [45.24970921978198]
 This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer.
The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction.
It can generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities.
 arXiv  Detail & Related papers  (2024-10-02T18:10:05Z)
- Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [60.188309982690335]
 We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation.
By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding.
 arXiv  Detail & Related papers  (2024-10-02T16:05:27Z)
- Emage: Non-Autoregressive Text-to-Image Generation [63.347052548210236]
 Non-autoregressive text-to-image models efficiently generate hundreds of image tokens in parallel.
Our model with 346M parameters generates a 256$\times$256 image in about one second on one V100 GPU.
 arXiv  Detail & Related papers  (2023-12-22T10:01:54Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
 We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
 arXiv  Detail & Related papers  (2023-08-09T17:45:04Z)
- Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
 Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE) which encodes image regions into variable-length codes based on their information densities for accurate representation.
 arXiv  Detail & Related papers  (2023-05-19T14:56:05Z)
- Progressive Text-to-Image Generation [40.09326229583334]
 We present a progressive model for high-fidelity text-to-image generation.
The proposed method takes effect by creating new image tokens from coarse to fine based on the existing context.
The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
 arXiv  Detail & Related papers  (2022-10-05T14:27:20Z)
- Improved Masked Image Generation with Token-Critic [16.749458173904934]
 We introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer.
A state-of-the-art generative transformer significantly improves its performance, and outperforms recent diffusion models and GANs in terms of the trade-off between generated image quality and diversity.
 arXiv  Detail & Related papers  (2022-09-09T17:57:21Z)
- CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers [17.757983821569994]
 A new text-to-image system, CogView2, shows very competitive generation compared to concurrent state-of-the-art DALL-E-2.
 arXiv  Detail & Related papers  (2022-04-28T15:51:11Z)
- MaskGIT: Masked Generative Image Transformer [49.074967597485475]
 MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
 arXiv  Detail & Related papers  (2022-02-08T23:54:06Z)
- Semi-Autoregressive Transformer for Image Captioning [17.533503295862808]
 We introduce a semi-autoregressive model for image captioning (dubbed SATIC).
It keeps the autoregressive property globally but generates words in parallel locally.
Experiments on the MSCOCO image captioning benchmark show that SATIC can achieve a better trade-off without bells and whistles.
 arXiv  Detail & Related papers  (2021-06-17T12:36:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     