A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation
- URL: http://arxiv.org/abs/2410.01912v1
- Date: Wed, 2 Oct 2024 18:10:05 GMT
- Title: A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation
- Authors: Liang Chen, Sinan Tan, Zefan Cai, Weichu Xie, Haozhe Zhao, Yichi Zhang, Junyang Lin, Jinze Bai, Tianyu Liu, Baobao Chang
- Abstract summary: This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer.
The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction.
It can generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities.
- Score: 45.24970921978198
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction. Compared to traditional 1D autoregression and previous work utilizing similar 2D image decomposition such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer's potential extends beyond generating natural images. It can even generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities. This has not been previously demonstrated for popular vision generative models such as diffusion models, showing a spark of vision-language intelligence when trained solely on images. Code, datasets and models are open at https://github.com/chenllliang/DnD-Transformer.
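As a reading aid, here is a minimal sketch of the two-direction decoding loop the abstract describes, assuming a causal transformer backbone, one output head per depth level, and a summed code-embedding scheme. These module names and the conditioning scheme are assumptions for illustration, not the released implementation (see the GitHub link above for the actual code).

```python
import torch

@torch.no_grad()
def generate_dnd(backbone, depth_heads, code_embed, seq_len, depth):
    """Decode depth x seq_len VQ codes: one backbone pass per sequence
    position (direction 1), then `depth` code predictions at that
    position (direction 2). All module names are assumed, not official."""
    E = code_embed.embedding_dim
    tokens = torch.zeros(1, seq_len, E)          # slot 0 acts as a BOS placeholder
    codes = torch.zeros(1, seq_len, depth, dtype=torch.long)
    for t in range(seq_len):                     # autoregression over sequence length
        h = backbone(tokens[:, : t + 1])[:, -1]  # (1, E) hidden state at position t
        step = torch.zeros(1, E)
        for d in range(depth):                   # autoregression over model depth
            logits = depth_heads[d](h + step)    # assumed: one linear head per depth level
            code = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            codes[0, t, d] = code.item()
            step = step + code_embed(code).squeeze(1)  # deeper codes see shallower ones
        if t + 1 < seq_len:
            tokens[:, t + 1] = step              # summed embedding feeds the next position
    return codes                                 # handed to the VQ decoder to render pixels
```

With depth = 1 this collapses to ordinary 1D autoregression; a larger depth lets the same backbone size and sequence length carry more codes per image, which is the optimization angle the abstract claims.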
Related papers
- Autoregressive model path dependence near Ising criticality
We study the reconstruction of critical correlations in the two-dimensional (2D) Ising model.
We compare the training performance for a number of different 1D autoregressive sequences imposed on finite-size 2D lattices.
arXiv Detail & Related papers (2024-08-28T11:21:33Z)
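A toy illustration of the path choices the entry above refers to: two ways of imposing a 1D autoregressive ordering on a 2D lattice. Raster and snake orderings are chosen here only as representative examples; the specific sequences the paper compares may differ.

```python
import numpy as np

def raster_path(L):
    """Row-major order: left-to-right on every row."""
    return [(i, j) for i in range(L) for j in range(L)]

def snake_path(L):
    """Boustrophedon order: alternate row direction so consecutive
    sequence positions stay lattice neighbors."""
    return [(i, j if i % 2 == 0 else L - 1 - j)
            for i in range(L) for j in range(L)]

def flatten(spins, path):
    """Map a 2D spin configuration to the 1D sequence an AR model sees."""
    return np.array([spins[i, j] for i, j in path])

L = 4
spins = np.random.choice([-1, 1], size=(L, L))
print(flatten(spins, raster_path(L)))
print(flatten(spins, snake_path(L)))
```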
- Hybrid Fourier Score Distillation for Efficient One Image to 3D Object Generation
Single image-to-3D generation is pivotal for crafting controllable 3D assets.
We propose a 2D-3D hybrid Fourier Score Distillation objective function, hy-FSD.
hy-FSD can be integrated into existing 3D generation methods and produce significant performance gains.
arXiv Detail & Related papers (2024-05-31T08:11:25Z)
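For context on the entry above: hy-FSD builds on score distillation, whose vanilla (SDS) gradient is sketched below. The 2D-3D hybrid and Fourier-domain parts are not specified in the summary, so they are only noted in a comment; `diffusion_eps` and `alpha_bar` are assumed inputs.

```python
import torch

def sds_grad(diffusion_eps, rendered, t, alpha_bar):
    """Vanilla score-distillation gradient w.r.t. a rendered image.
    `diffusion_eps` is an assumed callable predicting the noise added
    at timestep t; `alpha_bar` is the cumulative noise schedule."""
    noise = torch.randn_like(rendered)
    a = alpha_bar[t]
    noisy = a.sqrt() * rendered + (1 - a).sqrt() * noise
    eps_pred = diffusion_eps(noisy, t)   # frozen 2D diffusion prior
    w = 1 - a                            # common timestep weighting
    # hy-FSD (per the summary) reshapes this signal with a 2D-3D hybrid
    # Fourier-domain objective; that part is omitted here.
    return w * (eps_pred - noise)
```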
- RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model
Text-to-image generation (TTI) refers to models that take text input and generate high-fidelity images matching the text description.
Diffusion models are one prominent type of generative model, producing images through the systematic introduction of noise over repeated steps.
In the era of large models, scaling up model size and integrating with large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
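The "systematic introduction of noise over repeated steps" in the survey entry above is the diffusion forward process; a minimal DDPM-style sketch, with the tensor shape and beta schedule chosen only for illustration:

```python
import torch

def forward_diffuse(x0, betas):
    """Progressively noise a clean image x0 over len(betas) steps;
    generation learns to invert this trajectory step by step."""
    xs = [x0]
    x = x0
    for beta in betas:
        noise = torch.randn_like(x)
        x = (1 - beta).sqrt() * x + beta.sqrt() * noise  # one noising step
        xs.append(x)
    return xs

x0 = torch.randn(1, 3, 8, 8)   # stand-in "image"
trajectory = forward_diffuse(x0, torch.linspace(1e-4, 0.02, 10))
```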
- Restormer: Efficient Transformer for High-Resolution Image Restoration
Convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data.
Transformers have shown significant performance gains on natural language and high-level vision tasks.
Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks.
arXiv Detail & Related papers (2021-11-18T18:59:10Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
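A hedged sketch of the two attention branches the Transformer-LS entry above describes. The random projection stands in for the paper's learned dynamic projection, and the additive fusion simplifies its aggregation; only the long-branch/short-branch split is taken from the summary.

```python
import torch
import torch.nn.functional as F

def long_short_attention(q, k, v, r=8, w=4):
    """Long branch: attention over a length-compressed key/value set.
    Short branch: attention restricted to a local causal window of w."""
    B, N, D = q.shape
    # Long-range branch: compress N keys/values into r "landmarks".
    # torch.randn is a stand-in for a learned, data-dependent projection.
    weights = F.softmax(k @ torch.randn(D, r, device=k.device), dim=1)  # (B, N, r)
    k_long = weights.transpose(1, 2) @ k                                # (B, r, D)
    v_long = weights.transpose(1, 2) @ v
    long_out = F.softmax(q @ k_long.transpose(1, 2) / D ** 0.5, -1) @ v_long
    # Short-term branch: banded causal mask keeps only the last w positions.
    mask = torch.ones(N, N, device=q.device).tril().triu(-w + 1)
    scores = (q @ k.transpose(1, 2)) / D ** 0.5
    scores = scores.masked_fill(mask == 0, float("-inf"))
    short_out = F.softmax(scores, -1) @ v
    return (long_out + short_out) / 2   # assumed fusion of the two branches
```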
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)
- Towards a Neural Graphics Pipeline for Controllable Image Generation
We present Neural Graphics Pipeline (NGP), a hybrid generative model that brings together neural and traditional image formation models.
NGP decomposes the image into a set of interpretable appearance feature maps, uncovering direct control handles for controllable image generation.
We demonstrate the effectiveness of our approach on controllable image generation of single-object scenes.
arXiv Detail & Related papers (2020-06-18T14:22:54Z)