Wavelets Are All You Need for Autoregressive Image Generation
- URL: http://arxiv.org/abs/2406.19997v2
- Date: Tue, 19 Nov 2024 12:28:19 GMT
- Title: Wavelets Are All You Need for Autoregressive Image Generation
- Authors: Wael Mattar, Idan Levy, Nir Sharon, Shai Dekel
- Abstract summary: We take a new approach to autoregressive image generation that is based on two main ingredients.
The first is wavelet image coding, which makes it possible to tokenize the visual details of an image from coarse to fine.
The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this 'wavelet language'.
- Score: 1.187456026346823
- Abstract: In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which makes it possible to tokenize the visual details of an image from coarse to fine by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this 'wavelet language'. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.
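The coarse-to-fine, bit-significance ordering lends itself to a compact illustration. Below is a minimal sketch, not the authors' tokenizer: it assumes PyWavelets (pywt), a grayscale image, a simple magnitude quantization, and a binary per-bit token vocabulary, in the spirit of classical wavelet bit-plane coders; the function name and parameters are illustrative.

```python
# A minimal sketch (not the paper's code) of coarse-to-fine wavelet
# tokenization: emit bits from the most significant plane down, scanning
# subbands coarse to fine, so early tokens carry the large-scale structure.
import numpy as np
import pywt

def wavelet_bit_tokens(image: np.ndarray, wavelet="haar", levels=3, planes=8):
    # Multi-level 2-D DWT: coeffs[0] is the coarsest approximation,
    # later entries hold (horizontal, vertical, diagonal) detail bands.
    coeffs = pywt.wavedec2(image, wavelet, level=levels)
    bands = [coeffs[0]] + [b for triple in coeffs[1:] for b in triple]

    # Quantize coefficient magnitudes to integers so bits are well defined.
    flat = np.concatenate([np.abs(b).ravel() for b in bands])
    q = np.round(flat / flat.max() * (2**planes - 1)).astype(np.int64)

    # Most significant bit-plane first; within a plane, coarse bands first.
    tokens = []
    for plane in range(planes - 1, -1, -1):
        tokens.extend(((q >> plane) & 1).tolist())
    return tokens

tokens = wavelet_bit_tokens(np.random.rand(64, 64))
print(len(tokens), tokens[:16])
```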
Related papers
- Cross-Image Attention for Zero-Shot Appearance Transfer [68.43651329067393]
We introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images (a minimal sketch follows this entry).
We harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint.
arXiv Detail & Related papers (2023-11-06T18:33:24Z) - Exploring Invariance in Images through One-way Wave Equations [96.90549064390608]
- Exploring Invariance in Images through One-way Wave Equations [96.90549064390608]
In this paper, we empirically reveal an invariance over images: images share a set of one-way wave equations with latent speeds.
We demonstrate it using an intuitive encoder-decoder framework where each image is encoded into its corresponding initial condition.
arXiv Detail & Related papers (2023-10-19T17:59:37Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention-based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should attend to (a simplified top-k sketch follows this entry).
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - Transformer-Based Deep Image Matching for Generalizable Person
Re-identification [114.56752624945142]
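A minimal sketch of sparse attention in this spirit: each query attends only to its top-k most similar keys instead of all positions. The fixed k here is a simplification; DynaST's dynamic-attention unit learns to vary the number of attended tokens.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=8):
    scores = (q @ k.T) / q.shape[-1] ** 0.5          # (Nq, Nk) similarities
    vals, idx = scores.topk(top_k, dim=-1)           # keep k best keys per query
    sparse = torch.full_like(scores, float("-inf"))  # mask out the rest
    sparse.scatter_(-1, idx, vals)
    return F.softmax(sparse, dim=-1) @ v             # sparse weighted sum

q, k, v = (torch.randn(128, 32) for _ in range(3))
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([128, 32])
```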
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers to image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with its softmax weighting, keeping only the query-key similarity (a minimal sketch follows this entry).
arXiv Detail & Related papers (2021-05-30T05:38:33Z) - High-Resolution Complex Scene Synthesis with Transformers [6.445605125467574]
- High-Resolution Complex Scene Synthesis with Transformers [6.445605125467574]
Coarse-grained synthesis of complex scene images via deep generative models has recently gained popularity.
We present an approach to this task, where the generative model is based on pure likelihood training without additional objectives.
We show that the resulting system is able to synthesize high-quality images consistent with the given layouts.
arXiv Detail & Related papers (2021-05-13T17:56:07Z) - TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task.
We employ a restrictive CNN with small and non-overlapping receptive fields (RF) for token representation (a minimal sketch follows this entry).
In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z) - Taming Transformers for High-Resolution Image Synthesis [16.86600007830682]
- Taming Transformers for High-Resolution Image Synthesis [16.86600007830682]
Transformers are designed to learn long-range interactions on sequential data.
They contain no inductive bias that prioritizes local interactions.
This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images.
We show how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images (a rough two-stage sketch follows this entry).
arXiv Detail & Related papers (2020-12-17T18:57:28Z) - Spatially-Adaptive Pixelwise Networks for Fast Image Translation [57.359250882770525]
- Spatially-Adaptive Pixelwise Networks for Fast Image Translation [57.359250882770525]
We introduce a new generator architecture, aimed at fast and efficient high-resolution image-to-image translation.
We use pixel-wise networks; that is, each pixel is processed independently of others (a minimal sketch follows this entry).
Our model is up to 18x faster than state-of-the-art baselines.
arXiv Detail & Related papers (2020-12-05T10:02:03Z)