MaskGIT: Masked Generative Image Transformer
- URL: http://arxiv.org/abs/2202.04200v1
- Date: Tue, 8 Feb 2022 23:54:06 GMT
- Title: MaskGIT: Masked Generative Image Transformer
- Authors: Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman
- Abstract summary: MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
- Score: 49.074967597485475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative transformers have experienced rapid popularity growth in the
computer vision community in synthesizing high-fidelity and high-resolution
images. The best generative transformer models so far, however, still treat an
image naively as a sequence of tokens, and decode an image sequentially
following the raster scan ordering (i.e. line-by-line). We find this strategy
neither optimal nor efficient. This paper proposes a novel image synthesis
paradigm using a bidirectional transformer decoder, which we term MaskGIT.
During training, MaskGIT learns to predict randomly masked tokens by attending
to tokens in all directions. At inference time, the model begins with
generating all tokens of an image simultaneously, and then refines the image
iteratively conditioned on the previous generation. Our experiments demonstrate
that MaskGIT significantly outperforms the state-of-the-art transformer model
on the ImageNet dataset, and accelerates autoregressive decoding by up to 64x.
Besides, we illustrate that MaskGIT can be easily extended to various image
editing tasks, such as inpainting, extrapolation, and image manipulation.
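
Below is a minimal sketch of the iterative parallel decoding loop described in the abstract: start from a fully masked token canvas, predict every token in parallel, keep the most confident predictions, and re-mask the rest for the next refinement pass. The `transformer` callable, the token count (256, i.e. a 16x16 latent grid), the codebook size, the special `mask_id`, and the cosine masking schedule are illustrative assumptions, not a verbatim reproduction of the authors' implementation.

```python
import math
import torch

def maskgit_decode(transformer, num_tokens=256, codebook_size=1024,
                   mask_id=1024, steps=8):
    """Iterative parallel decoding, starting from a fully masked canvas.

    Assumes `transformer(tokens)` maps a (1, num_tokens) id sequence
    (masked positions set to `mask_id`) to logits of shape
    (1, num_tokens, codebook_size). All names here are illustrative.
    """
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long)

    for t in range(steps):
        logits = transformer(tokens)            # predict all positions at once
        probs = logits.softmax(dim=-1)

        # Sample a candidate token for every position in parallel.
        sampled = torch.multinomial(probs.view(-1, codebook_size), 1).view(1, num_tokens)
        confidence = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

        # Positions decided in earlier iterations are kept and never re-masked.
        unknown = tokens == mask_id
        tokens = torch.where(unknown, sampled, tokens)
        confidence = confidence.masked_fill(~unknown, float("inf"))

        # Cosine schedule: the fraction of tokens still masked shrinks to zero.
        num_to_mask = int(num_tokens * math.cos(math.pi / 2 * (t + 1) / steps))
        if num_to_mask > 0:
            # Re-mask the least confident predictions so the next pass refines them.
            remask = confidence[0].topk(num_to_mask, largest=False).indices
            tokens[0, remask] = mask_id

    return tokens  # pass to the VQ decoder to reconstruct the image
```

The editing extensions mentioned above fit the same loop: for inpainting or extrapolation, the tokens of the known region are filled in before decoding and only the missing region starts at `mask_id`, so the procedure itself is unchanged.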
Related papers
- Lazy Diffusion Transformer for Interactive Image Editing [79.75128130739598]
We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently.
Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications.
arXiv Detail & Related papers (2024-04-18T17:59:27Z)
- M2T: Masking Transformers Twice for Faster Decoding [39.6722311745861]
We show how bidirectional transformers trained for masked token prediction can be applied to neural image compression.
We demonstrate that predefined, deterministic schedules perform as well or better for image compression.
arXiv Detail & Related papers (2023-04-14T14:25:44Z)
- MaskSketch: Unpaired Structure-guided Masked Image Generation [56.88038469743742]
MaskSketch is an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling.
We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image.
Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure.
arXiv Detail & Related papers (2023-02-10T20:27:02Z)
- Improved Masked Image Generation with Token-Critic [16.749458173904934]
We introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer.
With Token-Critic guiding sampling, a state-of-the-art generative transformer significantly improves its performance and outperforms recent diffusion models and GANs in the trade-off between generated image quality and diversity.
arXiv Detail & Related papers (2022-09-09T17:57:21Z)
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
- Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
We propose a Multi-Tailed Vision Transformer (MT-ViT).
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs, an asymmetric encoder-decoder architecture and a high masking ratio, enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.