MaskGIT: Masked Generative Image Transformer
- URL: http://arxiv.org/abs/2202.04200v1
- Date: Tue, 8 Feb 2022 23:54:06 GMT
- Title: MaskGIT: Masked Generative Image Transformer
- Authors: Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman
- Abstract summary: MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
- Score: 49.074967597485475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative transformers have experienced rapid popularity growth in the
computer vision community in synthesizing high-fidelity and high-resolution
images. The best generative transformer models so far, however, still treat an
image naively as a sequence of tokens, and decode an image sequentially
following the raster scan ordering (i.e. line-by-line). We find this strategy
neither optimal nor efficient. This paper proposes a novel image synthesis
paradigm using a bidirectional transformer decoder, which we term MaskGIT.
During training, MaskGIT learns to predict randomly masked tokens by attending
to tokens in all directions. At inference time, the model begins with
generating all tokens of an image simultaneously, and then refines the image
iteratively conditioned on the previous generation. Our experiments demonstrate
that MaskGIT significantly outperforms the state-of-the-art transformer model
on the ImageNet dataset, and accelerates autoregressive decoding by up to 64x.
Besides, we illustrate that MaskGIT can be easily extended to various image
editing tasks, such as inpainting, extrapolation, and image manipulation.
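
Below is a minimal sketch of the iterative parallel decoding loop described in the abstract: start from a fully masked token canvas, predict every token in parallel, keep the most confident predictions, and re-mask the rest for the next refinement pass. The `transformer` callable, the token count (256, i.e. a 16x16 latent grid), the codebook size, the special `mask_id`, and the cosine masking schedule are illustrative assumptions, not a verbatim reproduction of the authors' implementation.

```python
import math
import torch

def maskgit_decode(transformer, num_tokens=256, codebook_size=1024,
                   mask_id=1024, steps=8):
    """Iterative parallel decoding, starting from a fully masked canvas.

    Assumes `transformer(tokens)` maps a (1, num_tokens) id sequence
    (masked positions set to `mask_id`) to logits of shape
    (1, num_tokens, codebook_size). All names here are illustrative.
    """
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long)

    for t in range(steps):
        logits = transformer(tokens)            # predict all positions at once
        probs = logits.softmax(dim=-1)

        # Sample a candidate token for every position in parallel.
        sampled = torch.multinomial(probs.view(-1, codebook_size), 1).view(1, num_tokens)
        confidence = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

        # Positions decided in earlier iterations are kept and never re-masked.
        unknown = tokens == mask_id
        tokens = torch.where(unknown, sampled, tokens)
        confidence = confidence.masked_fill(~unknown, float("inf"))

        # Cosine schedule: the fraction of tokens still masked shrinks to zero.
        num_to_mask = int(num_tokens * math.cos(math.pi / 2 * (t + 1) / steps))
        if num_to_mask > 0:
            # Re-mask the least confident predictions so the next pass refines them.
            remask = confidence[0].topk(num_to_mask, largest=False).indices
            tokens[0, remask] = mask_id

    return tokens  # pass to the VQ decoder to reconstruct the image
```

The editing extensions mentioned above fit the same loop: for inpainting or extrapolation, the tokens of the known region are filled in before decoding and only the missing region starts at `mask_id`, so the procedure itself is unchanged.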
Related papers
- Lazy Diffusion Transformer for Interactive Image Editing [79.75128130739598]
We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently.
Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications.
arXiv Detail & Related papers (2024-04-18T17:59:27Z)
- M2T: Masking Transformers Twice for Faster Decoding [39.6722311745861]
We show how bidirectional transformers trained for masked token prediction can be applied to neural image compression.
We demonstrate that predefined, deterministic schedules perform as well or better for image compression.
arXiv Detail & Related papers (2023-04-14T14:25:44Z)
- MaskSketch: Unpaired Structure-guided Masked Image Generation [56.88038469743742]
MaskSketch is an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling.
We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image.
Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure.
arXiv Detail & Related papers (2023-02-10T20:27:02Z)
- Improved Masked Image Generation with Token-Critic [16.749458173904934]
We introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer.
With Token-Critic guiding sampling, a state-of-the-art generative transformer significantly improves its performance and outperforms recent diffusion models and GANs in the trade-off between generated image quality and diversity.
arXiv Detail & Related papers (2022-09-09T17:57:21Z)
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
- Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
We propose a Multi-Tailed Vision Transformer (MT-ViT).
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs, an asymmetric encoder-decoder architecture and a high masking ratio, enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.