A Pytorch Reproduction of Masked Generative Image Transformer
- URL: http://arxiv.org/abs/2310.14400v1
- Date: Sun, 22 Oct 2023 20:21:11 GMT
- Title: A Pytorch Reproduction of Masked Generative Image Transformer
- Authors: Victor Besnier and Mickael Chen
- Abstract summary: We present a reproduction of MaskGIT: Masked Generative Image Transformer, using PyTorch.
The approach leverages a masked bidirectional transformer architecture, enabling image generation in only a few steps.
We achieve results that closely align with the findings presented in the original paper.
- Score: 4.205139792076062
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this technical report, we present a reproduction of MaskGIT: Masked
Generative Image Transformer, using PyTorch. The approach involves leveraging a
masked bidirectional transformer architecture, enabling image generation in
only a few steps (8 to 16) for 512 x 512 resolution images, i.e., ~64x faster
than an auto-regressive approach. Through rigorous experimentation and
optimization, we achieved results that closely align with the findings
presented in the original paper. We match the reported FID of 7.32 with our
replication and obtain 7.59 with similar hyperparameters on ImageNet at
resolution 512 x 512. Moreover, we improve over the official implementation
with some minor hyperparameter tweaking, achieving FID of 7.26. At the lower
resolution of 256 x 256 pixels, our reimplementation scores 6.80, in comparison
to the original paper's 6.18. To promote further research on Masked Generative
Models and facilitate their reproducibility, we released our code and
pre-trained weights openly at https://github.com/valeoai/MaskGIT-pytorch/
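Concretely, the few-step sampling works by iterative parallel decoding: start from a fully masked grid of VQGAN token ids, predict all positions in parallel, keep the most confident predictions, and re-mask the rest on a cosine schedule. Below is a minimal PyTorch sketch of that loop; the `transformer` callable and the greedy confidence rule are illustrative assumptions (MaskGIT samples tokens with temperature), and the repository above is the authoritative implementation.

```python
import math
import torch

@torch.no_grad()
def maskgit_sample(transformer, num_tokens=1024, steps=12,
                   mask_id=1024, device="cpu"):
    # Start from a fully masked 32x32 token grid, flattened to 1024 tokens.
    # `mask_id` is an extra id one past the end of the codebook.
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = transformer(tokens)        # (1, num_tokens, codebook_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)      # greedy pick; MaskGIT samples instead
        still_masked = tokens == mask_id
        # Reveal predictions at masked positions; keep fixed tokens as-is.
        tokens = torch.where(still_masked, pred, tokens)
        # Cosine schedule: fraction of the grid to re-mask for the next step.
        ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        n_mask = int(num_tokens * ratio)
        if n_mask == 0:
            continue
        # Never re-mask tokens that were already fixed before this step.
        conf = conf.masked_fill(~still_masked, float("inf"))
        lowest = conf.topk(n_mask, dim=-1, largest=False).indices
        tokens.scatter_(1, lowest, mask_id)
    return tokens  # decode with the VQGAN decoder to obtain the image
```

With 8 to 16 steps the loop calls the transformer a handful of times, versus one call per token (1024 for a 32x32 grid) in an auto-regressive decoder, which is where the ~64x speedup comes from.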
Related papers
- An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves performance competitive with state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while generating high-quality samples 74x faster.
arXiv Detail & Related papers (2024-06-11T17:59:56Z)
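As a toy illustration of the 1D tokenization above: append a small set of learnable latent tokens to the patch sequence and keep only the latent outputs as the compact 1D representation. The sketch below uses a stock nn.TransformerEncoder; TiTok's actual encoder and its vector quantizer (omitted here) differ in detail.

```python
import torch
import torch.nn as nn

class OneDTokenizer(nn.Module):
    """Toy 1D tokenizer: K learnable latents jointly encoded with patches."""
    def __init__(self, dim=256, num_latents=32, depth=4, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_embeds):           # (B, N_patches, dim)
        b = patch_embeds.size(0)
        lat = self.latents.expand(b, -1, -1)
        # Encode patches and latents together; keep only the latent outputs,
        # which form the 1D token sequence (quantization step omitted).
        out = self.encoder(torch.cat([patch_embeds, lat], dim=1))
        return out[:, -lat.size(1):]           # (B, num_latents, dim)
```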
- Fast Training of Diffusion Models with Masked Transformers [107.77340216247516]
We propose an efficient approach to train large diffusion models with masked transformers.
Specifically, we randomly mask out a high proportion of patches in diffused input images during training.
Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model.
arXiv Detail & Related papers (2023-06-15T17:38:48Z)
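A minimal sketch of the patch-masking idea from the entry above: drop a large random fraction of diffused-image patch tokens before the transformer sees them, so each training step is cheaper. The MAE-style gather and the function name are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def random_mask_patches(patches: torch.Tensor, mask_ratio: float = 0.5):
    """Keep a random subset of patch tokens per sample.

    patches: (B, N, D) diffused-image patch embeddings.
    Returns the kept tokens (B, N_keep, D) and the shuffle indices
    needed to restore order when reconstructing masked patches.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n, device=patches.device)  # per-sample randomness
    ids_shuffle = noise.argsort(dim=1)               # random permutation
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return kept, ids_shuffle
```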
- CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying [52.91778151771145]
In this paper, we break these limitations for the first time, thanks to recent developments in continuous implicit representation.
Experiments show that the proposed method achieves real-time performance on 2048 x 2048 images using a single GTX 2080 Ti GPU.
arXiv Detail & Related papers (2023-03-15T11:13:51Z)
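The continuous-implicit idea above can be sketched as querying an MLP with normalized pixel coordinates, so any output resolution can be synthesized from one network. Note that in CoordFill the MLP parameters are predicted per image by an encoder (a hypernetwork); the fixed weights below are a simplification.

```python
import torch
import torch.nn as nn

class CoordQueryMLP(nn.Module):
    """Toy implicit model: map (y, x) coordinates to RGB values."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, h, w, device="cpu"):
        # Normalized coordinate grid in [-1, 1]; query every pixel at once.
        ys = torch.linspace(-1, 1, h, device=device)
        xs = torch.linspace(-1, 1, w, device=device)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
        rgb = self.net(grid.reshape(-1, 2))
        return rgb.reshape(h, w, 3)
```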
- PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation [88.55256389703082]
Pixel synthesis is a promising research paradigm for image generation, as it can exploit pixel-wise prior knowledge.
In this paper, we propose a progressive pixel synthesis network for efficient image generation, named PixelFolder.
With much lower expenditure, PixelFolder obtains new state-of-the-art (SOTA) performance on two benchmark datasets.
arXiv Detail & Related papers (2022-04-02T10:55:11Z)
- MaskGIT: Masked Generative Image Transformer [49.074967597485475]
MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
arXiv Detail & Related papers (2022-02-08T23:54:06Z)
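For reference, a hedged sketch of the masked-token training objective described above: sample a masking ratio, replace that fraction of VQ token ids with a mask token, and supervise only the masked positions with cross-entropy. The cosine schedule and names are assumptions based on the MaskGIT paper.

```python
import math
import torch
import torch.nn.functional as F

def masked_token_loss(transformer, tokens: torch.Tensor, mask_id: int):
    """tokens: (B, N) VQGAN code indices; transformer returns (B, N, V) logits."""
    b, n = tokens.shape
    # Sample one masking ratio per sample from a cosine schedule.
    r = torch.rand(b, 1, device=tokens.device)
    ratio = torch.cos(r * math.pi / 2)
    mask = torch.rand(b, n, device=tokens.device) < ratio
    inputs = tokens.masked_fill(mask, mask_id)
    logits = transformer(inputs)
    # Cross-entropy only where tokens were masked; the bidirectional model
    # attends to the unmasked context in all directions.
    return F.cross_entropy(logits[mask], tokens[mask])
```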
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers to image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder that drops the full softmax-weighted attention and keeps only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
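A minimal sketch of that simplified decoder's core: raw dot-product similarities between the two images' token features, with no softmax weighting and no value aggregation. The pooling into a single matching score (max over keys, mean over queries) is an illustrative assumption.

```python
import torch

def matching_score(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    """feats_a, feats_b: (B, N, D) token features from the two images.

    Computes query-key similarities only: each token in image A finds its
    best match in image B, and the per-token maxima are averaged into one
    matching score per image pair.
    """
    sim = torch.einsum("bnd,bmd->bnm", feats_a, feats_b)  # all-pairs dot products
    return sim.max(dim=-1).values.mean(dim=-1)            # (B,)
```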
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [44.086393272557416]
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
It surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
arXiv Detail & Related papers (2021-03-25T17:59:31Z)
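Swin's hierarchical design rests on computing self-attention inside local windows that shift between consecutive blocks. The partitioning step can be sketched as below; the roll-based cyclic shift follows the paper's description, while the surrounding details are illustrative.

```python
import torch

def shifted_window_partition(x: torch.Tensor, window: int = 7, shift: int = 0):
    """x: (B, H, W, C) feature map; H and W divisible by `window`.

    Optionally cyclic-shift the map (as in alternating Swin blocks), then
    split it into non-overlapping windows for local self-attention.
    """
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    b, h, w, c = x.shape
    x = x.view(b, h // window, window, w // window, window, c)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, c)
    return windows  # (B * num_windows, window*window, C), ready for attention
```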
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.