A Pytorch Reproduction of Masked Generative Image Transformer
- URL: http://arxiv.org/abs/2310.14400v1
- Date: Sun, 22 Oct 2023 20:21:11 GMT
- Title: A Pytorch Reproduction of Masked Generative Image Transformer
- Authors: Victor Besnier and Mickael Chen
- Abstract summary: We present a reproduction of MaskGIT: Masked Generative Image Transformer, using PyTorch.
The approach involves leveraging a masked bidirectional transformer architecture, enabling image generation in only a few steps.
We achieve results that closely align with the findings presented in the original paper.
- Score: 4.205139792076062
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this technical report, we present a reproduction of MaskGIT: Masked
Generative Image Transformer, using PyTorch. The approach involves leveraging a
masked bidirectional transformer architecture, enabling image generation in
only a few steps (8 to 16) for 512 x 512 resolution images, i.e., ~64x faster
than an auto-regressive approach. Through rigorous experimentation and
optimization, we achieved results that closely align with the findings
presented in the original paper. We match the reported FID of 7.32 with our
replication and obtain 7.59 with similar hyperparameters on ImageNet at
resolution 512 x 512. Moreover, we improve over the official implementation
with some minor hyperparameter tweaking, achieving FID of 7.26. At the lower
resolution of 256 x 256 pixels, our reimplementation scores 6.80, in comparison
to the original paper's 6.18. To promote further research on Masked Generative
Models and facilitate their reproducibility, we released our code and
pre-trained weights openly at https://github.com/valeoai/MaskGIT-pytorch/
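The speed-up follows from decoding many tokens in parallel: a 512 x 512 image corresponds to a 32 x 32 grid of 1024 VQ tokens, so replacing 1024 auto-regressive steps with roughly 16 parallel refinement steps gives the quoted ~64x factor. Below is a minimal sketch of MaskGIT-style iterative decoding with the cosine masking schedule described in the paper; the `model` interface, `mask_id`, and tensor shapes are illustrative assumptions, not the API of the released repository.

```python
# Minimal sketch of MaskGIT-style iterative parallel decoding (hedged example).
# Assumes a pre-trained bidirectional transformer `model` mapping a sequence of
# token ids (masked positions set to `mask_id`) to per-position logits over the
# VQ codebook. Shapes and names are illustrative, not the repository's API.
import math
import torch

@torch.no_grad()
def maskgit_sample(model, seq_len=1024, codebook_size=1024, mask_id=1024,
                   steps=12, device="cuda"):
    # Start from a fully masked token grid (e.g. 32x32 latent tokens for 512x512 images).
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    unknown = torch.ones(1, seq_len, dtype=torch.bool, device=device)

    for step in range(steps):
        logits = model(tokens)                       # (1, seq_len, codebook_size)
        probs = logits.softmax(dim=-1)
        sampled = torch.multinomial(probs.view(-1, codebook_size), 1).view(1, seq_len)
        confidence = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        # Already-committed positions keep their tokens and are never re-masked.
        confidence = confidence.masked_fill(~unknown, float("inf"))

        # Cosine schedule: fraction of tokens that stays masked after this step.
        mask_ratio = math.cos(math.pi / 2.0 * (step + 1) / steps)
        num_keep_masked = int(seq_len * mask_ratio)

        if num_keep_masked == 0:
            tokens = torch.where(unknown, sampled, tokens)
            break
        # Re-mask the least confident positions; commit the rest.
        cutoff = confidence.topk(num_keep_masked, largest=False).values.max()
        keep_masked = (confidence <= cutoff) & unknown
        tokens = torch.where(unknown & ~keep_masked, sampled, tokens)
        unknown = keep_masked
    return tokens  # decode with the VQ-GAN decoder to obtain the final image
```

At each step the least confident predictions are re-masked and refined in the next pass, which is how the bidirectional transformer converges to a coherent token grid in so few iterations.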
Related papers
- Language-Guided Image Tokenization for Generation [63.0859685332583]
TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics.
By conditioning tokenization on descriptive text captions, TexTok lets the tokenizer focus on encoding fine-grained visual details into latent tokens.
TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively.
arXiv Detail & Related papers (2024-12-08T03:18:17Z) - An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves competitive performance to state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
arXiv Detail & Related papers (2024-06-11T17:59:56Z) - Fast Training of Diffusion Models with Masked Transformers [107.77340216247516]
We propose an efficient approach to train large diffusion models with masked transformers.
Specifically, we randomly mask out a high proportion of patches in diffused input images during training.
Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model.
arXiv Detail & Related papers (2023-06-15T17:38:48Z) - CoordFill: Efficient High-Resolution Image Inpainting via Parameterized
Coordinate Querying [52.91778151771145]
In this paper, we break these limitations for the first time by leveraging the recent development of continuous implicit representations.
Experiments show that the proposed method achieves real-time performance on 2048 x 2048 images using a single GTX 2080 Ti GPU.
arXiv Detail & Related papers (2023-03-15T11:13:51Z) - PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image
Generation [88.55256389703082]
Pixel synthesis is a promising research paradigm for image generation, which can well exploit pixel-wise prior knowledge for generation.
In this paper, we propose a progressive pixel synthesis network for efficient image generation, coined PixelFolder.
With much less expenditure, PixelFolder obtains new state-of-the-art (SOTA) performance on two benchmark datasets.
arXiv Detail & Related papers (2022-04-02T10:55:11Z) - MaskGIT: Masked Generative Image Transformer [49.074967597485475]
MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
arXiv Detail & Related papers (2022-02-08T23:54:06Z) - Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [44.086393272557416]
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
It surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
arXiv Detail & Related papers (2021-03-25T17:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.