A Pytorch Reproduction of Masked Generative Image Transformer
- URL: http://arxiv.org/abs/2310.14400v1
- Date: Sun, 22 Oct 2023 20:21:11 GMT
- Title: A Pytorch Reproduction of Masked Generative Image Transformer
- Authors: Victor Besnier and Mickael Chen
- Abstract summary: We present a reproduction of MaskGIT: Masked Generative Image Transformer, using PyTorch.
The approach leverages a masked bidirectional transformer architecture, enabling image generation in only a few steps.
We achieve results that closely align with the findings presented in the original paper.
- Score: 4.205139792076062
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this technical report, we present a reproduction of MaskGIT: Masked
Generative Image Transformer, using PyTorch. The approach involves leveraging a
masked bidirectional transformer architecture, enabling image generation in
only a few steps (8 to 16) for 512 x 512 resolution images, i.e., ~64x faster
than an auto-regressive approach. Through rigorous experimentation and
optimization, we achieved results that closely align with the findings
presented in the original paper. We match the reported FID of 7.32 with our
replication and obtain 7.59 with similar hyperparameters on ImageNet at
resolution 512 x 512. Moreover, we improve over the official implementation
with some minor hyperparameter tweaking, achieving FID of 7.26. At the lower
resolution of 256 x 256 pixels, our reimplementation scores 6.80, in comparison
to the original paper's 6.18. To promote further research on Masked Generative
Models and facilitate their reproducibility, we released our code and
pre-trained weights openly at https://github.com/valeoai/MaskGIT-pytorch/
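Concretely, the few-step sampling works by iterative parallel decoding: start from a fully masked grid of VQGAN token ids, predict all positions in parallel, keep the most confident predictions, and re-mask the rest on a cosine schedule. Below is a minimal PyTorch sketch of that loop; the `transformer` callable and the greedy confidence rule are illustrative assumptions (MaskGIT samples tokens with temperature), and the repository above is the authoritative implementation.

```python
import math
import torch

@torch.no_grad()
def maskgit_sample(transformer, num_tokens=1024, steps=12,
                   mask_id=1024, device="cpu"):
    # Start from a fully masked 32x32 token grid, flattened to 1024 tokens.
    # `mask_id` is an extra id one past the end of the codebook.
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = transformer(tokens)        # (1, num_tokens, codebook_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)      # greedy pick; MaskGIT samples instead
        still_masked = tokens == mask_id
        # Reveal predictions at masked positions; keep fixed tokens as-is.
        tokens = torch.where(still_masked, pred, tokens)
        # Cosine schedule: fraction of the grid to re-mask for the next step.
        ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        n_mask = int(num_tokens * ratio)
        if n_mask == 0:
            continue
        # Never re-mask tokens that were already fixed before this step.
        conf = conf.masked_fill(~still_masked, float("inf"))
        lowest = conf.topk(n_mask, dim=-1, largest=False).indices
        tokens.scatter_(1, lowest, mask_id)
    return tokens  # decode with the VQGAN decoder to obtain the image
```

With 8 to 16 steps the loop calls the transformer a handful of times, versus one call per token (1024 for a 32x32 grid) in an auto-regressive decoder, which is where the ~64x speedup comes from.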
Related papers
- An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves performance competitive with state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while generating high-quality samples 74x faster.
arXiv Detail & Related papers (2024-06-11T17:59:56Z)
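As a toy illustration of the 1D tokenization above: append a small set of learnable latent tokens to the patch sequence and keep only the latent outputs as the compact 1D representation. The sketch below uses a stock nn.TransformerEncoder; TiTok's actual encoder and its vector quantizer (omitted here) differ in detail.

```python
import torch
import torch.nn as nn

class OneDTokenizer(nn.Module):
    """Toy 1D tokenizer: K learnable latents jointly encoded with patches."""
    def __init__(self, dim=256, num_latents=32, depth=4, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_embeds):           # (B, N_patches, dim)
        b = patch_embeds.size(0)
        lat = self.latents.expand(b, -1, -1)
        # Encode patches and latents together; keep only the latent outputs,
        # which form the 1D token sequence (quantization step omitted).
        out = self.encoder(torch.cat([patch_embeds, lat], dim=1))
        return out[:, -lat.size(1):]           # (B, num_latents, dim)
```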
- Fast Training of Diffusion Models with Masked Transformers [107.77340216247516]
We propose an efficient approach to train large diffusion models with masked transformers.
Specifically, we randomly mask out a high proportion of patches in diffused input images during training.
Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model.
arXiv Detail & Related papers (2023-06-15T17:38:48Z)
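A minimal sketch of the patch-masking idea from the entry above: drop a large random fraction of diffused-image patch tokens before the transformer sees them, so each training step is cheaper. The MAE-style gather and the function name are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def random_mask_patches(patches: torch.Tensor, mask_ratio: float = 0.5):
    """Keep a random subset of patch tokens per sample.

    patches: (B, N, D) diffused-image patch embeddings.
    Returns the kept tokens (B, N_keep, D) and the shuffle indices
    needed to restore order when reconstructing masked patches.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n, device=patches.device)  # per-sample randomness
    ids_shuffle = noise.argsort(dim=1)               # random permutation
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return kept, ids_shuffle
```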
- CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying [52.91778151771145]
In this paper, we break these limitations for the first time, thanks to recent developments in continuous implicit representation.
Experiments show that the proposed method achieves real-time performance on 2048 x 2048 images using a single GTX 2080 Ti GPU.
arXiv Detail & Related papers (2023-03-15T11:13:51Z)
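The continuous-implicit idea above can be sketched as querying an MLP with normalized pixel coordinates, so any output resolution can be synthesized from one network. Note that in CoordFill the MLP parameters are predicted per image by an encoder (a hypernetwork); the fixed weights below are a simplification.

```python
import torch
import torch.nn as nn

class CoordQueryMLP(nn.Module):
    """Toy implicit model: map (y, x) coordinates to RGB values."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, h, w, device="cpu"):
        # Normalized coordinate grid in [-1, 1]; query every pixel at once.
        ys = torch.linspace(-1, 1, h, device=device)
        xs = torch.linspace(-1, 1, w, device=device)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
        rgb = self.net(grid.reshape(-1, 2))
        return rgb.reshape(h, w, 3)
```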
- PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation [88.55256389703082]
Pixel synthesis is a promising research paradigm for image generation, as it can exploit pixel-wise prior knowledge.
In this paper, we propose a progressive pixel synthesis network for efficient image generation, named PixelFolder.
With much lower expenditure, PixelFolder obtains new state-of-the-art (SOTA) performance on two benchmark datasets.
arXiv Detail & Related papers (2022-04-02T10:55:11Z)
- MaskGIT: Masked Generative Image Transformer [49.074967597485475]
MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
arXiv Detail & Related papers (2022-02-08T23:54:06Z)
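For reference, a hedged sketch of the masked-token training objective described above: sample a masking ratio, replace that fraction of VQ token ids with a mask token, and supervise only the masked positions with cross-entropy. The cosine schedule and names are assumptions based on the MaskGIT paper.

```python
import math
import torch
import torch.nn.functional as F

def masked_token_loss(transformer, tokens: torch.Tensor, mask_id: int):
    """tokens: (B, N) VQGAN code indices; transformer returns (B, N, V) logits."""
    b, n = tokens.shape
    # Sample one masking ratio per sample from a cosine schedule.
    r = torch.rand(b, 1, device=tokens.device)
    ratio = torch.cos(r * math.pi / 2)
    mask = torch.rand(b, n, device=tokens.device) < ratio
    inputs = tokens.masked_fill(mask, mask_id)
    logits = transformer(inputs)
    # Cross-entropy only where tokens were masked; the bidirectional model
    # attends to the unmasked context in all directions.
    return F.cross_entropy(logits[mask], tokens[mask])
```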
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers to image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder that drops the full softmax-weighted attention and keeps only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
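A minimal sketch of that simplified decoder's core: raw dot-product similarities between the two images' token features, with no softmax weighting and no value aggregation. The pooling into a single matching score (max over keys, mean over queries) is an illustrative assumption.

```python
import torch

def matching_score(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    """feats_a, feats_b: (B, N, D) token features from the two images.

    Computes query-key similarities only: each token in image A finds its
    best match in image B, and the per-token maxima are averaged into one
    matching score per image pair.
    """
    sim = torch.einsum("bnd,bmd->bnm", feats_a, feats_b)  # all-pairs dot products
    return sim.max(dim=-1).values.mean(dim=-1)            # (B,)
```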
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [44.086393272557416]
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
It surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
arXiv Detail & Related papers (2021-03-25T17:59:31Z)
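Swin's hierarchical design rests on computing self-attention inside local windows that shift between consecutive blocks. The partitioning step can be sketched as below; the roll-based cyclic shift follows the paper's description, while the surrounding details are illustrative.

```python
import torch

def shifted_window_partition(x: torch.Tensor, window: int = 7, shift: int = 0):
    """x: (B, H, W, C) feature map; H and W divisible by `window`.

    Optionally cyclic-shift the map (as in alternating Swin blocks), then
    split it into non-overlapping windows for local self-attention.
    """
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    b, h, w, c = x.shape
    x = x.view(b, h // window, window, w // window, window, c)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, c)
    return windows  # (B * num_windows, window*window, C), ready for attention
```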
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.