Halton Scheduler For Masked Generative Image Transformer
- URL: http://arxiv.org/abs/2503.17076v1
- Date: Fri, 21 Mar 2025 12:00:59 GMT
- Title: Halton Scheduler For Masked Generative Image Transformer
- Authors: Victor Besnier, Mickael Chen, David Hurych, Eduardo Valle, Matthieu Cord
- Abstract summary: Masked Generative Image Transformers (MaskGIT) have emerged as a scalable and efficient image generation framework. We analyze the sampling objective in MaskGIT, based on the mutual information between tokens. We propose a new sampling strategy based on our Halton scheduler instead of the original Confidence scheduler.
- Score: 51.82285573627426
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Masked Generative Image Transformers (MaskGIT) have emerged as a scalable and efficient image generation framework, able to deliver high-quality visuals with low inference costs. However, MaskGIT's token unmasking scheduler, an essential component of the framework, has not received the attention it deserves. We analyze the sampling objective in MaskGIT, based on the mutual information between tokens, and elucidate its shortcomings. We then propose a new sampling strategy based on our Halton scheduler instead of the original Confidence scheduler. More precisely, our method selects the tokens' positions according to a quasi-random, low-discrepancy Halton sequence. Intuitively, that method spreads the tokens spatially, progressively covering the image uniformly at each step. Our analysis shows that it reduces non-recoverable sampling errors, leading to simpler hyper-parameter tuning and better-quality images. Our scheduler does not require retraining or noise injection and may serve as a simple drop-in replacement for the original sampling strategy. Evaluation of both class-to-image synthesis on ImageNet and text-to-image generation on the COCO dataset demonstrates that the Halton scheduler outperforms the Confidence scheduler quantitatively by reducing the FID and qualitatively by generating more diverse and more detailed images. Our code is at https://github.com/valeoai/Halton-MaskGIT.
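The core mechanism described above, visiting token positions in the order given by a 2D low-discrepancy Halton sequence so that unmasked tokens progressively cover the image uniformly, is straightforward to prototype. Below is a minimal sketch assuming a square latent-token grid and an even split of positions across steps; the function names and per-step token counts are illustrative and not taken from the authors' released code (linked above).

```python
import numpy as np

def halton(index: int, base: int) -> float:
    """Radical-inverse (Halton) value of `index` in the given base, in [0, 1)."""
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def halton_unmask_order(grid_size: int) -> np.ndarray:
    """Order every position of a grid_size x grid_size token grid along the
    2D Halton sequence (bases 2 and 3), so early-unmasked tokens spread
    uniformly over the image instead of clustering."""
    order, seen, i = [], set(), 1
    while len(order) < grid_size * grid_size:
        x = int(halton(i, 2) * grid_size)
        y = int(halton(i, 3) * grid_size)
        i += 1
        if (x, y) not in seen:  # quantized Halton points can collide; keep the first visit
            seen.add((x, y))
            order.append(y * grid_size + x)
    return np.array(order)

# Example: unmask a 16x16 token grid over 8 steps, 32 positions per step.
order = halton_unmask_order(16)
for step, positions in enumerate(np.array_split(order, 8)):
    print(f"step {step}: unmask token positions {positions[:4]} ...")
```

In a complete sampler, the number of positions revealed per step would typically still follow MaskGIT's usual schedule (e.g., a cosine ramp); per the abstract, only the choice of which positions to reveal changes, so an order like the one above can replace the confidence-based selection without retraining.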
Related papers
- Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval [14.986283867293048]
Zero-shot composed image retrieval (ZS-CIR) takes a textual modification and a reference image as a query to retrieve a target image without triplet labeling.
Current ZS-CIR research mainly relies on the generalization ability of pre-trained vision-language models.
We introduce a novel unlabeled and pre-trained masked tuning approach, which reduces the gap between the pre-trained vision-language model and the downstream CIR task.
arXiv Detail & Related papers (2023-11-13T02:49:57Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [55.12082817901671]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT). MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- MAGIC: Mask-Guided Image Synthesis by Inverting a Quasi-Robust Classifier [37.774220727662914]
We propose a one-shot, mask-guided image synthesis method that allows controlled manipulation of a single image.
Our proposed method, entitled MAGIC, leverages structured gradients from a pre-trained quasi-robust classifier.
MAGIC aggregates gradients over the input, driven by a guide binary mask that enforces a strong, spatial prior.
arXiv Detail & Related papers (2022-09-23T12:15:40Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count (a minimal sketch of this idea appears after this list).
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation [50.70679435176346]
We propose a new mask representation by applying the discrete cosine transform (DCT) to encode the high-resolution binary grid mask into a compact vector (a minimal sketch of this encoding appears after this list).
Our method, termed DCT-Mask, could be easily integrated into most pixel-based instance segmentation methods.
arXiv Detail & Related papers (2020-11-19T15:00:21Z)
- BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation [103.74690082121079]
In this work, we achieve improved mask prediction by effectively combining instance-level information with lower-level, fine-grained semantic information.
Our main contribution is a blender module which draws inspiration from both top-down and bottom-up instance segmentation approaches.
BlendMask can effectively predict dense per-pixel position-sensitive instance features with very few channels, and learn attention maps for each instance with merely one convolution layer.
arXiv Detail & Related papers (2020-01-02T03:30:17Z)
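The ClusTR entry above describes clustering and then aggregating key and value tokens so that attention runs over far fewer slots. The following is a minimal, single-head PyTorch sketch of that general idea; the plain k-means routine, the cluster count, and the use of simple cluster means are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def kmeans(x: torch.Tensor, num_clusters: int, iters: int = 10) -> torch.Tensor:
    """Plain k-means over token embeddings. x: (n, d). Returns assignments (n,)."""
    centroids = x[torch.randperm(x.size(0))[:num_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)
        for c in range(num_clusters):
            members = x[assign == c]
            if members.numel() > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(dim=0)
    return assign

def clustered_attention(q, k, v, num_clusters: int = 64) -> torch.Tensor:
    """Attention where keys/values are aggregated per cluster, so each query
    attends to num_clusters slots instead of all n tokens. q, k, v: (n, d)."""
    assign = kmeans(k, num_clusters)
    onehot = F.one_hot(assign, num_clusters).float()          # (n, c)
    counts = onehot.sum(dim=0).clamp(min=1).unsqueeze(1)      # (c, 1)
    k_c = onehot.t() @ k / counts                             # (c, d) mean key per cluster
    v_c = onehot.t() @ v / counts                             # (c, d) mean value per cluster
    attn = torch.softmax(q @ k_c.t() / k.size(-1) ** 0.5, dim=-1)
    return attn @ v_c                                         # (n, d)

# Toy usage: 1024 tokens of dimension 64 attend over 64 clustered slots.
q = k = v = torch.randn(1024, 64)
print(clustered_attention(q, k, v).shape)  # torch.Size([1024, 64])
```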
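As a companion to the DCT-Mask entry above, here is a minimal sketch of encoding a high-resolution binary mask as a compact vector of low-frequency DCT coefficients and decoding it back. The 128x128 resolution, 300-dimensional vector, and zig-zag coefficient ordering are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import numpy as np
from scipy.fft import dctn, idctn

def zigzag_indices(size: int) -> np.ndarray:
    """(row, col) pairs ordered by anti-diagonal, i.e. roughly low to high frequency."""
    pairs = [(r, c) for r in range(size) for c in range(size)]
    pairs.sort(key=lambda rc: (rc[0] + rc[1], rc[0]))
    return np.array(pairs)

def encode_mask(mask: np.ndarray, dim: int = 300) -> np.ndarray:
    """Encode a size x size binary mask as its first `dim` zig-zag DCT coefficients."""
    coeffs = dctn(mask.astype(np.float32), norm="ortho")
    idx = zigzag_indices(mask.shape[0])[:dim]
    return coeffs[idx[:, 0], idx[:, 1]]

def decode_mask(vector: np.ndarray, size: int = 128, threshold: float = 0.5) -> np.ndarray:
    """Invert encode_mask: place coefficients back, run the inverse DCT, binarize."""
    coeffs = np.zeros((size, size), dtype=np.float32)
    idx = zigzag_indices(size)[:len(vector)]
    coeffs[idx[:, 0], idx[:, 1]] = vector
    return (idctn(coeffs, norm="ortho") >= threshold).astype(np.uint8)

# Toy usage: a 128x128 rectangular mask round-trips through a 300-dim vector.
mask = np.zeros((128, 128), dtype=np.uint8)
mask[32:96, 40:90] = 1
recon = decode_mask(encode_mask(mask))
print(recon.shape, (recon == mask).mean())  # high reconstruction accuracy for simple shapes
```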