M2T: Masking Transformers Twice for Faster Decoding
- URL: http://arxiv.org/abs/2304.07313v1
- Date: Fri, 14 Apr 2023 14:25:44 GMT
- Title: M2T: Masking Transformers Twice for Faster Decoding
- Authors: Fabian Mentzer, Eirikur Agustsson, Michael Tschannen
- Abstract summary: We show how bidirectional transformers trained for masked token prediction can be applied to neural image compression.
We demonstrate that predefined, deterministic schedules perform as well or better for image compression.
- Score: 39.6722311745861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show how bidirectional transformers trained for masked token prediction
can be applied to neural image compression to achieve state-of-the-art results.
Such models were previously used for image generation by progressively sampling
groups of masked tokens according to uncertainty-adaptive schedules. Unlike
these works, we demonstrate that predefined, deterministic schedules perform as
well or better for image compression. This insight allows us to use masked
attention during training in addition to masked inputs, and activation caching
during inference, to significantly speed up our models (~4x higher inference
speed) at a small increase in bitrate.
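To make the decoding idea in the abstract concrete, below is a minimal, hypothetical sketch of decoding token groups under a predefined, deterministic schedule. This is not the authors' implementation: the `model` and `range_decoder` objects, the `mask_id` token, and the row-major grouping are assumptions for illustration. The point is that, because the groups are fixed in advance, each forward pass only needs distributions for the next group, and activations for already-decoded tokens can in principle be cached instead of recomputed.
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def deterministic_schedule(num_tokens, num_steps):
    # Fixed, input-independent grouping of token positions: here simply
    # row-major order split into equal chunks. Any predefined grouping works;
    # unlike uncertainty-adaptive schedules, it is known before decoding starts.
    return np.array_split(np.arange(num_tokens), num_steps)

def decode(model, range_decoder, num_tokens, num_steps, mask_id):
    # Start with every token position masked.
    tokens = np.full(num_tokens, mask_id, dtype=np.int64)
    for group in deterministic_schedule(num_tokens, num_steps):
        # One forward pass per group. Because the schedule (and hence the
        # attention pattern) is fixed, activations for already-decoded tokens
        # could be cached across steps rather than recomputed.
        logits = model(tokens)                       # hypothetical model call
        for idx in group:
            probs = softmax(logits[idx])
            # Entropy-decode each symbol under the predicted distribution
            # (hypothetical range-coder interface).
            tokens[idx] = range_decoder.decode_symbol(probs)
    return tokens
```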
Related papers
- Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning [49.275450836604726]
We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances its efficacy for pre-training.
We employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input.
arXiv Detail & Related papers (2024-09-16T15:10:07Z)
- Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression [60.689646881479064]
We propose a progressive learning paradigm for transformer-based variable-rate image compression.
Inspired by visual prompt tuning, we use LPM to extract prompts for input images and hidden features at the encoder side and decoder side, respectively.
Our model outperforms all current variable-rate image compression methods in terms of rate-distortion performance and approaches the state-of-the-art fixed-rate image compression methods trained from scratch.
arXiv Detail & Related papers (2023-11-23T08:29:32Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- MaskGIT: Masked Generative Image Transformer [49.074967597485475]
MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
arXiv Detail & Related papers (2022-02-08T23:54:06Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
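As a concrete illustration of the random patch masking described in the last entry above (Masked Autoencoders), here is a minimal NumPy sketch. It is not the MAE reference code; the function name, patch size, and mask ratio are assumptions for illustration. The encoder would see only the visible patches, and the decoder would be trained to reconstruct the pixels of the masked ones.
```python
import numpy as np

def random_patch_mask(image, patch=16, mask_ratio=0.75, rng=None):
    """Split an image into non-overlapping patches and randomly mask most of them.
    Returns visible patches (encoder input), masked patches (reconstruction
    targets), and the boolean mask over the patch grid."""
    rng = rng or np.random.default_rng()
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Flatten the image into (num_patches, patch*patch*c) rows.
    patches = (image[:gh * patch, :gw * patch]
               .reshape(gh, patch, gw, patch, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(gh * gw, -1))
    num_masked = int(mask_ratio * len(patches))
    mask = np.zeros(len(patches), dtype=bool)
    mask[rng.choice(len(patches), size=num_masked, replace=False)] = True
    return patches[~mask], patches[mask], mask

# Example: a 224x224 RGB image with 16x16 patches and 75% masking
# leaves 49 of 196 patches visible to the encoder.
visible, targets, mask = random_patch_mask(np.zeros((224, 224, 3)))
```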