Masked Frequency Modeling for Self-Supervised Visual Pre-Training
- URL: http://arxiv.org/abs/2206.07706v2
- Date: Tue, 25 Apr 2023 17:29:15 GMT
- Title: Masked Frequency Modeling for Self-Supervised Visual Pre-Training
- Authors: Jiahao Xie, Wei Li, Xiaohang Zhan, Ziwei Liu, Yew Soon Ong, Chen Change Loy
- Abstract summary: We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models.
MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum.
For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token.
- Score: 102.89756957704138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Masked Frequency Modeling (MFM), a unified frequency-domain-based
approach for self-supervised pre-training of visual models. Instead of randomly
inserting mask tokens to the input embeddings in the spatial domain, in this
paper, we shift the perspective to the frequency domain. Specifically, MFM
first masks out a portion of frequency components of the input image and then
predicts the missing frequencies on the frequency spectrum. Our key insight is
that predicting masked components in the frequency domain is better suited to
revealing underlying image patterns than predicting masked patches in the
spatial domain, due to the heavy spatial redundancy. Our findings suggest that
with the right configuration of mask-and-predict strategy, both the structural
information within high-frequency components and the low-level statistics among
low-frequency counterparts are useful in learning good representations. For the
first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese
framework can learn meaningful representations even using none of the
following: (i) extra data, (ii) extra model, (iii) mask token. Experimental
results on image classification and semantic segmentation, as well as several
robustness benchmarks show the competitive performance and advanced robustness
of MFM compared with recent masked image modeling approaches. Furthermore, we
also comprehensively investigate the effectiveness of classical image
restoration tasks for representation learning from a unified frequency
perspective and reveal their intriguing relations with our MFM approach.
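The mask-and-predict step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the circular mask shape, the `radius` parameter, the squared-error spectral loss, and all function names (`circular_mask`, `mask_frequencies`, `masked_frequency_loss`) are assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

def circular_mask(height, width, radius):
    """Boolean mask, True inside a centered circle (low frequencies after fftshift)."""
    ys, xs = np.ogrid[:height, :width]
    cy, cx = height // 2, width // 2
    return (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2

def mask_frequencies(image, radius, keep_low=True):
    """Zero out part of the centered spectrum (assumed circular mask).

    Returns the corrupted image, the full target spectrum, and the mask of kept frequencies.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    low = circular_mask(*image.shape, radius)
    kept = low if keep_low else ~low
    corrupted = np.fft.ifft2(np.fft.ifftshift(spectrum * kept)).real
    return corrupted, spectrum, kept

def masked_frequency_loss(predicted_image, target_spectrum, kept):
    """Mean squared spectral distance, computed only on the masked-out frequencies."""
    predicted_spectrum = np.fft.fftshift(np.fft.fft2(predicted_image))
    diff = np.abs(predicted_spectrum - target_spectrum)[~kept]
    return float(np.mean(diff ** 2))

# Toy usage on a random single-channel "image" (hypothetical data).
rng = np.random.default_rng(0)
image = rng.random((32, 32))
corrupted, spectrum, kept = mask_frequencies(image, radius=8, keep_low=True)

# A model would map `corrupted` to a prediction; two trivial stand-ins are compared here.
loss_identity = masked_frequency_loss(corrupted, spectrum, kept)  # predicts nothing new
loss_perfect = masked_frequency_loss(image, spectrum, kept)       # recovers the original
```

Setting `keep_low=False` masks the low frequencies instead, which corresponds to the other half of the trade-off the abstract mentions between high-frequency structure and low-frequency statistics.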
Related papers
- Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning [49.275450836604726]
We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances its efficacy for pre-training.
We employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input.
arXiv Detail & Related papers (2024-09-16T15:10:07Z)
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- Frequency-Adaptive Pan-Sharpening with Mixture of Experts [22.28680499480492]
We propose a novel Frequency Adaptive Mixture of Experts (FAME) learning framework for pan-sharpening.
Our method outperforms other state-of-the-art methods and shows strong generalization ability on real-world scenes.
arXiv Detail & Related papers (2024-01-04T08:58:25Z)
- Pre-training with Random Orthogonal Projection Image Modeling [32.667183132025094]
Masked Image Modeling (MIM) is a powerful self-supervised strategy for visual pre-training without the use of labels.
We propose an Image Modeling framework based on Random Orthogonal Projection Image Modeling (ROPIM).
ROPIM reduces spatial token information under a guaranteed bound on the noise variance and can be viewed as masking the entire spatial image area with locally varying masking degrees.
arXiv Detail & Related papers (2023-10-28T15:42:07Z)
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
- Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting [133.55037976429088]
We investigate the adversarial robustness of vision transformers equipped with BERT pretraining (e.g., BEiT, MAE).
A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods.
We propose a simple yet effective way to boost the adversarial robustness of MAE.
arXiv Detail & Related papers (2023-08-20T16:27:17Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Exploring the Coordination of Frequency and Attention in Masked Image Modeling [28.418445136155512]
Masked image modeling (MIM) has dominated self-supervised learning in computer vision.
We propose the Frequency & Attention-driven Masking and Throwing Strategy (FAMT), which can extract semantic patches and reduce the number of training patches.
FAMT can be seamlessly integrated as a plug-and-play module and surpasses previous works.
arXiv Detail & Related papers (2022-11-28T14:38:19Z)
- The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge$^2$-AE), for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.