Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning
- URL: http://arxiv.org/abs/2409.10362v1
- Date: Mon, 16 Sep 2024 15:10:07 GMT
- Title: Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning
- Authors: Amin Karimi Monsefi, Mengxi Zhou, Nastaran Karimi Monsefi, Ser-Nam Lim, Wei-Lun Chao, Rajiv Ramnath,
- Abstract summary: We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances its efficacy for pre-training.
We employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input.
- Score: 49.275450836604726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances its efficacy for pre-training. Prior work in this direction masks out pre-defined frequencies in the input image and employs a reconstruction loss to pre-train the model. While achieving promising results, such an implementation has two fundamental limitations as identified in our paper. First, using pre-defined frequencies overlooks the variability of image frequency responses. Second, pre-trained with frequency-filtered images, the resulting model needs relatively more data to adapt to naturally looking images during fine-tuning. To address these drawbacks, we propose FOurier transform compression with seLf-Knowledge distillation (FOLK), integrating two dedicated ideas. First, inspired by image compression, we adaptively select the masked-out frequencies based on image frequency responses, creating more suitable SSL tasks for pre-training. Second, we employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input, largely reducing the burden of downstream tasks. Our experimental results demonstrate the effectiveness of FOLK in achieving competitive performance to many state-of-the-art SSL methods across various downstream tasks, including image classification, few-shot learning, and semantic segmentation.
Related papers
- Self-Calibrated Variance-Stabilizing Transformations for Real-World Image Denoising [19.08732222562782]
Supervised deep learning has become the method of choice for image denoising.
We show that, contrary to popular belief, denoising networks specialized in the removal of Gaussian noise can be efficiently leveraged in favor of real-world image denoising.
arXiv Detail & Related papers (2024-07-24T16:23:46Z) - Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning, that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z) - Improving Adversarial Robustness of Masked Autoencoders via Test-time
Frequency-domain Prompting [133.55037976429088]
We investigate the adversarial robustness of vision transformers equipped with BERT pretraining (e.g., BEiT, MAE)
A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods.
We propose a simple yet effective way to boost the adversarial robustness of MAE.
arXiv Detail & Related papers (2023-08-20T16:27:17Z) - Learning to Mask and Permute Visual Tokens for Vision Transformer
Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z) - Masked Image Training for Generalizable Deep Image Denoising [53.03126421917465]
We present a novel approach to enhance the generalization performance of denoising networks.
Our method involves masking random pixels of the input image and reconstructing the missing information during training.
Our approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios.
arXiv Detail & Related papers (2023-03-23T09:33:44Z) - Frequency Dropout: Feature-Level Regularization via Randomized Filtering [24.53978165468098]
Deep convolutional neural networks are susceptible to picking up spurious correlations from the training signal.
We propose a training strategy, Frequency Dropout, to prevent convolutional neural networks from learning frequency-specific imaging features.
Our results suggest that the proposed approach does not only improve predictive accuracy but also improves robustness against domain shift.
arXiv Detail & Related papers (2022-09-20T16:42:21Z) - Masked Frequency Modeling for Self-Supervised Visual Pre-Training [102.89756957704138]
We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models.
MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum.
For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token.
arXiv Detail & Related papers (2022-06-15T17:58:30Z) - The Devil is in the Frequency: Geminated Gestalt Autoencoder for
Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM), termed Geminated Autoencoder (Ge$2$-AE) for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z) - Adaptive Frequency Learning in Two-branch Face Forgery Detection [66.91715092251258]
We propose Adaptively learn Frequency information in the two-branch Detection framework, dubbed AFD.
We liberate our network from the fixed frequency transforms, and achieve better performance with our data- and task-dependent transform layers.
arXiv Detail & Related papers (2022-03-27T14:25:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.