Rethinking Random Masking in Self Distillation on ViT
- URL: http://arxiv.org/abs/2506.10582v1
- Date: Thu, 12 Jun 2025 11:19:07 GMT
- Title: Rethinking Random Masking in Self Distillation on ViT
- Authors: Jihyeon Seong, Hyunkyung Han
- Abstract summary: This study examines the role of random masking in the self-distillation setting, focusing on the DINO framework. Specifically, random masking is applied exclusively to the student's global view, while the student's local views and the teacher's global view are preserved in their original, unmasked forms. Evaluated with DINO-Tiny on the mini-ImageNet dataset, random masking under this asymmetric setup yields more robust and fine-grained attention maps, ultimately enhancing downstream performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a wide range of vision tasks. In particular, self-distillation frameworks such as DINO have contributed significantly to these advances. Within such frameworks, random masking is often utilized to improve training efficiency and introduce regularization. However, recent studies have raised concerns that indiscriminate random masking may inadvertently eliminate critical semantic information, motivating the development of more informed masking strategies. In this study, we explore the role of random masking in the self-distillation setting, focusing on the DINO framework. Specifically, we apply random masking exclusively to the student's global view, while preserving the student's local views and the teacher's global view in their original, unmasked forms. This design leverages DINO's multi-view augmentation scheme to retain clean supervision while inducing robustness through masked inputs. We evaluate our approach using DINO-Tiny on the mini-ImageNet dataset and show that random masking under this asymmetric setup yields more robust and fine-grained attention maps, ultimately enhancing downstream performance.
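As a concrete illustration, below is a minimal PyTorch sketch of the asymmetric setup the abstract describes: random patch masking is applied only to the student's global crops, while the student's local crops and the teacher's inputs stay clean. The names (`student`, `teacher`, `dino_loss`, `mask_ratio`) and the pixel-level masking are illustrative assumptions, not the authors' released implementation (which may mask patch embeddings inside the ViT instead).

```python
import torch

def mask_random_patches(images, patch_size=16, mask_ratio=0.3):
    """Zero out a random subset of non-overlapping patches per image.

    Pixel-level masking keeps the sketch model-agnostic; the paper may
    instead mask patch embeddings inside the ViT.
    """
    B, C, H, W = images.shape
    ph, pw = H // patch_size, W // patch_size
    num_mask = int(ph * pw * mask_ratio)
    out = images.clone()
    for b in range(B):
        for i in torch.randperm(ph * pw)[:num_mask].tolist():
            r, c = divmod(i, pw)
            out[b, :, r * patch_size:(r + 1) * patch_size,
                      c * patch_size:(c + 1) * patch_size] = 0.0
    return out

def dino_step(student, teacher, global_crops, local_crops, dino_loss):
    """One asymmetric step: only the student's global views are masked."""
    with torch.no_grad():
        teacher_out = [teacher(g) for g in global_crops]       # clean targets
    student_globals = [student(mask_random_patches(g)) for g in global_crops]
    student_locals = [student(l) for l in local_crops]         # unmasked local views
    return dino_loss(teacher_out, student_globals + student_locals)
```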
Related papers
- MINR: Implicit Neural Representations with Masked Image Modelling [5.330266804358638]
Masked autoencoders (MAE) have shown significant promise in learning robust feature representations.
We introduce the masked implicit neural representations (MINR) framework that synergizes implicit neural representations with masked image modeling.
MINR learns a continuous function to represent images, enabling more robust and generalizable reconstructions irrespective of masking strategies.
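A toy sketch of the idea (hypothetical code, not the MINR implementation): an MLP maps pixel coordinates to RGB values and is fit only on the visible pixels, so predictions at masked coordinates fall out of the learned continuous function.

```python
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    """Toy implicit neural representation: (x, y) coordinate -> RGB."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, coords):  # coords: (N, 2) in [0, 1]^2
        return self.net(coords)

def fit_inr(image, visible_mask, steps=500):
    """Fit the INR on visible pixels only, then reconstruct the full image.

    image: (H, W, 3) float tensor; visible_mask: (H, W) bool tensor.
    """
    H, W, _ = image.shape
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H),
                            torch.linspace(0, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    targets = image.reshape(-1, 3)
    vis = visible_mask.reshape(-1)
    model = CoordMLP()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = ((model(coords[vis]) - targets[vis]) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    # Masked pixels are filled in by querying the continuous function.
    return model(coords).reshape(H, W, 3)
```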
arXiv Detail & Related papers (2025-07-30T06:12:57Z)
- Self-Guided Masked Autoencoder [16.96990728780005]
Masked Autoencoder (MAE) is a self-supervised approach for representation learning.
We propose a self-guided masked autoencoder, which internally generates an informed mask by utilizing its progress in patch clustering.
arXiv Detail & Related papers (2025-07-26T03:48:12Z)
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
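A rough sketch of this kind of data-independent masking, using a Gaussian low-pass filter as an assumed stand-in for ColorMAE's specific noise-color filters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def noise_filtered_mask(grid=14, mask_ratio=0.75, sigma=1.5, rng=None):
    """Data-independent binary mask from low-pass-filtered random noise.

    Smoothing white noise before thresholding yields spatially clustered
    ("colored-noise") mask patterns instead of i.i.d. random ones.
    """
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal((grid, grid))
    smooth = gaussian_filter(noise, sigma=sigma)    # low-pass filter
    k = int(grid * grid * mask_ratio)
    thresh = np.sort(smooth.ravel())[::-1][k - 1]   # mask the k largest values
    return smooth >= thresh                         # (grid, grid) boolean mask
```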
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization [40.78236375917571]
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning.
We introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that leverages end-to-end feedback from downstream tasks to learn an optimal masking strategy during pretraining.
arXiv Detail & Related papers (2024-02-28T07:37:26Z)
- Understanding Masked Autoencoders From a Local Contrastive Perspective [80.57196495601826]
Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies.
We introduce a new empirical framework, called Local Contrastive MAE, to analyze both reconstructive and contrastive aspects of MAE.
arXiv Detail & Related papers (2023-10-03T12:08:15Z)
- Masking Improves Contrastive Self-Supervised Learning for ConvNets, and Saliency Tells You Where [63.61248884015162]
We aim to alleviate the burden of incorporating the masking operation into the contrastive-learning framework for convolutional neural networks.
We propose to explicitly take saliency into consideration, constraining the masked regions to be more evenly distributed between the foreground and background.
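A minimal sketch of such saliency-balanced masking, assuming a precomputed per-patch foreground indicator (how the saliency map is obtained is left out):

```python
import numpy as np

def saliency_balanced_mask(saliency, mask_ratio=0.5, rng=None):
    """Sample a mask whose masked patches are split evenly between
    salient (foreground) and non-salient (background) patches.

    saliency: (N,) boolean array marking foreground patches.
    """
    rng = rng or np.random.default_rng()
    num_mask = int(saliency.shape[0] * mask_ratio)
    fg = np.flatnonzero(saliency)
    bg = np.flatnonzero(~saliency)
    n_fg = min(num_mask // 2, len(fg))          # half the budget on foreground
    n_bg = min(num_mask - n_fg, len(bg))        # remainder on background
    chosen = np.concatenate([
        rng.choice(fg, size=n_fg, replace=False),
        rng.choice(bg, size=n_bg, replace=False),
    ])
    mask = np.zeros(saliency.shape[0], dtype=bool)
    mask[chosen] = True
    return mask
```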
arXiv Detail & Related papers (2023-09-22T09:58:38Z)
- DPPMask: Masked Image Modeling with Determinantal Point Processes [49.65141962357528]
Masked Image Modeling (MIM) has achieved impressive representation performance by reconstructing randomly masked images.
We show that uniformly random masking widely used in previous works unavoidably loses some key objects and changes original semantic information.
To address this issue, we augment MIM with a new masking strategy, DPPMask.
Our method is simple yet effective and requires no extra learnable parameters when implemented within various frameworks.
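For illustration, a greedy log-determinant selection over patch features approximates the DPP idea of keeping a diverse, representative subset of patches; this is a hypothetical stand-in, not the paper's exact sampling procedure:

```python
import numpy as np

def greedy_diverse_keep(features, num_keep):
    """Greedy MAP approximation of a DPP over patch features.

    Iteratively keeps the patch that most increases the log-determinant of
    the similarity-kernel submatrix, favoring diverse patches; everything
    else is masked.
    """
    K = features @ features.T + 1e-6 * np.eye(len(features))  # PSD kernel
    kept = []
    for _ in range(num_keep):
        best_i, best_logdet = None, -np.inf
        for i in range(len(features)):
            if i in kept:
                continue
            sub = K[np.ix_(kept + [i], kept + [i])]
            sign, logdet = np.linalg.slogdet(sub)
            if sign > 0 and logdet > best_logdet:
                best_i, best_logdet = i, logdet
        kept.append(best_i)
    mask = np.ones(len(features), dtype=bool)  # True = masked
    mask[kept] = False
    return mask
```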
arXiv Detail & Related papers (2023-03-13T13:40:39Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
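A toy sketch of a differentiable mask generator in this spirit, using PyTorch's `F.gumbel_softmax` with the straight-through (`hard=True`) estimator; the architecture and objective here are assumptions, not AutoMAE's actual design:

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskGenerator(nn.Module):
    """Toy differentiable per-patch mask generator.

    Produces (keep, mask) logits from patch embeddings and samples a
    discrete mask with Gumbel-Softmax so that gradients can flow back
    into the generator (e.g., from an adversarial objective).
    """
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 2)  # logits for (keep, mask) per patch

    def forward(self, patch_embeddings, tau=1.0):
        logits = self.score(patch_embeddings)       # (B, N, 2)
        # hard=True yields a one-hot mask in the forward pass while keeping
        # the soft sample's gradient (straight-through estimator).
        sample = F.gumbel_softmax(logits, tau=tau, hard=True)
        return sample[..., 1]                       # (B, N), 1.0 = masked
```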
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Improving self-supervised representation learning via sequential adversarial masking [12.176299580413097]
Masking-based pretext tasks extend beyond NLP, serving as useful pretraining objectives in computer vision.
We propose a new framework that generates masks in a sequential fashion with different constraints on the adversary.
arXiv Detail & Related papers (2022-12-16T04:25:43Z)
- Self-Supervised Visual Representations Learning by Contrastive Mask Prediction [129.25459808288025]
We propose a novel contrastive mask prediction (CMP) task for visual representation learning.
MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions.
We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2.
arXiv Detail & Related papers (2021-08-18T02:50:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all listed content) and is not responsible for any consequences of its use.