MaskAnyNet: Rethinking Masked Image Regions as Valuable Information in Supervised Learning
- URL: http://arxiv.org/abs/2511.12480v1
- Date: Sun, 16 Nov 2025 07:11:33 GMT
- Title: MaskAnyNet: Rethinking Masked Image Regions as Valuable Information in Supervised Learning
- Authors: Jingshan Hong, Haigen Hu, Huihuang Zhang, Qianwei Zhou, Zhao Li,
- Abstract summary: MaskAnyNet combines masking with a relearning mechanism to exploit both visible and masked information. Experiments on CNN and Transformer backbones show consistent gains across multiple benchmarks.
- Score: 7.222969785370652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In supervised learning, traditional image masking faces two key issues: (i) discarded pixels are underutilized, leading to a loss of valuable contextual information; (ii) masking may remove small or critical features, especially in fine-grained tasks. In contrast, masked image modeling (MIM) has demonstrated that masked regions can be reconstructed from partial input, revealing that even incomplete data can exhibit strong contextual consistency with the original image. This highlights the potential of masked regions as sources of semantic diversity. Motivated by this, we revisit the image masking approach and propose to treat masked content as auxiliary knowledge rather than discarding it. To this end, we propose MaskAnyNet, which combines masking with a relearning mechanism to exploit both visible and masked information. Any model can easily be extended with an additional branch that jointly learns from the recomposed masked region. This approach leverages the semantic diversity of the masked regions to enrich features and preserve fine-grained details. Experiments on CNN and Transformer backbones show consistent gains across multiple benchmarks. Further analysis confirms that the proposed method improves semantic diversity through the reuse of masked content.
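The abstract describes the design only at a high level. Below is a minimal PyTorch sketch of the general idea, assuming a duplicated backbone for the masked branch, a 50% random patch mask, and fusion by feature concatenation; none of these details are confirmed by the paper, and the wrapper name is hypothetical.

```python
import copy

import torch
import torch.nn as nn


class MaskAnyWrapper(nn.Module):
    """Hypothetical sketch of a MaskAnyNet-style wrapper: instead of
    discarding masked pixels, a second branch relearns from them."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 patch: int = 16, mask_ratio: float = 0.5):
        super().__init__()
        self.visible_branch = backbone                 # any CNN/ViT extractor -> (B, feat_dim)
        self.masked_branch = copy.deepcopy(backbone)   # assumed: extra branch, same architecture
        self.patch, self.mask_ratio = patch, mask_ratio
        self.head = nn.Linear(2 * feat_dim, num_classes)  # assumed: fuse by concatenation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, _, H, W = x.shape
        ph, pw = H // self.patch, W // self.patch
        # Random per-image patch mask; True marks patches hidden from the visible branch.
        m = torch.rand(B, 1, ph, pw, device=x.device) < self.mask_ratio
        m = m.repeat_interleave(self.patch, dim=-2).repeat_interleave(self.patch, dim=-1)
        visible = x.masked_fill(m, 0.0)       # the usual masked input
        recomposed = x.masked_fill(~m, 0.0)   # the complement: the "recomposed" masked content
        f_vis = self.visible_branch(visible)
        f_msk = self.masked_branch(recomposed)
        return self.head(torch.cat([f_vis, f_msk], dim=-1))
```

How the mask is scheduled during training and how the branches are used at test time are likewise unspecified in the abstract; the concatenation head above is just one plausible choice.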
Related papers
- From Pixels to Components: Eigenvector Masking for Visual Representation Learning [55.567395509598065]
Predicting masked from visible parts of an image is a powerful self-supervised approach for visual representation learning.
We propose an alternative masking strategy that operates on a suitable transformation of the data rather than on the raw pixels.
arXiv Detail & Related papers (2025-02-10T10:06:46Z)
- High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation [109.19165503929992]
We present MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP.
After low-cost fine-tuning, MaskCLIP++ significantly improves the mask classification performance on multi-domain datasets.
We achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets.
arXiv Detail & Related papers (2024-12-16T05:44:45Z)
- MaskInversion: Localized Embeddings via Optimization of Explainability Maps [49.50785637749757]
MaskInversion generates a context-aware embedding for a query image region specified by a mask at test time.
It can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation.
arXiv Detail & Related papers (2024-07-29T14:21:07Z)
- Masking Improves Contrastive Self-Supervised Learning for ConvNets, and Saliency Tells You Where [63.61248884015162]
We aim to alleviate the burden of incorporating the masking operation into the contrastive-learning framework for convolutional neural networks.
We propose to explicitly take saliency into account, so that masked regions are more evenly distributed between foreground and background.
arXiv Detail & Related papers (2023-09-22T09:58:38Z)
- Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval [19.61947785487129]
We propose Mask for Semantics Completion (MASCOT), a semantics-based masked modeling approach.
MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks.
arXiv Detail & Related papers (2023-05-13T12:31:37Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- MixMask: Revisiting Masking Strategy for Siamese ConvNets [23.946791390657875]
This work introduces a novel filling-based masking approach, termed MixMask.
The proposed method replaces erased areas with content from a different image, effectively countering the information depletion seen in traditional masking methods (see the sketch after this list).
We empirically validate our framework's enhanced performance in areas such as linear probing, semi-supervised and supervised finetuning, object detection and segmentation.
arXiv Detail & Related papers (2022-10-20T17:54:03Z)
- Semantic-guided Multi-Mask Image Harmonization [10.27974860479791]
We propose a new semantic-guided multi-mask image harmonization task.
To solve it, we propose a novel way to edit inharmonious images by predicting a series of operator masks.
arXiv Detail & Related papers (2022-07-24T11:48:49Z)
- Image Inpainting by End-to-End Cascaded Refinement with Mask Awareness [66.55719330810547]
Inpainting arbitrary missing regions is challenging because learning valid features for various masked regions is nontrivial.
We propose a novel mask-aware inpainting solution that learns multi-scale features for missing regions in the encoding phase.
Our framework is validated both quantitatively and qualitatively via extensive experiments on three public datasets.
arXiv Detail & Related papers (2021-04-28T13:17:47Z)
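Of the masking strategies above, MixMask's filling-based variant is concrete enough to sketch (as referenced in its entry). A minimal PyTorch illustration follows, assuming a block-wise random mask, a 50% ratio, and in-batch partner images; the paper's actual mask shapes and mixing rule may differ.

```python
import torch


def mixmask(images: torch.Tensor, patch: int = 32, mask_ratio: float = 0.5):
    """Filling-based masking in the spirit of MixMask: erased patches are
    filled with the corresponding patches of another image in the batch
    rather than with zeros. All hyperparameters here are illustrative."""
    B, _, H, W = images.shape
    ph, pw = H // patch, W // patch
    # Boolean patch grid, upsampled to pixel resolution; True = replace.
    m = torch.rand(B, 1, ph, pw, device=images.device) < mask_ratio
    m = m.repeat_interleave(patch, dim=-2).repeat_interleave(patch, dim=-1)
    partner = images.roll(shifts=1, dims=0)  # assumed pairing: next image in the batch
    mixed = torch.where(m, partner, images)
    return mixed, m


# Usage: produce filled (rather than erased) views for a Siamese ConvNet,
# keeping the mask around for any mask-aware objective.
# mixed, mask = mixmask(batch)  # batch: (B, 3, 224, 224)
```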