Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis
- URL: http://arxiv.org/abs/2509.23054v2
- Date: Sat, 04 Oct 2025 07:50:36 GMT
- Title: Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis
- Authors: Ruilang Wang, Shuotong Xu, Bowen Liu, Runlin Huang, Donglong Chen, Weifeng Su
- Abstract summary: Mask What Matters is a controllable text-guided masking framework for self-supervised medical image analysis. It consistently outperforms existing MIM methods, achieving gains of up to +3.1 percentage points in classification accuracy, while using substantially lower overall masking ratios.
- Score: 2.6554246520306624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scarcity of annotated data in specialized domains such as medical imaging presents significant challenges to training robust vision models. While self-supervised masked image modeling (MIM) offers a promising solution, existing approaches largely rely on random high-ratio masking, leading to inefficiency and poor semantic alignment. Moreover, region-aware variants typically depend on reconstruction heuristics or supervised signals, limiting their adaptability across tasks and modalities. We propose Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis. By leveraging vision-language models for prompt-based region localization, our method flexibly applies differentiated masking to emphasize diagnostically relevant regions while reducing redundancy in background areas. This controllable design enables better semantic alignment, improved representation learning, and stronger cross-task generalizability. Comprehensive evaluation across multiple medical imaging modalities, including brain MRI, chest CT, and lung X-ray, shows that Mask What Matters consistently outperforms existing MIM methods (e.g., SparK), achieving gains of up to +3.1 percentage points in classification accuracy, +1.3 in box average precision (BoxAP), and +1.1 in mask average precision (MaskAP) for detection. Notably, it achieves these improvements with substantially lower overall masking ratios (e.g., 40% vs. 70%). This work demonstrates that controllable, text-driven masking can enable semantically aligned self-supervised learning, advancing the development of robust vision models for medical image analysis.
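The core idea of differentiated masking can be sketched in a few lines. The snippet below assumes a binary region map has already been produced by some text-prompted localizer (the paper uses a vision-language model; that component is not reproduced here), and simply applies a higher masking ratio to flagged patches than to background ones. The specific per-region ratios (`fg_ratio`, `bg_ratio`) are illustrative assumptions; the abstract only reports the resulting overall ratio (~40%).

```python
import numpy as np

def differentiated_mask(region_map, fg_ratio=0.7, bg_ratio=0.25, rng=None):
    """Build a patch-level mask that hides a higher fraction of
    foreground (diagnostically relevant) patches than background ones.

    region_map : (H, W) bool array, True where a text-guided localizer
                 flagged the region (assumed given).
    Returns a (H, W) bool array, True = patch is masked.
    """
    rng = np.random.default_rng(rng)
    mask = np.zeros(region_map.shape, dtype=bool)
    fg = np.flatnonzero(region_map)        # flat indices of relevant patches
    bg = np.flatnonzero(~region_map)       # flat indices of background patches
    mask.flat[rng.choice(fg, size=int(len(fg) * fg_ratio), replace=False)] = True
    mask.flat[rng.choice(bg, size=int(len(bg) * bg_ratio), replace=False)] = True
    return mask

# Example: 14x14 patch grid with a 6x6 "lesion" region
grid = np.zeros((14, 14), dtype=bool)
grid[4:10, 4:10] = True
m = differentiated_mask(grid, rng=0)
```

In this toy setup the overall masking ratio lands around 33%, well below a uniform 70%, while the lesion region is still masked aggressively enough to force the model to reconstruct it.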
Related papers
- WoundAmbit: Bridging State-of-the-Art Semantic Segmentation and Real-World Wound Care [1.7819099868722776]
Automated wound monitoring via mobile image capture can reduce in-person physician visits by enabling remote tracking of wound size. We benchmark state-of-the-art deep learning models from general-purpose vision, medical imaging, and top methods from public wound challenges. We demonstrate how our AI-driven wound size estimation framework, WoundAmbit, is integrated into a custom telehealth system.
arXiv Detail & Related papers (2025-04-08T16:25:59Z)
- AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking [5.844539603252746]
Masked image modeling (MIM) has shown effectiveness by reconstructing randomly masked images to learn detailed representations.
We propose AnatoMask, a novel MIM method that leverages reconstruction loss to dynamically identify and mask out anatomically significant regions.
arXiv Detail & Related papers (2024-07-09T00:15:52Z)
- MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder [26.830574964308962]
We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis.
We explore MAEs for zero-shot learning with crossed domains, which enhances the model's ability to learn from limited data.
Lastly, we validate that using language improves zero-shot performance for medical image analysis.
arXiv Detail & Related papers (2024-03-07T16:11:43Z)
- Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
- AMLP: Adaptive Masking Lesion Patches for Self-supervised Medical Image Segmentation [67.97926983664676]
Self-supervised masked image modeling has shown promising results on natural images.
However, directly applying such methods to medical images remains challenging.
We propose a novel self-supervised medical image segmentation framework, Adaptive Masking Lesion Patches (AMLP).
arXiv Detail & Related papers (2023-09-08T13:18:10Z)
- Diffusion Models for Counterfactual Generation and Anomaly Detection in Brain Images [39.94162291765236]
We present a weakly supervised method to generate a healthy version of a diseased image and then use it to obtain a pixel-wise anomaly map.
We employ a diffusion model trained on healthy samples and combine the Denoising Diffusion Probabilistic Model (DDPM) and the Denoising Diffusion Implicit Model (DDIM) at each step of the sampling process.
arXiv Detail & Related papers (2023-08-03T21:56:50Z)
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
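The central trick in AutoMAE, as summarized above, is making per-patch mask decisions differentiable so the mask generator can be trained adversarially. A minimal sketch of that mechanism, using PyTorch's straight-through Gumbel-Softmax (the generator network itself and the adversarial loss are omitted; the shapes and names here are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def sample_patch_mask(logits, tau=1.0):
    """Differentiably sample a binary mask/keep decision per patch
    via the straight-through Gumbel-Softmax trick.

    logits : (N, P) tensor of per-patch "mask me" scores, e.g. from
             an adversarially trained mask generator (hypothetical).
    Returns an (N, P) {0, 1} tensor; gradients flow back to logits.
    """
    # Two-class logits: [mask, keep] for each patch
    two_class = torch.stack([logits, -logits], dim=-1)
    y = F.gumbel_softmax(two_class, tau=tau, hard=True, dim=-1)
    return y[..., 0]  # probability-one-hot selection of the "mask" class

torch.manual_seed(0)
logits = torch.randn(2, 196, requires_grad=True)  # e.g. 14x14 ViT patch grid
mask = sample_patch_mask(logits)
```

Because `hard=True` returns exact one-hot samples while routing gradients through the soft distribution, the generator can learn *where* to mask from the reconstruction signal rather than masking uniformly at random.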
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- MPS-AMS: Masked Patches Selection and Adaptive Masking Strategy Based Self-Supervised Medical Image Segmentation [46.76171191827165]
We propose a self-supervised medical image segmentation method based on masked patch selection and an adaptive masking strategy, named MPS-AMS.
Our proposed method greatly outperforms the state-of-the-art self-supervised baselines.
arXiv Detail & Related papers (2023-02-27T11:57:06Z)
- Masked Image Modeling Advances 3D Medical Image Analysis [0.41674286453548476]
Masked image modeling (MIM) has gained considerable attention due to its capacity to learn from vast amounts of unlabeled data.
This paper shows that MIM can also advance 3D medical image analysis in addition to natural images.
arXiv Detail & Related papers (2022-04-25T15:16:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.