Fine-tuning a Multiple Instance Learning Feature Extractor with Masked
Context Modelling and Knowledge Distillation
- URL: http://arxiv.org/abs/2403.05325v1
- Date: Fri, 8 Mar 2024 14:04:30 GMT
- Title: Fine-tuning a Multiple Instance Learning Feature Extractor with Masked
Context Modelling and Knowledge Distillation
- Authors: Juan I. Pisula and Katarzyna Bozek
- Abstract summary: We propose to improve downstream MIL classification performance by fine-tuning the feature extractor model using Masked Context Modelling with Knowledge Distillation.
A single epoch of the proposed task suffices to increase the downstream performance of the feature-extractor model in a MIL scenario, while the fine-tuned model is considerably smaller than its teacher and requires a fraction of its compute.
- Score: 0.21756081703275998
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The first step in Multiple Instance Learning (MIL) algorithms for Whole Slide
Image (WSI) classification consists of tiling the input image into smaller
patches and computing their feature vectors produced by a pre-trained feature
extractor model. Feature extractor models that were pre-trained with
supervision on ImageNet have proven to transfer well to this domain, however,
this pre-training task does not take into account that visual information in
neighboring patches is highly correlated. Based on this observation, we propose
to improve downstream MIL classification performance by fine-tuning the feature
extractor model using Masked Context Modelling with Knowledge Distillation. In
this task, the feature extractor model is fine-tuned by predicting masked
patches in a bigger context window. Since reconstructing the input image would
require a powerful image generation model, and our goal is not to generate
realistic-looking image patches, we instead predict the feature vectors
produced by a larger teacher network. A single epoch of the proposed task
suffices to increase the downstream performance of the feature-extractor model
when used in a MIL scenario, and it can even outperform the teacher model's
downstream performance while being considerably smaller and requiring a
fraction of its compute.
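The objective above can be sketched in miniature: mask a subset of patch feature vectors within a context window, have the student fill them in, and regress the masked positions onto a frozen teacher's features. Everything below is an illustrative stand-in (the shapes, the mask pattern, and the mean-of-visible-patches "student"), not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: a context window of 16 patches, 8-dim features.
num_patches, feat_dim = 16, 8
teacher_feats = rng.normal(size=(num_patches, feat_dim))  # frozen teacher targets

# Mask half of the patches (a real setup would sample the mask randomly).
mask = np.arange(num_patches) % 2 == 0

def student_predict(feats, mask):
    """Hypothetical stand-in for the student + context model: fill each
    masked position with the mean of the visible patch features."""
    pred = feats.copy()
    pred[mask] = feats[~mask].mean(axis=0)
    return pred

pred = student_predict(teacher_feats, mask)

# Distillation loss: mean squared error against the teacher's feature
# vectors, computed only on the masked positions.
loss = float(np.mean((pred[mask] - teacher_feats[mask]) ** 2))
```

Because the regression target is a feature vector rather than pixels, no image generation model is needed; the same loss shape applies regardless of what stands in for the student.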
Related papers
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage several relatively small, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z) - Heterogeneous Generative Knowledge Distillation with Masked Image
Modeling [33.95780732124864]
Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models.
We develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion.
Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models.
arXiv Detail & Related papers (2023-09-18T08:30:55Z) - Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
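The Gumbel-Softmax trick mentioned above relaxes categorical sampling into a differentiable operation, which is what allows a mask generator to be trained jointly by gradient descent. A minimal NumPy sketch of the forward sampling only (an autodiff framework would supply the gradients; the logits are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Sample a relaxed one-hot vector: add Gumbel noise to the logits,
    then apply a softmax at temperature tau (lower tau gives a sample
    closer to a hard categorical draw)."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

probs = gumbel_softmax(np.array([2.0, 0.5, 0.1]), tau=0.5)  # sums to 1
```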
arXiv Detail & Related papers (2023-03-12T05:28:55Z) - Exploring the Coordination of Frequency and Attention in Masked Image Modeling [28.418445136155512]
Masked image modeling (MIM) has dominated self-supervised learning in computer vision.
We propose the Frequency & Attention-driven Masking and Throwing Strategy (FAMT), which can extract semantic patches and reduce the number of training patches.
FAMT can be seamlessly integrated as a plug-and-play module and surpasses previous works.
arXiv Detail & Related papers (2022-11-28T14:38:19Z) - Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have been prevailing paradigms for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z) - Exploring The Role of Mean Teachers in Self-supervised Masked
Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE.
RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training.
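The EMA (exponential moving average) teacher referenced here follows a standard update rule: the teacher's weights track a slowly moving average of the student's weights. A minimal sketch with plain dicts (the parameter name and momentum value are illustrative):

```python
def ema_update(teacher, student, momentum=0.999):
    """Move each teacher parameter toward the student's by (1 - momentum)."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

# Toy one-parameter example; momentum lowered to 0.9 so the movement is visible.
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, momentum=0.9)  # teacher["w"] -> 0.9
```

With momentum close to 1, the teacher changes slowly and provides a stable target for the student's reconstructions.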
arXiv Detail & Related papers (2022-10-05T08:08:55Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision
Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
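The clustered-attention idea can be sketched as follows: pool the key/value tokens into a few cluster centroids, then attend over the centroids instead of all tokens, shrinking the score matrix from seq_len x seq_len to seq_len x n_clusters. This is a toy NumPy sketch with a single hard k-means-style assignment step, not the ClusTR implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_clusters, d = 32, 4, 8

q = rng.normal(size=(seq_len, d))
kv = rng.normal(size=(seq_len, d))  # keys and values share tokens here for brevity

# One hard assignment step toward randomly chosen centroids
# (a real method would iterate or learn the clustering).
centroids = kv[rng.choice(seq_len, size=n_clusters, replace=False)]
assign = np.argmin(((kv[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)

# Mean-pool the key/value tokens within each cluster.
clustered = np.stack([kv[assign == c].mean(axis=0) for c in range(n_clusters)])

# Attention over n_clusters tokens instead of seq_len tokens.
scores = q @ clustered.T / np.sqrt(d)                    # (seq_len, n_clusters)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ clustered                                # (seq_len, d)
```

Each query still produces a d-dim output, but the attention cost scales with the number of clusters rather than the sequence length.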
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via
Feature Distillation [42.37533586611174]
Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances.
In this paper, we show that the inferior fine-tuning performance of pre-training approaches can be significantly improved by a simple post-processing.
arXiv Detail & Related papers (2022-05-27T17:59:36Z) - Counterfactual Generative Networks [59.080843365828756]
We propose to decompose the image generation process into independent causal mechanisms that we train without direct supervision.
By exploiting appropriate inductive biases, these mechanisms disentangle object shape, object texture, and background.
We show that the counterfactual images can improve out-of-distribution robustness with only a marginal drop in performance on the original classification task.
arXiv Detail & Related papers (2021-01-15T10:23:12Z) - Multi-task pre-training of deep neural networks for digital pathology [8.74883469030132]
We first assemble and transform many digital pathology datasets into a pool of 22 classification tasks and almost 900k images.
We show that our models used as feature extractors either improve significantly over ImageNet pre-trained models or provide comparable performance.
arXiv Detail & Related papers (2020-05-05T08:50:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.