Cluster and Predict Latent Patches for Improved Masked Image Modeling
- URL: http://arxiv.org/abs/2502.08769v2
- Date: Mon, 17 Feb 2025 09:54:11 GMT
- Title: Cluster and Predict Latent Patches for Improved Masked Image Modeling
- Authors: Timothée Darcet, Federico Baldassarre, Maxime Oquab, Julien Mairal, Piotr Bojanowski
- Abstract summary: We introduce CAPI, a novel pure-MIM framework that relies on the prediction of latent clusterings.
Our approach leverages a clustering-based loss that is stable to train and exhibits promising scaling properties.
Our ViT-L backbone, CAPI, achieves 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes.
- Abstract: Masked Image Modeling (MIM) offers a promising approach to self-supervised representation learning; however, existing MIM models still lag behind the state of the art. In this paper, we systematically analyze target representations, loss functions, and architectures to introduce CAPI, a novel pure-MIM framework that relies on the prediction of latent clusterings. Our approach leverages a clustering-based loss that is stable to train and exhibits promising scaling properties. Our ViT-L backbone, CAPI, achieves 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes, substantially outperforming previous MIM methods and approaching the performance of the current state of the art, DINOv2. We release all our code and models.
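To make the idea concrete, here is a minimal sketch of a clustering-based latent-prediction loss in the spirit of CAPI. The prototype matrix, temperatures, and normalization choices below are illustrative assumptions, not the paper's exact recipe: a teacher's patch features are softly assigned to learnable prototypes, and the student is trained with cross-entropy to predict those assignments for masked patches.

```python
import torch
import torch.nn.functional as F

def clustering_mim_loss(student_preds, teacher_feats, prototypes,
                        student_temp=0.1, teacher_temp=0.06):
    # student_preds: (N, D) predictions for masked patch positions
    # teacher_feats: (N, D) teacher features at the same positions
    # prototypes:    (K, D) learnable cluster centers (e.g. an nn.Parameter)
    protos = F.normalize(prototypes, dim=-1)

    # Teacher side: soft cluster assignments serve as targets (no gradient).
    with torch.no_grad():
        t_logits = F.normalize(teacher_feats, dim=-1) @ protos.T
        targets = F.softmax(t_logits / teacher_temp, dim=-1)

    # Student side: predict the assignment distribution of each masked patch.
    s_logits = F.normalize(student_preds, dim=-1) @ protos.T
    log_probs = F.log_softmax(s_logits / student_temp, dim=-1)

    # Cross-entropy between teacher assignments and student predictions.
    return -(targets * log_probs).sum(dim=-1).mean()
```

In a full pipeline the teacher would typically be an EMA copy of the student and the prototypes a learnable parameter; both of those choices are assumptions here, not details taken from the abstract.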
Related papers
- Model Inversion Attacks Through Target-Specific Conditional Diffusion Models [54.69008212790426]
Model inversion attacks (MIAs) aim to reconstruct private images from a target classifier's training set, thereby raising privacy concerns in AI applications.
Previous GAN-based MIAs tend to suffer from inferior generative fidelity due to GANs' inherent flaws and biased optimization within the latent space.
We propose Diffusion-based Model Inversion (Diff-MI) attacks to alleviate these issues.
arXiv Detail & Related papers (2024-07-16T06:38:49Z)
- Advancing Cross-Domain Generalizability in Face Anti-Spoofing: Insights, Design, and Metrics [10.631157315662607]
This paper presents a novel perspective for enhancing anti-spoofing performance in zero-shot data domain generalization.
Going a step beyond previous frame-wise spoofing prediction, we introduce a nuanced metric that aggregates frame-level probabilities into a video-wise prediction.
Our final model outperforms existing state-of-the-art methods across the datasets.
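The abstract leaves the aggregation rule unspecified; below is a minimal hedged sketch of one plausible reading, in which per-frame spoof probabilities are averaged into a single video-level score. The mean rule and the 0.5 threshold are assumptions for illustration only.

```python
import numpy as np

def video_spoof_score(frame_probs, threshold=0.5):
    # frame_probs: shape (num_frames,), each entry P(spoof) for one frame.
    # Mean aggregation and the fixed threshold are illustrative assumptions.
    score = float(np.mean(frame_probs))
    return score, score >= threshold

# Example: five frame-level probabilities -> one video-level decision.
score, is_spoof = video_spoof_score(np.array([0.2, 0.8, 0.7, 0.9, 0.6]))
```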
arXiv Detail & Related papers (2024-06-18T04:15:22Z)
- MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations [16.885965702357314]
MIM-Refiner is a contrastive learning boost for pre-trained MIM models.
It refines the features of MIM models from subpar to state-of-the-art, off-the-shelf quality.
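As a rough illustration of how a contrastive objective can be layered on top of pre-trained MIM features, here is a standard symmetric InfoNCE loss between two augmented views. The temperature and the choice of where to attach the head are assumptions; this is not MIM-Refiner's exact objective.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temp=0.2):
    # z1, z2: (B, D) embeddings of two augmented views of the same images.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temp                       # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    # Matching views sit on the diagonal; treat them as the positive class.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```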
arXiv Detail & Related papers (2024-02-15T16:46:16Z)
- Improving Pixel-based MIM by Reducing Wasted Modeling Capability [77.99468514275185]
We propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction.
To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures.
Our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
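Below is a hedged sketch of one way to fuse multi-level features for pixel reconstruction: features from several encoder layers are combined with a learned softmax weighting before the decoder. The weighting scheme and projection are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    """Fuse shallow- and deep-layer patch features before pixel decoding."""

    def __init__(self, num_layers, dim):
        super().__init__()
        # One learnable scalar weight per tapped encoder layer (assumption).
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(dim, dim)

    def forward(self, layer_feats):
        # layer_feats: list of (B, N, D) tensors from selected encoder layers.
        w = torch.softmax(self.layer_weights, dim=0)
        fused = sum(wi * f for wi, f in zip(w, layer_feats))
        return self.proj(fused)  # fed to the pixel-reconstruction decoder
```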
arXiv Detail & Related papers (2023-08-01T03:44:56Z)
- Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
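The summary does not spell out the distillation objective; one plausible, hedged reading of a parameter-free CLIP distillation is to use CLIP's zero-shot class probabilities on target data as soft labels, with no extra trainable modules. The temperature and KL formulation below are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_distillation_loss(student_logits, clip_img_feats, clip_text_feats,
                           temp=0.01):
    # clip_img_feats:  (B, D) precomputed CLIP image embeddings (frozen)
    # clip_text_feats: (C, D) CLIP text embeddings of class-name prompts
    with torch.no_grad():
        sim = (F.normalize(clip_img_feats, dim=-1)
               @ F.normalize(clip_text_feats, dim=-1).T)
        teacher = F.softmax(sim / temp, dim=-1)      # zero-shot soft labels
    # KL divergence pulls the student toward CLIP's zero-shot predictions.
    return F.kl_div(F.log_softmax(student_logits, dim=-1), teacher,
                    reduction="batchmean")
```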
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
- PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878]
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT.
This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction.
We propose a remarkably simple and effective method, PixMIM, that entails two strategies.
arXiv Detail & Related papers (2023-03-04T13:38:51Z)
- CAE v2: Context Autoencoder with CLIP Target [63.61868058214267]
Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches.
Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM.
To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio.
arXiv Detail & Related papers (2022-11-17T18:58:33Z)
- Back to MLP: A Simple Baseline for Human Motion Prediction [59.18776744541904]
This paper tackles the problem of human motion prediction: forecasting future body poses from previously observed sequences.
We show that the performance of prior approaches can be surpassed by a lightweight, purely MLP-based architecture with only 0.14M parameters.
An exhaustive evaluation on Human3.6M, AMASS and 3DPW datasets shows that our method, which we dub siMLPe, consistently outperforms all other approaches.
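For flavor, here is a generic lightweight MLP baseline for motion prediction. It is not siMLPe itself (which, for instance, operates on DCT-transformed sequences), and the layer sizes and pose dimensions below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TinyMotionMLP(nn.Module):
    """Map T_in observed poses to T_out future poses with a plain MLP."""

    def __init__(self, t_in=50, t_out=25, pose_dim=66, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(t_in * pose_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, t_out * pose_dim),
        )
        self.t_out, self.pose_dim = t_out, pose_dim

    def forward(self, past):                 # past: (B, T_in, pose_dim)
        out = self.net(past.flatten(1))      # flatten time and pose dims
        return out.view(-1, self.t_out, self.pose_dim)
```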
arXiv Detail & Related papers (2022-07-04T16:35:58Z)
- Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation [42.37533586611174]
Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances.
In this paper, we show that the inferior fine-tuning performance of contrastive pre-training approaches can be significantly improved by a simple post-processing step.
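As a hedged sketch of feature-distillation-style post-processing, the snippet below matches a student's patch features to a frozen teacher's after per-token normalization. The choice of layer norm and smooth-L1 loss is an assumption about the general recipe, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def feature_distill_loss(student_feats, teacher_feats):
    # Both inputs: (B, N, D) patch-level features.
    # Normalize each token before matching (an illustrative choice).
    s = F.layer_norm(student_feats, student_feats.shape[-1:])
    with torch.no_grad():
        t = F.layer_norm(teacher_feats, teacher_feats.shape[-1:])
    return F.smooth_l1_loss(s, t)
```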
arXiv Detail & Related papers (2022-05-27T17:59:36Z)