Stochastic positional embeddings improve masked image modeling
- URL: http://arxiv.org/abs/2308.00566v2
- Date: Tue, 27 Feb 2024 18:59:14 GMT
- Title: Stochastic positional embeddings improve masked image modeling
- Authors: Amir Bar, Florian Bordes, Assaf Shocher, Mahmoud Assran, Pascal
Vincent, Nicolas Ballas, Trevor Darrell, Amir Globerson, Yann LeCun
- Abstract summary: Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images.
We propose to incorporate location uncertainty into MIM by using stochastic positional embeddings (StoP).
StoP reduces overfitting to location features and guides the model toward learning features that are more robust to location uncertainties.
- Score: 95.03491875332034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Image Modeling (MIM) is a promising self-supervised learning approach
that enables learning from unlabeled images. Despite its recent success,
learning good representations through MIM remains challenging because it
requires predicting the right semantic content in accurate locations. For
example, given an incomplete picture of a dog, we can guess that there is a
tail, but we cannot determine its exact location. In this work, we propose to
incorporate location uncertainty into MIM by using stochastic positional
embeddings (StoP). Specifically, we condition the model on stochastic masked
token positions drawn from a Gaussian distribution. StoP reduces overfitting to
location features and guides the model toward learning features that are more
robust to location uncertainties. Quantitatively, StoP improves MIM
performance on a variety of downstream tasks, including $+1.7\%$ on ImageNet
linear probing using ViT-B, and $+2.5\%$ for ViT-H using $1\%$ of the data.
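To make the mechanism concrete, the sketch below illustrates the core idea from the abstract: during pre-training, the positional embeddings of masked tokens are perturbed with Gaussian noise, so the model cannot rely on exact target locations. This is a minimal illustrative sketch, not the authors' released implementation; the learned positional table, the single noise scale `sigma`, and all names are assumptions.

```python
# Minimal sketch of stochastic positional embeddings (StoP) for masked tokens.
# Assumptions (not from the paper's released code): a learned positional table,
# one Gaussian noise scale `sigma`, and noise applied only to masked positions.
import torch
import torch.nn as nn


class StochasticPositionalEmbedding(nn.Module):
    def __init__(self, num_patches: int, dim: int, sigma: float = 0.25):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
        self.sigma = sigma  # std of the Gaussian location noise (illustrative)

    def forward(self, tokens: torch.Tensor, masked_idx: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch/mask tokens; masked_idx: (B, N) bool, True where masked
        pos = self.pos_embed.expand(tokens.size(0), -1, -1).clone()
        if self.training:
            noise = torch.randn_like(pos) * self.sigma
            # Perturb positions only where the content is masked, so the model
            # cannot overfit to the exact location of the targets it predicts.
            pos = torch.where(masked_idx.unsqueeze(-1), pos + noise, pos)
        return tokens + pos


# Usage (illustrative): 196 patch tokens of dim 768 with a 75% masking ratio.
if __name__ == "__main__":
    emb = StochasticPositionalEmbedding(num_patches=196, dim=768)
    x = torch.randn(2, 196, 768)
    mask = torch.rand(2, 196) < 0.75
    print(emb(x, mask).shape)  # torch.Size([2, 196, 768])
```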
Related papers
- No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images [100.80376573969045]
NoPoSplat is a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from multi-view images.
Our model achieves real-time 3D Gaussian reconstruction during inference.
This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios.
arXiv Detail & Related papers (2024-10-31T17:58:22Z)
- Keypoint Aware Masked Image Modelling [0.34530027457862006]
KAMIM improves top-1 linear probing accuracy from 16.12% to 33.97% and fine-tuning accuracy from 76.78% to 77.3% on the ImageNet-1K dataset with a ViT-B trained for the same number of epochs.
We also analyze the learned representations of a ViT-B trained with KAMIM and observe that they behave similarly to those learned with contrastive learning, showing longer attention distances and homogeneous self-attention across layers.
arXiv Detail & Related papers (2024-07-18T19:41:46Z)
- The Entropy Enigma: Success and Failure of Entropy Minimization [30.083332640328642]
Entropy minimization (EM) is frequently used to increase the accuracy of classification models when they're faced with new data at test time.
We analyze why EM works when adapting a model for a few steps and why it eventually fails after adapting for many steps.
We present a method for solving a practical problem: estimating a model's accuracy on an arbitrary dataset without having access to its labels. (A generic sketch of the test-time entropy-minimization recipe appears after this list.)
arXiv Detail & Related papers (2024-05-08T12:26:15Z)
- Pre-training with Random Orthogonal Projection Image Modeling [32.667183132025094]
Masked Image Modeling (MIM) is a powerful self-supervised strategy for visual pre-training without the use of labels.
We propose an image modeling framework based on random orthogonal projection, termed Random Orthogonal Projection Image Modeling (ROPIM).
ROPIM reduces spatial token information under a guaranteed bound on the noise variance and can be viewed as masking the entire spatial image area with locally varying masking degrees.
arXiv Detail & Related papers (2023-10-28T15:42:07Z)
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
- Image to Sphere: Learning Equivariant Features for Efficient Pose Prediction [3.823356975862006]
Methods that predict a single point estimate handle the pose of objects with symmetries poorly and cannot represent uncertainty.
We propose a novel mapping of features from the image domain to the 3D rotation manifold.
We demonstrate the effectiveness of our method at object orientation prediction, and achieve state-of-the-art performance on the popular PASCAL3D+ dataset.
arXiv Detail & Related papers (2023-02-27T16:23:19Z)
- Uncertainty-Aware Camera Pose Estimation from Points and Lines [101.03675842534415]
Perspective-n-Point-and-Line (PnPL) aims at fast, accurate, and robust camera localization with respect to a 3D model from 2D-3D feature coordinates.
arXiv Detail & Related papers (2021-07-08T15:19:36Z)
- 6D Camera Relocalization in Ambiguous Scenes via Continuous Multimodal Inference [67.70859730448473]
We present a multimodal camera relocalization framework that captures ambiguities and uncertainties.
We predict multiple camera pose hypotheses as well as the respective uncertainty for each prediction.
We introduce a new dataset specifically designed to foster camera localization research in ambiguous environments.
arXiv Detail & Related papers (2020-04-09T20:55:06Z)
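As a side note on the entropy-minimization entry above ("The Entropy Enigma"), the recipe it studies is standard in test-time adaptation: treat the model's predictive entropy on unlabeled test batches as a loss and take a few gradient steps, usually on a small subset of parameters. The sketch below is a generic TENT-style illustration of that recipe, not the paper's own method; the learning rate, step count, and the choice to adapt only normalization affine parameters are assumptions.

```python
# Generic test-time entropy minimization sketch (TENT-style); hyperparameters
# and the choice of which parameters to adapt are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    # Mean Shannon entropy of the softmax predictions over the batch.
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(probs * log_probs).sum(dim=-1).mean()


def adapt_on_batch(model: nn.Module, x: torch.Tensor, lr: float = 1e-3, steps: int = 1) -> nn.Module:
    # Adapt only normalization affine parameters (a common choice); everything
    # else stays frozen. Assumes the model contains BatchNorm or LayerNorm layers.
    params = []
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm2d, nn.LayerNorm)):
            params += [p for p in (m.weight, m.bias) if p is not None]
    optimizer = torch.optim.SGD(params, lr=lr)
    model.train()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = entropy_loss(model(x))
        loss.backward()
        optimizer.step()
    return model


# Usage (illustrative): adapt a pretrained classifier on one unlabeled test batch.
# model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
# adapt_on_batch(model, torch.randn(32, 3, 224, 224))
```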
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.