R-MAE: Regions Meet Masked Autoencoders
- URL: http://arxiv.org/abs/2306.05411v2
- Date: Thu, 4 Jan 2024 19:31:50 GMT
- Title: R-MAE: Regions Meet Masked Autoencoders
- Authors: Duy-Kien Nguyen, Vaibhav Aggarwal, Yanghao Li, Martin R. Oswald,
Alexander Kirillov, Cees G. M. Snoek, Xinlei Chen
- Abstract summary: We explore regions as a potential visual analogue of words for self-supervised image representation learning.
Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions.
- Score: 113.73147144125385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we explore regions as a potential visual analogue of words for
self-supervised image representation learning. Inspired by Masked Autoencoding
(MAE), a generative pre-training baseline, we propose masked region
autoencoding to learn from groups of pixels or regions. Specifically, we design
an architecture which efficiently addresses the one-to-many mapping between
images and regions, while being highly effective especially with high-quality
regions. When integrated with MAE, our approach (R-MAE) demonstrates consistent
improvements across various pre-training datasets and downstream detection and
segmentation benchmarks, with negligible computational overheads. Beyond the
quantitative evaluation, our analysis indicates the models pre-trained with
masked region autoencoding unlock the potential for interactive segmentation.
The code is provided at https://github.com/facebookresearch/r-mae.
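As a rough illustration of the masked region autoencoding idea described in the abstract, below is a minimal, hypothetical PyTorch sketch. It is not the R-MAE architecture from the paper or repository: the module, its dimensions, the per-patch MSE/BCE losses, and the single-region-per-image simplification (the paper explicitly designs for the one-to-many mapping between images and regions) are all assumptions made for illustration.

```python
# Hypothetical sketch only: a toy masked-region-autoencoding objective on top
# of an MAE-style encoder/decoder. Names, sizes, and losses are illustrative
# assumptions, not the R-MAE architecture from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedRegionAutoencoderSketch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        self.pixel_embed = nn.Linear(3 * patch_size ** 2, dim)   # RGB patch tokens
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        dec_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, 1)
        self.pixel_head = nn.Linear(dim, 3 * patch_size ** 2)    # MAE pixel target
        self.region_head = nn.Linear(dim, patch_size ** 2)       # binary region target

    def patchify(self, x):
        # (B, C, H, W) -> (B, N, C * p * p)
        p = self.patch_size
        b, c, h, w = x.shape
        x = x.reshape(b, c, h // p, p, w // p, p)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * p * p)

    def forward(self, images, region_map, mask_ratio=0.75):
        # images: (B, 3, H, W); region_map: (B, 1, H, W) float binary map of ONE
        # region (the paper handles many regions per image; this sketch does not).
        pixels = self.patchify(images)                 # (B, N, 3*p*p)
        regions = self.patchify(region_map)            # (B, N, p*p)
        b, n, _ = pixels.shape
        num_keep = int(n * (1 - mask_ratio))

        # MAE-style random masking: keep a random subset of patches per image.
        ids = torch.rand(b, n, device=images.device).argsort(dim=1)
        keep, drop = ids[:, :num_keep], ids[:, num_keep:]
        tokens = self.pixel_embed(pixels) + self.pos_embed
        vis = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        enc = self.encoder(vis)

        # Re-insert mask tokens at the dropped positions and decode the full grid.
        mask_tok = self.mask_token.expand(b, n - num_keep, -1)
        full = torch.cat([enc, mask_tok], dim=1)
        restore = ids.argsort(dim=1)
        full = torch.gather(full, 1, restore.unsqueeze(-1).expand(-1, -1, full.size(-1)))
        dec = self.decoder(full + self.pos_embed)

        # Reconstruction losses on masked patches only: pixels (MSE) + region (BCE).
        sel = lambda t: torch.gather(t, 1, drop.unsqueeze(-1).expand(-1, -1, t.size(-1)))
        loss_pix = F.mse_loss(self.pixel_head(sel(dec)), sel(pixels))
        loss_reg = F.binary_cross_entropy_with_logits(self.region_head(sel(dec)), sel(regions))
        return loss_pix + loss_reg


# Usage with random stand-ins for an image batch and its region maps.
model = MaskedRegionAutoencoderSketch()
images = torch.randn(2, 3, 224, 224)
region = (torch.rand(2, 1, 224, 224) > 0.5).float()   # stand-in for a real region map
loss = model(images, region)
loss.backward()
```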
Related papers
- MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning [9.540487697801531]
MMEarth is a diverse multi-modal pretraining dataset at global scale.
We propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images.
arXiv Detail & Related papers (2024-05-04T23:16:48Z) - Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z) - Region-Based Representations Revisited [34.01784145403097]
We show that recent class-agnostic segmenters like SAM can be effectively combined with strong unsupervised representations like DINOv2.
The compactness of the representation also makes it well-suited to video analysis and other problems requiring inference across many images. (A toy mask-pooling sketch of this idea is given after this list.)
arXiv Detail & Related papers (2024-02-04T05:33:04Z) - Tokenize Anything via Prompting [65.93061853439512]
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything.
We train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters.
We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context.
arXiv Detail & Related papers (2023-12-14T17:01:02Z) - CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding.
CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches.
With the simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z) - Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z) - Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have been prevailing paradigms for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z) - i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? [26.146459754995597]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain.
This paper aims to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability.
In addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space.
arXiv Detail & Related papers (2022-10-20T17:59:54Z) - Refine and Represent: Region-to-Object Representation Learning [55.70715883351945]
We present Region-to-Object Representation Learning (R2O) which unifies region-based and object-centric pretraining.
R2O operates by training an encoder to dynamically refine region-based segments into object-centric masks.
After pretraining on ImageNet, R2O models are able to surpass existing state-of-the-art in unsupervised object segmentation.
arXiv Detail & Related papers (2022-08-25T01:44:28Z) - Semantic Segmentation With Multi Scale Spatial Attention For Self
Driving Cars [2.7317088388886384]
We present a novel neural network using multi-scale feature fusion for accurate and efficient semantic image segmentation.
We use a ResNet-based feature extractor, dilated convolutional layers in the downsampling path, and atrous convolutional layers in the upsampling path, merging them with a concatenation operation.
A new attention module is proposed to encode more contextual information and enhance the receptive field of the network.
arXiv Detail & Related papers (2020-06-30T20:19:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.