Object-wise Masked Autoencoders for Fast Pre-training
- URL: http://arxiv.org/abs/2205.14338v1
- Date: Sat, 28 May 2022 05:13:45 GMT
- Title: Object-wise Masked Autoencoders for Fast Pre-training
- Authors: Jiantao Wu and Shentong Mo
- Abstract summary: We show that current masked image encoding models learn the underlying relationship between all objects in the whole scene, instead of a single object representation.
We introduce a novel object selection and division strategy to drop non-object patches for learning object-wise representations by selective reconstruction with interested region masks.
Experiments on four commonly-used datasets demonstrate the effectiveness of our model in reducing the compute cost by 72% while achieving competitive performance.
- Score: 13.757095663704858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised pre-training for images without labels has recently achieved
promising performance in image classification. The success of transformer-based
methods, ViT and MAE, draws the community's attention to the design of backbone
architecture and self-supervised task. In this work, we show that current
masked image encoding models learn the underlying relationship between all
objects in the whole scene, instead of a single object representation.
Therefore, those methods spend a lot of compute time on self-supervised
pre-training. To solve this issue, we introduce a novel object selection and
division strategy to drop non-object patches for learning object-wise
representations by selective reconstruction with interested region masks. We
refer to this method as ObjMAE. Extensive experiments on four commonly-used
datasets demonstrate the effectiveness of our model in reducing the compute
cost by 72% while achieving competitive performance. Furthermore, we
investigate the inter-object and intra-object relationship and find that the
latter is crucial for self-supervised pre-training.
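The idea of dropping non-object patches before masking can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a binary object mask is already available (the abstract does not say how it is obtained), and the helper names, the coverage threshold of 0.5, and the mask ratio of 0.75 are hypothetical choices made only for illustration. The division strategy mentioned in the abstract is not reproduced here.

```python
import random


def patch_object_coverage(mask, patch):
    """Fraction of object pixels in each patch, in row-major patch order.

    `mask` is a 2D list of 0/1 values whose height and width are assumed
    to be divisible by `patch` (an assumption of this sketch).
    """
    height, width = len(mask), len(mask[0])
    assert height % patch == 0 and width % patch == 0
    coverage = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            total = sum(
                mask[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            coverage.append(total / (patch * patch))
    return coverage


def select_object_patches(mask, patch, threshold=0.5):
    """Keep indices of patches that are mostly object (hypothetical threshold)."""
    coverage = patch_object_coverage(mask, patch)
    return [i for i, frac in enumerate(coverage) if frac >= threshold]


def split_visible_masked(object_ids, mask_ratio=0.75, seed=0):
    """Randomly split object patches into encoder input and reconstruction targets."""
    rng = random.Random(seed)
    ids = list(object_ids)
    rng.shuffle(ids)
    n_masked = int(len(ids) * mask_ratio)
    return sorted(ids[n_masked:]), sorted(ids[:n_masked])


# Toy 8x8 image whose object occupies the top-left 4x4 corner.
demo_mask = [[1 if r < 4 and c < 4 else 0 for c in range(8)] for r in range(8)]
object_ids = select_object_patches(demo_mask, patch=2)
visible, masked = split_visible_masked(object_ids)
print("object patches:", object_ids)
print("visible:", visible, "masked:", masked)
```

In this toy case only 4 of the 16 patches are kept as object patches, and the encoder would process just one of them; the remaining three object patches are the only reconstruction targets. The background patches are never encoded or reconstructed, which is the kind of saving the abstract attributes to object-wise pre-training.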
Related papers
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction [6.798515070856465]
We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD).
Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR).
arXiv Detail & Related papers (2023-11-08T16:59:26Z)
- LOCATE: Self-supervised Object Discovery via Flow-guided Graph-cut and Bootstrapped Self-training [13.985488693082981]
We propose a self-supervised object discovery approach that leverages motion and appearance information to produce high-quality object segmentation masks.
We demonstrate the effectiveness of our approach, named LOCATE, on multiple standard video object segmentation, image saliency detection, and object segmentation benchmarks.
arXiv Detail & Related papers (2023-08-22T07:27:09Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare [84.80956484848505]
MegaPose is a method to estimate the 6D pose of novel objects, that is, objects unseen during training.
First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects.
Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
arXiv Detail & Related papers (2022-12-13T19:30:03Z)
- Exploring Target Representations for Masked Autoencoders [78.57196600585462]
We show that a careful choice of the target representation is unnecessary for learning good representations.
We propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher.
A proposed method to perform masked knowledge distillation with bootstrapped teachers (dBOT) outperforms previous self-supervised methods by nontrivial margins.
arXiv Detail & Related papers (2022-09-08T16:55:19Z)
- The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge$^2$-AE), for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z)
- SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
- Object-Aware Cropping for Self-Supervised Learning [21.79324121283122]
We show that self-supervised learning based on the usual random cropping performs poorly on such datasets.
We propose replacing one or both of the random crops with crops obtained from an object proposal algorithm.
Using this approach, which we call object-aware cropping, results in significant improvements over scene cropping on classification and object detection benchmarks.
arXiv Detail & Related papers (2021-12-01T07:23:37Z)
- Learning Co-segmentation by Segment Swapping for Retrieval and Discovery [67.6609943904996]
The goal of this work is to efficiently identify visually similar patterns from a pair of images.
We generate synthetic training pairs by selecting object segments in an image and copy-pasting them into another image.
We show our approach provides clear improvements for artwork details retrieval on the Brueghel dataset.
arXiv Detail & Related papers (2021-10-29T16:51:16Z)
- Instance Localization for Self-supervised Detection Pretraining [68.24102560821623]
We propose a new self-supervised pretext task, called instance localization.
We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning.
Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection.
arXiv Detail & Related papers (2021-02-16T17:58:57Z)