Refine and Represent: Region-to-Object Representation Learning
- URL: http://arxiv.org/abs/2208.11821v1
- Date: Thu, 25 Aug 2022 01:44:28 GMT
- Authors: Akash Gokul, Konstantinos Kallidromitis, Shufan Li, Yusuke Kato,
Kazuki Kozuka, Trevor Darrell, and Colorado J Reed
- Abstract summary: We present Region-to-Object Representation Learning (R2O) which unifies region-based and object-centric pretraining.
R2O operates by training an encoder to dynamically refine region-based segments into object-centric masks.
After pretraining on ImageNet, R2O models surpass the existing state of the art in unsupervised object segmentation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works in self-supervised learning have demonstrated strong performance
on scene-level dense prediction tasks by pretraining with object-centric or
region-based correspondence objectives. In this paper, we present
Region-to-Object Representation Learning (R2O) which unifies region-based and
object-centric pretraining. R2O operates by training an encoder to dynamically
refine region-based segments into object-centric masks and then jointly learns
representations of the contents within the mask. R2O uses a "region refinement
module" to group small image regions, generated using a region-level prior,
into larger regions which tend to correspond to objects by clustering
region-level features. As pretraining progresses, R2O follows a
region-to-object curriculum which encourages learning region-level features
early on and gradually progresses to train object-centric representations.
Representations learned using R2O lead to state-of-the-art performance in
semantic segmentation on PASCAL VOC (+0.7 mIoU) and Cityscapes (+0.4 mIoU) and
in instance segmentation on MS COCO (+0.3 mask AP). Further, after pretraining
on ImageNet, R2O models surpass the existing state of the art in unsupervised
object segmentation on the Caltech-UCSD Birds 200-2011 dataset (+2.9 mIoU)
without any further training. We provide the code/models from this work at
https://github.com/KKallidromitis/r2o.
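The abstract describes two mechanisms: a refinement step that clusters region-level features into larger, object-like groups, and a curriculum that gradually shifts from many small regions to a few object-centric masks. The following sketch illustrates that idea with a plain k-means grouping and a linear cluster-count schedule; the function names, the schedule, and the use of k-means are illustrative assumptions, not R2O's actual learned module:

```python
import numpy as np

def refine_regions(region_feats, num_objects, iters=10, seed=0):
    """Group region-level feature vectors into `num_objects` clusters
    with a small k-means loop; the cluster assignments play the role
    of coarse object masks. Illustrative sketch only -- R2O's region
    refinement module is trained, not a fixed k-means."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen regions.
    idx = rng.choice(len(region_feats), size=num_objects, replace=False)
    centroids = region_feats[idx].astype(float)
    assign = np.zeros(len(region_feats), dtype=int)
    for _ in range(iters):
        # Assign each region to its nearest centroid.
        dists = np.linalg.norm(region_feats[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)
        # Update each centroid; keep the old one if its cluster empties.
        for k in range(num_objects):
            members = region_feats[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return assign

def cluster_schedule(epoch, total_epochs, start_k=64, end_k=4):
    """Region-to-object curriculum: linearly decay the cluster count so
    early epochs group features into many small regions and later epochs
    into a few object-like masks. The linear decay and the endpoint
    values are hypothetical."""
    frac = epoch / max(total_epochs - 1, 1)
    return int(round(start_k + frac * (end_k - start_k)))
```

For example, with `total_epochs=10` the schedule starts at 64 clusters and ends at 4, so the same `refine_regions` call produces progressively coarser, more object-centric groupings as pretraining progresses.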
Related papers
- Background Activation Suppression for Weakly Supervised Object
Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
A new paradigm has emerged that generates a foreground prediction map to achieve pixel-level localization.
This paper presents two astonishing experimental observations on the object localization learning process.
arXiv Detail & Related papers (2023-09-22T15:44:10Z)
- R-MAE: Regions Meet Masked Autoencoders [113.73147144125385]
We explore regions as a potential visual analogue of words for self-supervised image representation learning.
Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions.
arXiv Detail & Related papers (2023-06-08T17:56:46Z)
- Region-Enhanced Feature Learning for Scene Semantic Segmentation [19.20735517821943]
We propose using regions as the intermediate representation of point clouds instead of fine-grained points or voxels to reduce the computational burden.
We design a region-based feature enhancement (RFE) module, which consists of a Semantic-Spatial Region Extraction stage and a Region Dependency Modeling stage.
Our REFL-Net achieves a 1.8% mIoU gain on ScanNetV2 and a 1.7% mIoU gain on S3DIS with negligible computational cost.
arXiv Detail & Related papers (2023-04-15T06:35:06Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose RegionLearner, a module for video-text learning that takes the structure of objects into account during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Region Similarity Representation Learning [94.88055458257081]
Region Similarity Representation Learning (ReSim) is a new approach to self-supervised representation learning for localization-based tasks.
ReSim learns both regional representations for localization as well as semantic image-level representations.
We show how ReSim learns representations which significantly improve the localization and classification performance compared to a competitive MoCo-v2 baseline.
arXiv Detail & Related papers (2021-03-24T00:42:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.